I’ve been working on a number of technical projects recently and wanted to share a small fix for an issue that doesn’t always have a great solution.
Recently we rolled out a feature called “Games You’ll Love”, which is just one front-end element of a much larger project.
This feature provides personalized game recommendations based on product correlation data and historical user behavior. If you visit a page without any history, you’ll find a series of recommendations based on the genre or game you’re looking at, built from aggregate data. If you’re a logged-in user, these recommendations become personalized and may even filter out your purchase history so you’re not being recommended irrelevant content.
Part of the requirements was that each element actually be tracked, so that the system making these recommendations can learn and improve. Tracking also allows for A/B testing of different logic on the backend. However, to do this, the URLs must contain tracking parameters. There are some other solutions I’m looking at, but the requirement at the time was that the landing URL had to contain the parameters on page load to capture this data.
This feature was both helpful and challenging from an SEO perspective.
- Pro: The GYL feature provides an internal linking mechanism that cross-links related content across the site in a very logical manner. (Very similar to a related-post plugin, but more sophisticated than simple keyword matching.)
- Con: It created a near-infinite number of duplicate URLs by appending a large combination of parameters to the end of each URL.
It linked to URLs internally in a fashion similar to this:
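A hypothetical example of such a URL — the path and parameter names here are invented for illustration, not the actual ones:

```
http://www.example.com/games/some-game/?src=gyl&pos=1&test=b
```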
This is a bit of a canonicalization, duplication, and crawl budget issue. The canonical tag is one way of solving it; so is simply blocking the parameters in robots.txt. However, I didn’t want to depend solely on the canonical tag to solve this, and the robots.txt solution breaks the internal linking benefit of the related links.
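For reference, the robots.txt approach might look like this, assuming a hypothetical tracking parameter named src — this is the option I rejected, since it stops crawlers from following the links at all:

```
User-agent: *
Disallow: /*?*src=
```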
I wanted clean URLs in the HTML code. What to do?
Some Solve This With Cloaking
This isn’t a new or uncommon problem. Major ecommerce websites like Amazon face this same problem with their product recommendation system.
If you go to the World War Z product page on Amazon, you’ll see a “Customers Who Bought This Item Also Bought” feature.
If you view source, you’ll see this.
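The markup looks roughly like this — the product slug and ID are placeholders here, but the ref=pd_sim_b_1 parameter is the piece being discussed:

```html
<!-- Abridged; product path is a placeholder -->
<a href="http://www.amazon.com/World-War-Z/dp/XXXXXXXXXX/ref=pd_sim_b_1">World War Z</a>
```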
That ref=pd_sim_b_1 is the concatenated tracking parameter. I’m assuming it communicates the product-similarity feature, position one, and some other variable denoted by the “b”. I don’t know exactly what the “b” is for, but I can make some guesses.
However, switch your user agent to GoogleBot and refresh the page. No tracking parameters…
The version of the site Google is crawling doesn’t have this issue with the dirty internal links.
I think this is a good example of “whitehat” cloaking, even though Google says there’s no such thing. This cloaking example is well known.
I wanted to find a way to solve this without having to cloak.
I’m looking into a better cookie-based solution, but for right now this is one way we solved it. It’s not perfect and has its own set of downsides. However, the URL in the HTML is clean without cloaking, and we still manage to pass the parameters in a way that works within the requirements of the system.
If you look at the code for GYL, you’ll see:
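A sketch of that markup — data-trk is the actual attribute, while the URL and the JSON keys here are illustrative:

```html
<a href="/games/some-game/" data-trk='{"src":"gyl","pos":1}'>Some Game</a>
```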
The link is clean, but there is a data-trk attribute, which stores all of the tracking parameters in JSON.
This data is grabbed on the outbound click and used to rebuild the URL on the way out of the page.
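A minimal sketch of that outbound click handler — the data-trk attribute is the one described above, but the function name and the rest of the wiring are illustrative, not the production code:

```javascript
// Grab the raw tracking payload stored on the link.
function getTrkPayload(link) {
  return link.getAttribute('data-trk');
}

// Guarded so the sketch also loads outside a browser.
if (typeof document !== 'undefined') {
  document.addEventListener('click', function (event) {
    var target = event.target;
    var link = target.closest ? target.closest('a[data-trk]') : null;
    if (link) {
      var payload = getTrkPayload(link); // e.g. '{"src":"gyl","pos":1}'
      // ...parse the payload and rebuild the outbound URL with it
    }
  });
}
```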
For example, the JSON data can be parsed out using a regex:
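One simple way to do that, assuming the payload is flat JSON with string or number values — the function name and key names are illustrative:

```javascript
// Pull flat key/value pairs out of the tracking JSON with a regex.
// Only handles simple string or number values, which is all a
// tracking payload like this needs.
function parseTrkJson(json) {
  var params = {};
  var re = /"([^"]+)"\s*:\s*(?:"([^"]*)"|(\d+))/g;
  var match;
  while ((match = re.exec(json)) !== null) {
    // match[2] is a quoted string value, match[3] a bare number
    params[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return params;
}
```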
And this type of data can be appended to URLs and redirected:
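A sketch of that final step — the helper names and parameter keys are illustrative; returning the built URL keeps the logic usable outside the browser:

```javascript
// Serialize the tracking pairs into a query string.
function toQueryString(params) {
  return Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
}

// Append the tracking data to the clean URL and redirect on the way out.
function redirectWithParams(href, params) {
  var tracked = href + (href.indexOf('?') === -1 ? '?' : '&') + toQueryString(params);
  if (typeof window !== 'undefined') {
    window.location.href = tracked; // actual redirect in the browser
  }
  return tracked;
}
```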
I’d love to hear any thoughts on this approach.