Canonicalization has always been an issue for SEOs. Years ago Google was notoriously bad at working out some pretty basic issues, but these days it’s not too bad with the simple ones, and provides easy tools to help sort them out. Gone are the days when site.com, www.site.com, www.site.com/index.php etc. were the biggest canonicalization problems. They are rarely problems these days and are easily fixed.
However, there are many other problems, some common and some less so, which are often overlooked. Canonicalization has always mattered, and with duplicate content being a big factor in Panda, now is the time to sort these problems out.
Just to be clear, a canonicalization problem is when there are multiple versions of the same page, or close variations of it, on the same domain. That’s a type of duplicate content, although duplicate content also covers content repeated across domains. Cross domain duplicate content isn’t a canonicalization problem (usually) so I’ll be excluding that. The word “canonical” means definitive, the original, the main version. So when there are two pieces of identical or similar content, the canonical version is the one you want Google to know about. The other needs to be either removed or marked with an instruction that it is not the canonical version.
Here are a few problems that I run into now and then, with some possible solutions. This is by no means a definitive list of problems and solutions, but it’s a start.
Layered navigation (sometimes called filtered, faceted or guided navigation) is a wonderful thing in ecommerce sites which I personally love using as a user. It’s become more popular over the last few years, in part due to the rising popularity of Magento, which has layered navigation out of the box. Most enterprise search software packages also include some sort of layered navigation option. However, when you add a facet to your search, you often add a new URL. There is effectively an infinite number of pages, many containing the same or similar content.
This can be both an opportunity and a problem. By carefully managing canonicalization, you can create a large number of useful pages while eliminating useless pages. For example, say you are selling bikes and have 3 facets: price range, type (mountain, racing), and age (child, teen, adult). Pages on “mountain bikes for teens between $500 and $750” and “mountain bikes for teens between $750 and $1200” are not likely to be useful. Those pages should be eliminated. Generally the best way is to use the canonical tag. However, a page on “mountain bikes for teens” may well be useful and may attract some high quality long tail searches. So this variation should be allowed to live. The idea is to pick which facets add value from a search perspective and which don’t. Price rarely adds value. The other facets are very dependent on your niche. If you are unsure, it’s safer to leave a facet out, or consult an expert.
This can be non-trivial to implement technically in some cases, but it can be worthwhile. I’m not aware of a Magento extension which handles this well. If one exists, please let me know!
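As a rough illustration of the facet idea, here is a minimal Python sketch. It assumes facets arrive as query-string style name/value pairs, that only "type" and "age" are worth indexing, and that the listing lives at a hypothetical /bikes path. None of these come from a particular platform. It works out which URL a filtered page should name in its canonical tag.

```python
from urllib.parse import urlencode

# Assumption: only these facets add search value; price is deliberately left out.
INDEXABLE_FACETS = ("type", "age")


def canonical_facet_url(base_url, facets):
    """Return the canonical URL for a faceted listing page.

    Facets not listed in INDEXABLE_FACETS (for example price) are dropped.
    The remaining facets are emitted in a fixed order, so the same selection
    always maps to the same URL regardless of the order the user clicked.
    """
    kept = [(name, facets[name]) for name in INDEXABLE_FACETS if name in facets]
    if not kept:
        return base_url
    return base_url + "?" + urlencode(kept)
```

With this, both price-range variations of "mountain bikes for teens" would point at the single "mountain bikes for teens" URL, while a page filtered by price alone would point back at the unfiltered listing.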
Ecommerce Product Pages
This applies to other general CMS scenarios as well at times, but the problems are most common in ecommerce shopping cart software. This is a bit fuzzy as it varies a lot depending on your ecommerce platform and configuration. Some problems I’ve seen:
Pages in Multiple Categories
Often it makes sense to have the same page in multiple categories. For example, on a bike shop, you might have a mountain bike category and a teen category. So you could end up in a scenario where you have the same content at:
Some ecommerce platforms solve this by having a default category so the problem above doesn’t exist. That is a mixed blessing and can cause other usability and IA issues beyond the scope of this discussion. The solution when there are two URLs is to nominate a default category and use the canonical tag. Your software needs to have support for this, and it can be tough to explain to editors.
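The default category approach can be sketched in a few lines of Python. The URL layout (/category/product-slug), the category names and the function names are all assumptions made for illustration. Your platform will have its own structure.

```python
from html import escape


def canonical_product_url(base_url, product_slug, categories, default_category):
    """Return the single URL that every category copy of a product points to."""
    if default_category not in categories:
        raise ValueError("default category must be one the product belongs to")
    return "%s/%s/%s" % (base_url.rstrip("/"), default_category, product_slug)


def canonical_link_tag(url):
    """Build the canonical tag to place in the <head> of every copy."""
    return '<link rel="canonical" href="%s">' % escape(url, quote=True)
```

Whichever category the visitor browsed through, the page would carry the same tag, so only the default category URL is treated as the main version.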
Multiple Versions of the Product Pages
This is pretty fuzzy, but can and does happen. For example, I saw a site recently where there were standard product pages, but then also a near-identical version (without the menu, etc) as a popup that would appear from a link in the checkout and a few other places.
Don’t do it. If you must for whatever reason, use a canonical tag on the rogue page pointing back to the main page. A simpler solution, not quite as good but good enough, is to add a meta noindex tag to the rogue page. Chances are rogue pages will attract no inbound links, so noindex won’t do any harm.
Different URLs for the Same Page
This is typically sloppy programming. Some examples are:
also appears as
Another one I see a lot is
however, 10384 is the actual key, and product-name is just there for vanity reasons and is ignored by the software. You could type in
and the exact same content would appear.
For the first one, if the system is well programmed then product.php will never appear anywhere, so the spiders will be oblivious to it and there’s no problem. However, this isn’t always the case. The solution is to make it the case. There are other workarounds but they are all inferior.
For the second scenario, while that is quite common, in my experience software that does that invariably does it well and variations of the vanity string are never seen, so I wouldn’t worry about it. It should be fine. There are some edge cases where problems can occur, but I’m trying not to turn this into a 100,000 word book (although I suspect you could write 100,000 words on canonicalization). One of the edge cases is where smart alecs like me change the string to something funny so an observant webmaster will see some bizarre page requests appearing in their Google Analytics report! I mentioned my habit of doing this from time to time to Danny at FIRST and it turns out he does exactly the same thing!
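If the vanity string edge case ever does matter, one common approach is to have the application look up the real slug for the key and redirect anything else to it. The sketch below is only an illustration: the /id/slug path layout and the lookup table are assumptions, not how any particular package works.

```python
import re

# Hypothetical catalogue: product key -> the one slug treated as canonical.
PRODUCT_SLUGS = {10384: "product-name"}

# Assumed URL layout: /<key>/<vanity-slug>
PRODUCT_PATH = re.compile(r"^/(\d+)/([^/]*)$")


def redirect_target(path):
    """Return the canonical path to 301 to, or None if no redirect is needed."""
    match = PRODUCT_PATH.match(path)
    if match is None:
        return None  # not a product URL
    product_id = int(match.group(1))
    slug = PRODUCT_SLUGS.get(product_id)
    if slug is None or match.group(2) == slug:
        return None  # unknown product, or already canonical
    return "/%d/%s" % (product_id, slug)
```

A request with a made-up vanity string would then be sent to the real one rather than served as a second copy of the page.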
A quite common problem is tracking codes of various types. Some of these include:
http://www.site.com/?affid=1234 if you have an affiliate program
www.site.com/?utm_source=somesource&utm_medium=somemedium&utm_campaign=somecampaign if you are using Google Analytics campaign tracking (which you probably should be). Adobe Site Catalyst and other packages have similar codes.
There are other scenarios where various redundant query strings like that are appended.
There are a few ways to solve this. By far the easiest is to go into Google Webmaster Central and tell Google to ignore that query string. 5 minutes, done. There are other solutions: you can use 301s, canonical tags, replacing ‘?’ with ‘#’ (which can be done with Google Analytics with a minor tweak), and probably others I can’t think of, but they all feel like more work and more error prone to me.
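For the 301 or canonical tag route, the core step is working out the URL without the tracking codes. Here is a minimal Python sketch using the standard library. The list of parameters to strip is an assumption and would need to match whatever tracking your site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters carry tracking data only and never change the page.
TRACKING_PARAMS = {"affid"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Return the URL with tracking parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS and not name.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

Parameters that do change the content (a colour or size option, say) are left alone, so only the redundant variations collapse onto the clean URL.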
The web is fundamentally case sensitive. www.site.com/page.html and www.site.com/Page.html are different URLs. Unfortunately, Microsoft’s IIS server treats them as the same. This can encourage sloppy case insensitive programming which on a *nix server would show up as an error. I’ve seen sites with internal links pointing to both www.site.com/page.aspx and www.site.com/Page.aspx. These pages could then be crawled by Google and potentially appear as duplicate content.
Slightly off topic, this can cause problems with robots.txt. I once worked with a client who had a page which it was quite important not to have indexed. I added www.site.com/secretpage.html to robots.txt. The site was running on IIS. Unfortunately, there were a number of both internal and external links pointing to www.site.com/Secretpage.html, www.site.com/SecretPage.html, www.site.com/SECRETPAGE.html and so on. Given robots.txt is case sensitive, I ended up adding more and more case variants to the robots.txt and eventually the page was de-indexed. This is a nasty scenario, and I’m not aware of a workaround. Please let me know if you are (apart from “don’t use IIS”!).
To the best of my knowledge, there’s no easy way to make IIS case sensitive. There might be some ISAPI solution, and it can be done at an application level, but in practice both those solutions are generally not feasible. The solution here is simple good coding practice and hope for the best. There are probably some advanced things you could do, but in practice they would be more work than good coding practice. As for the robots.txt problem, it’s got me stumped.
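Good coding practice is easier to enforce if you can spot the offenders. As a small sketch, assuming you already have a list of internal link URLs from a crawl or a log export (how you collect them is up to you), the following Python groups URLs that differ only by letter case:

```python
from collections import defaultdict


def case_variants(urls):
    """Group URLs that differ only by letter case.

    Returns a dict mapping each lowercased URL to the sorted list of distinct
    spellings found, keeping only URLs seen with more than one spelling.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }
```

Anything this reports is a candidate for fixing at the source, by making every internal link use one spelling.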
Out of the box, WordPress is notorious for having canonical issues. A new blog post can appear at:
- www.blog.com/2011/post-name.html (or similar depending on configuration)
- www.blog.com/2011/12 (the archives)
and a few more. So which is the canonical version?
In the case of WordPress it’s pretty easy – just install the Yoast SEO plugin for WordPress. For other blog software which will typically have similar problems, that’s a bit trickier. You are going to have to do some homework or call in some experts.
Content Distribution Networks
A CDN (content distribution network) often uses CNAMEs to map many subdomains to the relevant local data centre. Even a single subdomain, such as the one used by the Amazon AWS CDN, is still potentially duplicate content. So, the same content could be coming from
- Use the cross domain canonical tag
- Configure your CDN to send the rel="canonical" HTTP header
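For reference, the canonical HTTP header is a Link header whose value names the main URL and carries rel="canonical". How you set it depends entirely on your CDN or origin server, so the Python below only sketches building the header. The function name and the example file are hypothetical.

```python
def canonical_link_header(canonical_url):
    """Return the (name, value) pair for an HTTP header declaring the canonical URL."""
    if ">" in canonical_url or "<" in canonical_url:
        raise ValueError("URL must not contain angle brackets")
    return ("Link", '<%s>; rel="canonical"' % canonical_url)
```

This is mainly useful for files that have no HTML head to put a canonical tag in, such as PDFs or images served from the CDN.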
By Mark Baartse, Consulting Director