Drupal, duplicate content, and you

Does Google's "duplicate content penalty" harm Drupal sites? No! Here's why.

For years, Drupal has enjoyed a solid reputation as a search engine friendly CMS. It generates relatively clean, standards-compliant HTML out of the box; syncs up the important TITLE tag with semantically useful H1 and H2 tags in the body of each page; and provides short, human-readable URLs with plentiful options for customization. (Anecdotal evidence: several years back, I wrote a post on my Drupal-powered blog that mentioned the name of the company I worked for. Within two weeks, my blog post ranked higher than the company's own web site on Google.)

Recently, I've witnessed a number of discussions where people expressed concern about the way Drupal generates the human-readable URLs that help make it Google-friendly. In particular, they were worried about Google's dreaded Duplicate Content Penalty, a system designed to keep spammers from flooding Google with the same content at dozens (or hundreds!) of URLs. There's a lot of confusion floating around, so for the geeks in the crowd (and the not-so-geeky interested in learning how things work behind the scenes), I thought it would be useful to give a guided tour of how Drupal manages and generates URLs.

Ground Zero: index.php

Every page generated by Drupal has a unique "path" that's used to identify it internally. Individual pieces of content live at paths like node/1, node/2, and so on. Unique administration pages get paths like admin/settings/files or admin/content/comments. Other modules like Views, Poll, and so on can add other paths.

It's a good start, but at this stage Drupal's URL structure is still as ugly as any other PHP web app. Why? All page requests are routed through the index.php script at Drupal's heart, and the "path" of the desired page is passed along as a query parameter at the end of the URL: http://www.example.com/index.php?q=node/1 is one such example. In the screenshot below, I'm accessing an article on the Drupal.org web site using this basic URL structure.

[Screenshot: no-clean-urls.jpg]
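For the geeks: here is a minimal Python sketch of what "passing the path along" amounts to. It is not Drupal's actual code (Drupal is written in PHP), and the function name internal_path is made up for illustration. It simply pulls the internal path out of the q parameter of an index.php-style URL.

```python
from urllib.parse import urlsplit, parse_qs


def internal_path(url: str) -> str:
    """Return the internal path carried in the ?q= parameter of a URL.

    Hypothetical helper for illustration; not part of Drupal.
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first "q" value.
    # In this sketch an empty string stands in for "no path given".
    return query.get("q", [""])[0]


if __name__ == "__main__":
    print(internal_path("http://www.example.com/index.php?q=node/1"))
```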

Clean URLs to the rescue!

Fortunately, few Drupal sites use those ugly URLs. The vast majority of web servers (Apache, IIS, and many of the smaller players) can rewrite incoming URLs, routing simple URLs like http://www.example.com/node/1 through the index.php script automatically. Drupal is configured to take advantage of this rewriting automatically, and Drupal 6.0 and later will check during installation that your web server supports the feature.

In the screenshot below, I'm accessing the same page on Drupal.org using the "clean" version of the URL: it's just the site's domain, followed by the path, without any of the ugly index.php business cluttering things up.

[Screenshot: clean-urls.jpg]
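The rewriting itself happens in the web server, not in Drupal, but the idea is easy to sketch. The Python below is only an illustration of the mapping, under the assumption that the whole path after the domain becomes the q parameter and any existing query string is kept. The function name to_index_php is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_index_php(clean_url: str) -> str:
    """Map a clean URL onto the index.php?q= form it stands for.

    Illustrative only: a real web server does this with rewrite rules.
    """
    parts = urlsplit(clean_url)
    path = parts.path.lstrip("/")
    # Keep any query string the visitor supplied (for example a pager).
    query = f"q={path}&{parts.query}" if parts.query else f"q={path}"
    return urlunsplit(
        (parts.scheme, parts.netloc, "/index.php", query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_index_php("http://www.example.com/node/1"))
```

Either form of the URL ends up asking index.php for the same internal path, which is exactly why the same page is reachable at two addresses.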

But wait, there's more!

Eliminating the ugly cruft only gets us halfway there. We still have content at relatively meaningless paths like node/1 and node/2. That's where Drupal's path module comes in. It allows site administrators to define aliases for any path on the web site, turning URLs like http://www.example.com/node/1 into http://www.example.com/about-us. In the final screenshot, below, I'm accessing the same article on Drupal.org using its path alias.

[Screenshot: path-alias.jpg]
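Conceptually, a path alias is just an entry in a lookup table. Here is a rough Python sketch of that idea, assuming a simple in-memory dictionary. The table contents and the function names are invented for illustration, and the real path module stores its aliases in the database.

```python
# Hypothetical alias table: friendly alias -> internal path.
ALIASES = {
    "about-us": "node/1",
}


def resolve(requested_path: str) -> str:
    """Translate an incoming path to the internal path it refers to."""
    # An unknown path is assumed to already be an internal path.
    return ALIASES.get(requested_path, requested_path)


def alias_for(internal: str) -> str:
    """Return the friendly alias for an internal path, if one exists."""
    for alias, target in ALIASES.items():
        if target == internal:
            return alias
    return internal


if __name__ == "__main__":
    print(resolve("about-us"))  # the alias leads to the node
    print(resolve("node/1"))    # ...and so does the original path
    print(alias_for("node/1"))
```

Notice that both about-us and node/1 resolve to the same content. Keep that in mind; it matters later.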

Path aliasing is particularly useful when combined with the PathAuto module, which lets site administrators set up rules that generate path aliases for content automatically. When I first moved my blog from Movable Type to Drupal, PathAuto allowed me to mirror my existing URL structure without any manual tweaking.
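To give a feel for what such a rule does, here is a small Python sketch that builds an alias from a node title. It is an approximation, not PathAuto's actual algorithm: the "[title]" placeholder and both function names are assumptions made for this example, and the real module has many more options (word filtering, length limits, and so on).

```python
import re


def slug_from_title(title: str) -> str:
    """Lowercase a title and collapse anything non-alphanumeric to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def alias_from_pattern(pattern: str, title: str) -> str:
    """Fill a hypothetical "[title]" placeholder in an alias pattern."""
    return pattern.replace("[title]", slug_from_title(title))


if __name__ == "__main__":
    print(alias_from_pattern("blog/[title]", "Drupal, duplicate content, and you"))
```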

When path aliases for nodes are set up to include the node title, search engines are pleased, too. Most search algorithms pay extra attention to text that appears in a page's URL, in the page's title, and inside of important tags like H1 and H2.

Flies in the ointment

If you were paying attention during the explanation above, you noticed that content on a Drupal site can be given friendly URLs, but it stays available at the original, unfriendly URL as well. As far as most search engines are concerned, that means that you have multiple copies of the same content on your web site, and that raises all sorts of alarms. It's widely believed that many search engines -- Google in particular -- penalize sites for putting duplicate pages at different URLs. Without that protection, the thinking goes, unethical site owners could easily put thousands of copies of an article on their site and quickly become the "ultimate source for information" on a topic, even though they only have a tiny amount of unique content.

Does that mean that Drupal sites using path aliases are hurting themselves in the long run? Thankfully, the answer is no. First, Google's documentation for webmasters explains that the only "penalty" is that just one copy of the content will be listed in search results. In fact, a recent post on the Google blog went out of its way to clarify:

Let's put this to bed once and for all, folks: There's no such thing as a "duplicate content penalty." At least, not in the way most people mean when they say that.

In a related post, "Deftly dealing with duplicate content," they explain that the only real concern is making sure that the right path for your page gets displayed when people search on Google.

One of the most important tips mentioned in that post is being consistent when you link to your site's pages. Because Google indexes your site by automatically following all of its links, you should always be sure that URLs on your pages point to the "proper" path rather than the default node/1 style.

Drupal handles this automatically: all links are passed through the l() function before they're displayed. Internally, modules always deal with the standard path (node/1, user/1, and so on) for a page on the site. Before outputting any links to a browser, they hand the l() function the standard path, and it spits out the "best" possible version of that path: a friendly alias if one is available, the default path if the web server supports clean URLs, and the "ugly" index.php style if no other options are available. All Drupal modules are expected to use this function rather than hard-coding URLs: in fact, code that doesn't use the l() function is considered buggy.
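The decision l() makes can be sketched in a few lines. The Python below is a simplified illustration of the logic described above, not Drupal's implementation: the function names, the alias dictionary, and the example.com base URL are all assumptions, and in this sketch an alias is still used inside the q parameter when clean URLs are unavailable.

```python
from html import escape

# Hypothetical alias table: internal path -> friendly alias.
ALIASES = {"node/1": "about-us"}


def best_url(internal_path: str, clean_urls: bool = True,
             base: str = "http://www.example.com") -> str:
    """Pick the friendliest available URL for an internal path."""
    # Prefer an alias when one has been defined for this path.
    path = ALIASES.get(internal_path, internal_path)
    if clean_urls:
        return f"{base}/{path}"
    # Fall back to the "ugly" style when the server cannot rewrite URLs.
    return f"{base}/index.php?q={path}"


def link(text: str, internal_path: str, clean_urls: bool = True) -> str:
    """Build an HTML link, loosely mimicking what a link helper returns."""
    href = escape(best_url(internal_path, clean_urls), quote=True)
    return f'<a href="{href}">{escape(text)}</a>'


if __name__ == "__main__":
    print(link("About us", "node/1"))
    print(link("Another page", "node/2"))
    print(link("Another page", "node/2", clean_urls=False))
```

The point of funnelling everything through one helper is that module code never needs to know which style of URL the site uses.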

Covering all the bases

Thanks to the l() function, links generated by Drupal will always point to the "clean" version of the URL, so Google will never find the duplicate versions by crawling your own site. The unfriendly URLs, though, are still sitting there: what happens if other people link to them, and Google follows those links?

Google recommends using HTTP 301 redirects to solve this problem: they tell web browsers that the requested content actually lives at another URL. Web browsers will automatically jump to the correct URL, and search engine web-crawlers respect these redirects as well.

In Drupal, the Global Redirect module issues a 301 redirect whenever someone visits a standard URL for which a friendly path alias has been defined. It also issues a 301 redirect if someone visits an old-style "ugly" URL like http://www.example.com/index.php?q=node/1. The end result? No more duplicate content, period. Google will always see your content at the best possible URL, regardless of how users link to it.
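Here is a rough Python sketch of the two checks just described. It is my own approximation rather than the module's code: redirect_for and the alias table are hypothetical, and the sketch assumes clean URLs are enabled so that the redirect target never contains index.php.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical alias table: internal path -> friendly alias.
ALIASES = {"node/1": "about-us"}


def redirect_for(url: str):
    """Return (301, target) if the URL should redirect, otherwise None."""
    parts = urlsplit(url)
    if parts.path == "/index.php":
        # Old-style "ugly" URL: the real path is in the q parameter.
        requested = parse_qs(parts.query).get("q", [""])[0]
        ugly = True
    else:
        requested = parts.path.lstrip("/")
        ugly = False
    # Use the alias when one exists; otherwise keep the path as it is.
    preferred = ALIASES.get(requested, requested)
    if ugly or preferred != requested:
        return 301, f"{parts.scheme}://{parts.netloc}/{preferred}"
    return None


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/node/1",
        "http://www.example.com/index.php?q=node/1",
        "http://www.example.com/about-us",
    ):
        print(candidate, "->", redirect_for(candidate))
```

Every variant of the address funnels to a single URL, and a request that already uses that URL is left alone.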

Recap

For everyone who's read this far (or those who skipped to the end for the "good parts"), a summary is in order.

  1. Drupal gives content friendly URLs with the Path module, and automates the process with the PathAuto module. However, content remains available at the old URLs as well.
  2. Thanks to the l() function, Drupal outputs the best possible version of the URL when generating links to internal content.
  3. If Google finds links to the "ugly" URLs, it will index them, but only one version of the page will be displayed in search results.
  4. To ensure the best version of every URL appears in Google search results, use the Global Redirect module.
