by Jeff Eaton on January 21, 2013

Module Monday: Link Checker

In honor of Module Monday's post-holiday return, we're taking a look at a problem that plagues many sites: dead links. If you maintain content that contains links to other sites, it's inevitable that some of them will ultimately go bad. Domains expire, sites go down, articles are unpublished, blogs migrate to a new CMS and change their URL patterns... and eventually you're left with a dusting of broken URLs in your otherwise pristine content. That's where LinkChecker comes in. It's a module for Drupal 6 and 7 that scans your content for busted links, tells you what nodes need fixing, and -- optionally -- tidies up the ones it can fix automatically.

Screenshot of administration screen

LinkChecker runs at cron time on your Drupal site, churning its way through a bite-size number of nodes each time and scanning them for URLs. It pings each URL and, if a working web page is still there, moves on. If not, it logs the specific HTTP error for later reference before continuing. It's handy, but the devil is always in the details -- and LinkChecker is designed to handle all of them with aplomb. Do you need to exclude certain content types so they aren't scanned? No problem. Need to make sure that dummy URLs like "example.com" don't get checked and generate false positives? It handles that, too. Need to hunt for URLs contained in dedicated CCK or FieldAPI Link fields, in addition to text fields and node bodies? No problem. Want to check image links that reside on remote servers, or check URLs that are generated by Drupal input filters even though they don't appear in the "raw" text of the node? LinkChecker allows a site administrator to toggle all of those options and more.
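To make the batch-and-check idea concrete, here is a minimal sketch in Python. It is not LinkChecker's actual code (the module itself is PHP), and every name in it (`extract_urls`, `is_excluded`, `check_batch`, `EXCLUDED_HOSTS`, the `fetch_status` callback) is made up for illustration. It assumes each node is just an ID plus body text, and it takes the HTTP lookup as a parameter so the logic can be tried without touching the network.

```python
import re
from urllib.parse import urlparse

# Hypothetical names throughout; an illustration, not LinkChecker's code.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")
EXCLUDED_HOSTS = {"example.com", "example.org", "example.net"}


def extract_urls(text):
    """Return the unique http(s) URLs found in text, in order of appearance."""
    seen = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def is_excluded(url):
    """True for dummy hosts (and their subdomains) that should never be checked."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS)


def check_batch(nodes, fetch_status, batch_size=2):
    """Check the links in the first batch_size nodes.

    nodes: list of (node_id, body_text) tuples still waiting to be scanned.
    fetch_status: callable taking a URL and returning an HTTP status code.
    Returns (remaining_nodes, broken), where broken is a list of
    (node_id, url, status) for every response with status >= 400.
    """
    batch, remaining = nodes[:batch_size], nodes[batch_size:]
    broken = []
    for node_id, body in batch:
        for url in extract_urls(body):
            if is_excluded(url):
                continue
            status = fetch_status(url)
            if status >= 400:
                broken.append((node_id, url, status))
    return remaining, broken
```

Each call plays the role of one cron run: it handles a small slice of nodes and hands back whatever is left for next time. In a real setup `fetch_status` would make an HTTP request; passing a stub instead makes the behavior easy to test.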

The module can correct certain kinds of errors automatically (301 redirects, for example), but it's up to the site's administrator to check the report that it generates for news about broken links. There, each node with busted links can be reviewed and edited.
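As a rough picture of what "fixing a 301 automatically" might involve, here is a small Python sketch. It is an assumption about the general technique, not a description of how LinkChecker does it: the function name `apply_permanent_redirects` and the shape of the `results` dictionary are invented for the example. Links that redirect permanently are rewritten in the text, and everything that returned an error is collected for a human to review.

```python
import re


def apply_permanent_redirects(body, results):
    """Rewrite permanently redirected URLs in body; collect the broken ones.

    results: dict mapping a checked URL to (status, location), where
    location is the redirect target or None. Hypothetical structure.
    Returns (fixed_body, unresolved), with unresolved a list of
    (url, status) pairs for responses with status >= 400.
    """
    fixed = body
    unresolved = []
    for url, (status, location) in results.items():
        if status == 301 and location:
            # Only replace the URL when it is not the prefix of a longer
            # one, e.g. leave ".../old-page" alone when fixing ".../old".
            pattern = re.escape(url) + r"(?!\.?[\w/%?#=&-])"
            fixed = re.sub(pattern, lambda m, loc=location: loc, fixed)
        elif status >= 400:
            unresolved.append((url, status))
    return fixed, unresolved
```

Only the 301 case is touched here on purpose: a permanent redirect is the server saying the new address is the one to use, whereas a 404 or a timeout needs someone to decide what the link should become.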

Screenshot of resulting change to site

LinkChecker is a tremendously useful tool, and its smorgasbord of configuration options means that it can deal with lots of oddball edge cases. One missing option that would still be welcome? An easy way to export the "busted links" report to a text file for review. For extremely large sites, dedicated third-party web crawlers with link checking functions may be a more robust solution, but for Drupal admins who need a hand keeping their sites tidy, LinkChecker is a lifesaver.

Jeff Eaton

Senior Digital Strategist


Comments

Walter Daniels

I get a lot of false not founds

These look like they are caused by the checker timing out on sites, as the links work fine when I click on them. Do others have this problem?


Alex


If you can post a few of these examples to the linkchecker queue, we can review what's going wrong there. In most cases it's the firewall on the server that is running linkchecker... :-). Keep in mind that linkchecker reports what it gets... If something in the underlying infrastructure (PHP, firewall, loopback interface) is not behaving correctly, linkchecker cannot work wonders. But there is nothing to worry about... All this stuff can be fixed on the admin side.


Rob

Performance Improvements

Been using this one for a while and always had a bit of trouble with performance when working through quite a large site, but the latest release added support for parallel link checking via HTTPRL, and that has been a big improvement.


Alex


What do you mean by "performance troubles"? Linkchecker does not need many resources. It runs very smartly in the background. It's just cron that runs for about 120 seconds (as it needs to check all the links). But CPU usage is nearly zero and memory use is also low, as drupal_http_request() mostly just waits for remote servers to answer. With HTTPRL, cron still runs, now for 180s, but it's "invisible" in the background, and you do not see the long cron run time if you execute cron.php via URL; that's all. You should see higher CPU/DB load with HTTPRL compared to core, but the time needed to complete all these many link checks may be ~10 times shorter.

How many links are you checking (the number is next to the HTTPRL settings)? I'm still looking for extremely large sites with hundreds of thousands of links to check, to see how it performs.


Anonymous

broken links check

This is one fantastic FREE tool. It shows you each link's HTTP status code along with the results for broken links, which is a plus if you are thinking about checking your website for broken links.


