by Karen Stevenson on January 8, 2014

Sending a Drupal Site Into Retirement

Ideas for how to gracefully retire (or semi-retire) a Drupal site using HTTrack and GitHub Pages.

Drupal is a great tool for creating a site. It has lots of modules and functionality that allow you to build interesting and complex features. But sometimes those sites lose their relevance. It's a site for an event that has passed, for instance. Or a site for a topic that was really important at one time but now is mostly useful as a reference for the content it contains. Or it's a site you just don't have time to keep on top of. In all these cases you could just take the site down entirely, but often it contains useful information that you'd like to keep online, and if other people are linking to it, it would be nice not to break all those connections.

But maintaining an inactive Drupal site can be a pain. There is a constant stream of security releases that you need to apply. And it's really maddening if you apply a security release to an inactive site only to find out the release contains other changes that break things that used to work, so that you have to spend time trying to get that inactive site working again. Not to mention that it's expensive to pay for hosting that can securely deploy Drupal sites if you aren't even using any Drupal interactivity any more.

One solution is to convert the site to static HTML pages. A site serving up only static pages, with no database or Drupal back end running, is likely to be pretty secure. And it will serve pages very quickly as well.

My Solution: HTTrack and GitHub Pages

There are various ways to accomplish this. You can use wget to spider a site and copy pages, or try out the Drupal Boost module (which creates static pages but still requires that Drupal be installed behind it). I finally settled on a solution that uses HTTrack to spider my Drupal site and create static pages without any dependency on Drupal. To serve those pages I will use GitHub Pages. I'm already using GitHub, and GitHub Pages is free. GitHub Pages can be used to deploy Jekyll sites, but Jekyll is perfectly happy to serve up static HTML, so I don't have to do anything but create functional HTML pages to get this solution working.

I created a project on GitHub to try this idea out. I created the original Drupal site, Save My Airport, as a protest when the FAA announced they were going to close the airport watch towers at dozens of smaller airports as a cost-cutting move. My airport was one of the ones affected, but I was equally incensed about the impact on other small airports, so I did what I do: I created a web site about the problem. The problem has receded in urgency, but is likely to re-emerge because they didn't come up with a permanent solution. So what I really want to do is semi-retire the site. I can re-deploy it later if necessary.

It's a fairly complex site, created using Panels and lots of views. I used Feeds to pull in statistics about all the airports in the country and created a page for each airport with a map, traffic and other statistics, and information about what FAA actions affected it. I also pulled in links to news about the topic from all over the web, and there are a couple of paged views of airports and news.

Transforming all this to static pages would not be a walk in the park.

Inactivate the Site

There are several steps to take with any site that is not going to get regular attention, whether or not you are going to archive it or create static files. These include:

Clean up any Views views

  • Remove exposed filters
  • Remove clickable table column headers
  • Don't use AJAX

Other Tasks

  • Disable all comments (or only use third party comments, like Disqus or Facebook comments)
  • Remove the contact form
  • Disable search (or only use third party search, like Google)
  • Remove login and user blocks
  • Make sure JS and CSS aggregation are turned on

Another task is to make sure no error messages will appear in your static content. Find the following in page.tpl.php and either remove it or comment it out while you're spidering the site:

<?php
print $messages;
?>

Finally, review the site as an anonymous user to see if there are any other elements that won't work if Drupal is not actively running in the background.
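Once the static files exist, part of that review can be repeated against the output. The following is only a sketch under assumptions: `find_dynamic_leftovers` is a hypothetical helper name, and the two patterns it looks for (a `<form` tag, or a link ending in `/user` or `/user/login`) are guesses at what typically needs a live Drupal back end, not a complete list.

```shell
# Hypothetical helper (not part of the original workflow): list static HTML
# files that still contain markup which normally needs Drupal running.
find_dynamic_leftovers() {
  # $1 is the directory holding the generated files (defaults to ".").
  # grep -l prints only file names, -i ignores case, -E enables extended regex.
  find "${1:-.}" -name '*.html' -type f -exec \
    grep -liE '<form|href="[^"]*/user(/login)?"' {} + | sort
}
```

Running `find_dynamic_leftovers .` from the top of the static directory prints one file name per line; an empty result means neither pattern was found.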

Create GitHub Page

I started by creating a new repository and setting it up to use GitHub Pages. I just followed the instructions to create a simple Hello World repository to be sure it was working. Basically it's a matter of creating a branch called "gh-pages" in the repository, and then committing an index.html file that displays "Hello World".
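As a rough sketch of those steps with plain git commands: the function name `init_gh_pages` and the commit message are assumptions made for illustration, and the remote setup is left as a comment because it depends on your own repository.

```shell
# Sketch: create a repository whose gh-pages branch holds one Hello World page.
# init_gh_pages is a hypothetical helper; $1 is the directory to create.
init_gh_pages() (
  # The body runs in a subshell so the caller's working directory is untouched.
  mkdir -p "$1" && cd "$1" || exit 1
  git init -q &&
  git checkout -q -b gh-pages &&       # the branch named in the text above
  echo "Hello World" > index.html &&
  git add index.html &&
  git commit -q -m "Add Hello World page"
  # Publishing needs a remote, for example:
  #   git remote add origin <your repository URL>
  #   git push origin gh-pages
)
```

The commit step relies on a git identity (user.name and user.email) already being configured.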

I created a repository at http://github.com/karens/savemyairport. I could view my new page at http://karens.github.io/savemyairport

Create Static Pages with HTTrack

The easiest way to install httrack on a Mac is with Homebrew:

brew install httrack

I spent some time in the documentation trying to find the ideal way to use HTTrack, and finally came up with the following command. Change into the new GitHub Pages directory on your machine and execute it:

httrack http://LOCAL_URI -O . -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0

One of the biggest problems in transforming a dynamic site into static pages is that the URLs must change. The 'real' URL of a Drupal page is 'index.php?q=/news' or 'index.php?q=/about', i.e. there is really only one page that dynamically re-renders itself depending on the requested path. A static site has to have one HTML page for every page of the site, so the new URL has to be '/news.html' or '/news/index.html'. The good thing about the second option is that incoming links to '/news' will automatically be routed to '/news/index.html' if it exists, so that second pattern is the one I want to use.

The -N flag in the command will rewrite the pages of the site, including pager pages, into the pattern "/about/index.html". Without the -N flag, the page at "/about" would have been transformed into a file called "about.html".

The pattern also tells httrack to find a value in the query string called "page" and insert that value, if it exists, into the URL pattern in the spot marked by [page]. Paged views will create links like "/about/index2.html", "/about/index3.html" for each page of the view. Without specifying this, the pager links would have been created as meaningless hash values of the query string. This way the pager links are user friendly and similar to (but not quite the same as) the original link URLs.
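To make the naming concrete, here is a small illustration of the mapping described above. It only mimics the resulting file names; it is not how HTTrack works internally, and `static_path` is a hypothetical helper whose second argument stands in for the "page" query-string value.

```shell
# Illustration only: reproduce the file naming described in the text.
# $1 is the original Drupal path, $2 an optional "page" value.
static_path() {
  path="${1#/}"      # strip the leading slash
  path="${path%/}"   # strip a trailing slash, if any
  printf '/%s/index%s.html\n' "$path" "${2:-}"
}

static_path /about      # prints /about/index.html
static_path /about 2    # prints /about/index2.html
```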

Shortly after the process starts it will stop and ask you a question about how far to go in following links. I answer '*' to that question.

I ran HTTrack on a local version of my site and it took about a half hour to spider the site and create about 2,000 files, including pages for every airport and news item and every page of my paged views. You can use HTTrack across the network on the live site url, but that would be very slow, so it makes sense to do this on a local copy if possible.

Watch the progress as it goes to see what sections of the site it is navigating into. The '%v' flag in the command tells it to use verbose output.

If you see it veering into sections you don't want saved, you can add something like the following to keep it out of a particular sub-section:

-/news*
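After a crawl finishes, one way to see where the files ended up is to count the generated pages per top-level directory. This is a sketch, not part of HTTrack: `count_by_section` is a hypothetical helper name, and it assumes each section of the site maps to a top-level directory.

```shell
# Hypothetical helper: count generated HTML files per top-level directory.
# $1 is the directory holding the static files (defaults to ".").
count_by_section() (
  cd "${1:-.}" || exit 1
  # -mindepth 2 skips files sitting directly in the top directory.
  find . -mindepth 2 -name '*.html' -type f |
    cut -d/ -f2 |            # "./news/some-item/index.html" -> "news"
    sort | uniq -c | sort -rn
)
```

An unexpectedly large count for a section is a hint that a filter for it might be worth adding before the next run.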

I then committed this to the gh-pages branch of my repository, and in a few minutes I could view the result at http://karens.github.io/savemyairport.
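The commit step can be sketched as follows. The helper name `publish_static` and the default commit message are assumptions for illustration; it expects to run inside a checkout of the gh-pages branch whose "origin" remote points at the GitHub repository.

```shell
# Sketch: commit everything in the current directory and push it to gh-pages.
# $1 is an optional commit message.
publish_static() {
  git add -A &&
  git commit -q -m "${1:-Regenerate static site}" &&
  git push -q origin gh-pages
}
```

Because the steps are chained with `&&`, nothing is pushed if the commit fails, for example when there are no changes to commit.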

There was one final bit of cleanup to do. Although incoming links to /airports/ryan-field will now work, internal links still look like this in the HTML:

/airports/ryan-field/index.html

A quick command line fix to clean that up is to run this, from the top of the directory that contains the static files:

find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe 's{/index\.html}{/}g'

That will change all the internal links in those 2,000 pages from "/airports/ryan-field/index.html" to "/airports/ryan-field/", and I now have a static site that pretty closely mirrors the file structure and URL pattern of the original site.
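Since `perl -i` edits the files in place, it can be reassuring to preview the affected links first. A minimal sketch, assuming links are written as double-quoted href attributes; `preview_index_links` is a hypothetical helper name.

```shell
# Hypothetical helper: list the distinct href values ending in /index.html
# under a directory, without modifying any file.
# $1 is the directory holding the static files (defaults to ".").
preview_index_links() {
  # grep -o prints only the matching part, -h omits file names.
  find "${1:-.}" -name '*.html' -type f -exec \
    grep -ohE 'href="[^"]*/index\.html"' {} + | sort -u
}
```

Pager links such as "/about/index2.html" are not listed, since they do not end in "/index.html".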

The final step is to have the old domain name redirect to the new GitHub Pages site. GitHub provides instructions about how to do that.
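On the repository side, GitHub Pages reads a custom domain from a file named CNAME at the root of the published branch. The sketch below only writes that file; `set_custom_domain` is a hypothetical helper, and the DNS records for the domain still have to be set up as described in GitHub's instructions.

```shell
# Sketch: record a custom domain for a GitHub Pages site.
# $1 is the bare domain name, $2 the root of the gh-pages checkout (defaults to ".").
set_custom_domain() {
  printf '%s\n' "$1" > "${2:-.}/CNAME"
}
```

For this site the call would be `set_custom_domain savemyairport.com`, followed by committing and pushing the new file.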

Next Steps

For some sites, this is all there is to do. The sites are now retired and will never change again. The Drupal site that created them can be taken down and these pages can live on as a permanent archive of the site.

But in the case of a semi-retired site there is the question about how to make occasional changes in the future.

My current plan is to maintain the local Drupal installation but keep it offline. If I want to make changes in the future, I'll update my local site, re-generate the static pages with HTTrack using the method above, and push the changes up. Since Drupal is not publicly available, I won't have to worry about security updates, as long as it works well enough to re-generate the site when necessary. If I'm not making changes very often that will work out fine, and it preserves the option of bringing the site back up as a Drupal site in the future if events warrant.

Another idea for a site in semi-retirement is to use HTTrack to actually transform it into a Jekyll site, where the static pages can live on as-is, but I can periodically add some new content to the News section. That is an intriguing idea that I'll explore in another article.

If you're interested, you can view Save My Airport, which is now a fully static site hosted on GitHub Pages, created from a Drupal site using the process outlined above.

Karen Stevenson

Senior Drupal Architect

Comments

Jakob Persson

SiteSucker

Thanks for the post, Karen! Just wanted to mention I've used software called SiteSucker to create static versions of my sites. It's probably not as feature-rich as httrack but the user interface is quite intuitive and it does what it should. It's free and available in the Mac App Store.

KarenS

One of the things I like about the HTTrack solution is the ability to control the URLs it creates. I can make a site where the original URLs will still work and incoming links from other sites are unbroken. But there are other solutions that are less work if you don't care about that.

KarenS

Also I was concerned about making sure that the multi-page tables would still work, with links to the additional pages that made sense, and this solution did that. For instance, if you look at http://savemyairport.com/airports/ you can see that the static table paging looks and works nicely, almost exactly like the original. I don't know for sure how other spidering solutions would handle that.

KarenS

Yes, I mentioned wget in the article. The problem with wget is that it won't rewrite the URLs: a page at /about will become about.html, and I want it to be about/index.html.

Mark Figart

Great idea that parallels my own post for today

Karen, I'm seeing so much more about the value of static sites that I just posted my own article about it today: http://bit.ly/1eUGmLQ

I saw Amitai's work (he commented above) recently, and was already familiarizing myself with Jekyll as well ( http://bit.ly/1mbwEtH). I think there is so much to this whole approach of pairing the CMS with static files.

Wasn't aware of httrack, so can't wait to check it out. We also use sitesucker around here a lot, which I've always assumed was just some sort of wrapper around wget... though I could be wrong.

Anyway, thanks for imparting some great ideas here. Excellent post.

Mark

Ronan

HTTrack exactly what I need

I am building a site at the minute that will be useless to me in a few months . . . hadn't even thought of saving it, just deleting it.

Does HTTrack work the same as your normal filezilla programmes and other file transfer programmes?
