A Toolset For Enterprise Content Inventories

A collection of handheld tools

Earlier this year, Lullabot began a four-month-long content strategy engagement for the state of Georgia. The project would involve coming up with a migration plan from Drupal 7 to Drupal 8 for 85 of their state agency sites, with an eye towards a future where content can be more freely and accurately shared between sites. Our first step was to get a handle of all the content on their existing sites. How much content were we dealing with? How is it organized? What does it contain? In other words, we needed a content inventory. Each of these 85 sites was its own individual install of Drupal, with the largest containing almost 10K unique URLs, so this one was going to be a doozy. We hadn't really done a content strategy project of this scale before, and our existing toolset wasn't going to cut it, so I started doing some research to see what other tools might work. 

Open up any number of content strategy blogs and you will find an endless supply of articles explaining why content inventories are important, and templates for storing said content inventories. What you will find a distinct lack of is the how: how does the data get from your website to the spreadsheet for review? For smaller sites, manually compiling this data is reasonably straightforward, but once you get past a couple hundred pages, this is no longer realistic. In past Drupal projects, we have been able to use a dump of the routing table as a great starting point, but with 85 sites even this would be unmanageable. We quickly realized we were probably looking at a spider of some sort. What we needed was something that met the following criteria:

  • Flexible: We needed the ability to scan multiple domains into a single collection of URLs, as well as the ability to include and exclude URLs that met specific criteria. Additionally, we knew that there would be times when we might want to just grab a specific subset of information, be it by domain, site section, etc. We honestly weren't completely sure what all might come in handy, so we wanted some assurance that we would be able to flexibly get what we needed as the project moved forward.
  • Scalable: We are looking at hundreds of thousands of URLs across almost a hundred domains, and we knew we were almost certainly going to have to run it multiple times. A platform that charged per-URL was not going to cut it.
  • Repeatable: We knew this was going to be a learning process, and, as such, we were going to need to be able to run a scan, check it, and iterate. Any configuration should be saveable and cloneable, ideally in a format suitable for version control which would allow us to track our changes over time and more easily determine which changes influenced the scan in what ways. In a truly ideal scenario, it would be scriptable and able to be run from the command line.
  • Analysis: We wanted to be able to run a bulk analysis on the site’s content to find things like reading level, sentiment, and reading time. 

Some of the first tools I found were hosted solutions like Content Analysis Tool and DynoMapper. The problem is that these tools charge on a per-URL basis, and weren't going to have the level of repeatability and customization we needed. (This is not to say that these aren't fine tools, they just weren't what we were looking for in terms of this project.) We then began to architect our own tool, but we really didn't want to add the baggage of writing it onto an already hectic schedule.

Thankfully, we were able to avoid that, and in the process discovered an incredibly rich set of tools for creating content inventories which have very quickly become an absolutely essential part of our toolkit. They are:

  • Screaming Frog SEO Spider: An incredibly flexible spidering application. 
  • URL Profiler: A content analysis tool which integrates well with the CSVs generated by Screaming Frog.
  • GoCSV: A robust command line tool created with the sole purpose of manipulating very large CSVs very quickly.

Let's look at each of these elements in greater detail, and see how they ended up fitting into the project.

Screaming Frog

 

A screenshot of the Screaming Frog SEO Spider
The main workspace for the Screaming Frog SEO Spider

Screaming Frog is an SEO consulting company based in the UK. They also produce the Screaming Frog SEO Spider, an application which is available for both Mac and Windows. The SEO Spider has all the flexibility and configurability you would expect from such an application. You can very carefully control what you do and don’t crawl, and there are a number of ways to report the results of your crawl and export it to CSVs for further processing. I don’t intend to cover the product in depth. Instead, I’d like to focus on the elements which made it particularly useful for us.

Repeatability

A key feature in Screaming Frog is the ability to save both the results of a session and its configuration for future use. The results are important to save because Screaming Frog generates a lot of data, and you don’t necessarily know which slice of it you will need at any given time. Having the ability to reload the results and analyze them further is a huge benefit. Saving the configuration is key because it means that you can re-run the spider with the exact same configuration you used before, meaning your new results will be comparable to your last ones. 

Additionally, the newest version of the software allows you to run scans using a specific configuration from the command-line, opening up a wealth of possibilities for scripted and scheduled scans. This is a game-changer for situations like ours, where we might want to run a scan repeatedly across a number of specific properties, or set our clients up with the ability to automatically get a new scan every month or quarter.

Extraction

 

A screenshot showing Screaming Frog's extraction configuration
The Screaming Frog extraction configuration screen

As we explored what we wanted to get out of these scans, we realized that it would be really nice to be able to identify some Drupal-specific information (NID, content type) along with the more generic data you would normally get out of a spider. Originally, we had thought we would have to link the results of the scan back to Drupal’s menu table in order to extract that information. However, Screaming Frog offers the ability to extract information out of the HTML in a page based on XPath queries. Most standard Drupal themes include information about the node inside the CSS classes they create. For instance, here is a fairly standard Drupal body tag.

<body class="html not-front not-logged-in no-sidebars page-node page-node- page-node-68 node-type-basic-page">

As you can see, this class contains both the node’s ID and its content type, which means we were able to extract this data and include it in the results of our scan. The more we used this functionality, the more uses we found for it. For instance, it is often useful to be able to identify pages with problematic HTML early on in a project so you can get a handle on problems that are going to come up during migration. We were able to do things like count the number of times a tag was used within the content area, allowing us to identify pages with inline CSS or JavaScript which would have to be dealt with later.

We’ve only begun to scratch the surface of what we can do with this XPath extraction capability, and future projects will certainly see us dive into it more deeply. 

Analytics

 

A screenshot showing some of the metrics available through Screaming Frog's integration with Google Analytics
A sample of the metrics available when integrating Google Analytics with Screaming Frog

Another set of data you can bring into your scan is associated with information from Google Analytics. Once you authenticate through Screaming Frog, it will allow you to choose what properties and views you wish to retrieve, as well as what individual metrics to report within your result set. There is an enormous number of metrics available, from basics like PageViews and BounceRate to extended reporting on conversions, transactions, and ad clicks. Bringing this analytics information to bear during a content audit is the key to identifying which content is performing and why. Screaming Frog also has the ability to integrate with Google Search Console and SEO tools like Majestic, Ahrefs, and Moz.

Cost

Finally, Screaming Frog provides a straightforward yearly license fee with no upcharges based on the number of URLs scanned. This is not to say it is cheap, the cost is around $200 a year, but having it be predictable without worrying about how much we used it was key to making this part of the project work. 

URL Profiler

 

Screenshot of the main workspace for URL Profiler
The main workspace for URL Profiler

The second piece of this puzzle is URL Profiler. Screaming Frog scans your sites and catalogs their URLs and metadata. URL Profiler analyzes the content which lives at these URLs and provides you with extended information about them. This is as simple as importing a CSV of URLs, choosing your options, and clicking Run. Once the run is done, you get back a spreadsheet which combines your original CSV with the data URL Profiler has put together. As you can see, it provides an extensive number of integrations, many of them SEO-focused. Many of these require extended subscriptions to be useful, however, the software itself provides a set of content quality metrics by checking the Readability box. These include

  • Reading Time
  • 10 most frequently used words on the page
  • Sentiment analysis (positive, negative, or neutral)
  • Dale-Chall reading ease score
  • Flesh-Kincaid reading ease score
  • Gunning-Fog estimation of years of education needed to understand the text
  • SMOG Index estimation of years of education needed to understand the text


While these algorithms need to be taken with a grain of salt, they provide very useful guidelines for the readability of your content, and in aggregate can be really useful as a broad overview of how you should improve. For instance, we were able to take this content and create graphs that ranked state agencies from least to most complex text, as well as by average read time. We could then take read time and compare it to "Time on Page" from Google Analytics to show whether or not people were actually reading those long pages. 

On the downside, URL Profiler isn't scriptable from the command-line the way Screaming Frog is. It is also more expensive, requiring a monthly subscription of around $40 a month rather than a single yearly fee. Nevertheless, it is an extremely useful tool which has earned a permanent place in our toolbox. 

GoCSV​

One of the first things we noticed when we ran Screaming Frog on the Georgia state agency sites was that they had a lot of PDFs. In fact, they had more PDFs than they had HTML pages. We really needed an easy way to strip those rows out of the CSVs before we ran them through URL Profiler because URL Profiler won’t analyze downloadable files like PDFs or Word documents. We also had other things we wanted to be able to do. For instance, we saw some utility in being able to split the scan out into separate CSVs by content type, or state agency, or response code, or who knows what else! Once again I started architecting a tool to generate these sets of data, and once again it turned out I didn't have to.

GoCSV is an open source command-line tool that was created with the sole purpose of performantly manipulating large CSVs. The documentation goes into these options in great detail, but one of the most useful functions we found was a filter that allows you to generate a new subset of data based on the values in one of the CSV’s cells. This allowed us to create extensive shell scripts to generate a wide variety of data sets from the single monolithic scan of all the state agencies in a repeatable and predictable way. Every time we did a new scan of all the sites, we could, with just a few keystrokes, generate a whole new set of CSVs which broke this data into subsets that were just documents and just HTML, and then for each of those subsets, break them down further by domain, content type, response code, and pre-defined verticals. This script would run in under 60 seconds, despite the fact that the complete CSV had over 150,000 rows. 

Another use case we found for GoCSV was to create pre-formatted spreadsheets for content audits. These large-scale inventories are useful, but when it comes to digging in and doing a content audit, there’s just way more information than is needed. There were also a variety of columns that we wanted to add for things like workflow tracking and keep/kill/combine decisions which weren't present in the original CSVs. Once again, we were able to create a shell script which allowed us to take the CSVs by domain and generate new versions that contained only the information we needed and added the new columns we wanted. 

What It Got Us

Having put this toolset together, we were able to get some really valuable insights into the content we were dealing with. For instance, by having an easy way to separate the downloadable documents from HTML pages, and then even further break those results down by agency, we were able to produce a chart which showed the agencies that relied particularly heavily on PDFs. This is really useful information to have as Georgia’s Digital Services team guides these agencies through their content audits. 

 

Graph showing the ratio of documents to HTML pages per state agency
Ratio of documents to HTML pages per state agency

One of the things that URL Profiler brought into play was the number of words on every page in a site. Here again, we were able to take this information, cut out the downloadable documents, and take an average across just the HTML pages for each domain. This showed us which agencies tended to cram more content into single pages rather than spreading it around into more focused ones. This is also useful information to have on hand during a content audit because it indicates that you may want to prioritize figuring out how to split up content for these specific agencies.

 

Graph showing the the average word count of all content per state agency, grouped by pages of text.
Average word count per state agency, grouped by how many pages of text they have.

Finally, after running our scans, I noticed that for some agencies, the amount of published content they had in Drupal was much higher than what our scan had found. We were able to put together the two sets of data and figure out that some agencies had been simply removing links to old content like events or job postings, but never archiving it or removing it. These stranded nodes were still available to the public and indexed by Google, but contained woefully outdated information. Without spidering the site, we may not have found this problem until much later in the process. 

Looking Forward

Using Screaming Frog, URL Profiler, and GoCSV in combination, we were able to put together a pipeline for generating large-scale content inventories that was repeatable and predictable. This was a huge boon not just for the State of Georgia and other clients, but also for Lullabot itself as we embark on our own website re-design and content strategy. Amazingly enough, we just scratched the surface in our usage of these products and this article just scratches the surface of what we learned and implemented. Stay tuned for more articles that will dive more deeply into different aspects of what we learned, and highlight more tips and tricks that make generating inventories easier and much more useful. 

Greg Dunlap

Thumbnail
Greg has been involved with Drupal for 8+ years, specializing in configuration management and deployment issues.