Calais? I get e-mail about that at least 12,238 times a day. Buzz off.
No, not Cialis, silly... Calais (pronounced cull-AY). It's a free (as in cost) natural language processing, rich semantic metadata, web service, uh... thingy. The video on the front page is jargon-tastic for those into that sort of thing. But to cut to the chase, it basically reads in text from your site and, based on the bazillions of other sites' text from other people using the service, it figures out some sensible tags for you automatically so your editors don't have to do it.
However, rather than being a simple flat set of free tags, the tags are grouped into areas analogous to Drupal's taxonomy vocabularies. These areas fall under "Entities" (People, Companies, Cities, etc.), "Facts" (a person's Position or relationships between entities), and "Events" (Sporting, ManagementChange, etc.). The results are passed back in an open, semantic web-compatible format (Resource Description Framework, or RDF), which then allows you to form intelligent relationships between articles based on their subject matter. This can be used for things like assisting with SEO, getting better search results, creating an "Other articles like this" block, pulling in external data from other sources that speak RDF, or whatever else you can imagine doing with this kind of information.
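To make the "Other articles like this" idea a bit more concrete, here is a minimal Python sketch of one way to relate articles through shared tags. It is not how the Calais module does it; the article IDs, the `ARTICLE_TAGS` data, and the `related_articles` function are all made up for illustration, and the Jaccard overlap score is simply one reasonable choice.

```python
# Hypothetical stand-in for the tags Calais might return per article.
# Each tag is a (vocabulary, term) pair, e.g. ("Company", "Nintendo").
ARTICLE_TAGS = {
    "nintendo-ds-launch": {
        ("Company", "Nintendo"),
        ("Product", "Nintendo DS"),
        ("Country", "Japan"),
    },
    "wii-sales-record": {
        ("Company", "Nintendo"),
        ("Product", "Wii"),
        ("Country", "Japan"),
    },
    "obama-inauguration": {
        ("Person", "Barack Obama"),
        ("City", "Washington"),
    },
}


def related_articles(article_id, article_tags, limit=5):
    """Rank other articles by the Jaccard overlap of their tag sets."""
    own = article_tags[article_id]
    scored = []
    for other_id, tags in article_tags.items():
        if other_id == article_id:
            continue
        union = own | tags
        score = len(own & tags) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, other_id))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:limit]]
```

Because the tags carry their vocabulary, "Nintendo" the company and a hypothetical "Nintendo" in some other vocabulary would not be confused, which is the practical benefit of grouped tags over a flat list.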
How does it work?
To get an idea of how it works, chuck some text at http://viewer.opencalais.com/. For example, here's a Wikinews teaser about President Obama:
It correctly identifies the topic of the article as "Politics," "Washington,United States" as a "City" that the article is about, "Barack Obama" and "George W. Bush" as "Person" entities, and even finds a quotation by George W. Bush (though sadly, not as entertaining as some). Pretty nifty!
So, wanna wire this up with Drupal? Let's find out how!
Stuff you will need
- Calais module: http://drupal.org/project/opencalais. I'm using 6.x-3.1
- Resource Description Framework (RDF) module: http://drupal.org/project/rdf. I'm using 6.x-1.0-alpha7
- ARC 2 RDF classes for PHP: http://arc.semsol.org/download. I'm using ARC 2 from 2009-03-05.
- A Calais API key from http://opencalais.com/ (it's a Drupal site, so you shouldn't have any trouble ;))
- Download and extract the Calais and RDF modules into your modules directory, as per usual (typically, sites/all/modules/calais and sites/all/modules/rdf).
- Download and extract the ARC 2 library to a "vendor" folder in the RDF module's directory (sites/all/modules/rdf/vendor/arc).
- Enable the following modules:
- Calais package
- Calais API
- RDF package
- Navigate to Administer >> Site configuration >> Calais Configuration (admin/settings/calais) and enter your Calais API key.
- Click the Calais Node Settings tab (admin/settings/calais/calais-node). There are several collapsed fieldsets here: "Global" and another one for each content type on your system.
- Next, you have to turn on Calais processing on one or more of your content types. Expand one of the content type fieldsets (for example, "Story") and select the type of Calais processing that should be done: whether Calais should merely suggest terms (visible on a tab), or automatically apply the terms it discovers, either once when the content is first inserted, or each time the content is updated.
You can also play with the Relevancy Threshold setting. This dictates at what confidence level a tag will get applied to the node, from 0% (tag it with whatever tags Calais comes up with) to 100% (only the tags it's really sure about).
- If you expand the "Global" fieldset, you'll also see a large list of the vocabularies that Calais knows about. Each of these maps to a Drupal taxonomy vocabulary, which you can view at Administer >> Content management >> Taxonomy (admin/content/taxonomy). These vocabularies will hold the tags that Calais discovers about your content in each area.
You can check and uncheck options here; for example, if your site has a lot of Windows-related content, you might want to uncheck "MedicalCondition" and "MedicalTreatment" so that the words "virus" and "inoculate" are not misinterpreted.
A full list of entities, along with examples, is available in the Calais documentation.
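If it helps to see how the processing mode, the Relevancy Threshold, and the enabled vocabularies interact, here is a rough Python sketch of the decision. It is only a model of the settings described above, not the module's actual code: the mode names, the `ContentTypeSettings` class, and the `tags_to_apply` function are invented for illustration, and relevance is assumed to be a number between 0.0 and 1.0.

```python
from dataclasses import dataclass, field

# Hypothetical names for the three processing modes described above.
SUGGEST = "suggest"            # show terms on the Calais tab only
APPLY_ONCE = "apply_once"      # auto-apply when the content is first created
APPLY_ALWAYS = "apply_always"  # auto-apply every time the content is saved


@dataclass
class ContentTypeSettings:
    mode: str = SUGGEST
    threshold: float = 0.0  # 0.0 = accept everything, 1.0 = only sure things
    enabled_vocabularies: set = field(default_factory=set)


def tags_to_apply(suggestions, settings, is_new):
    """Return the (vocabulary, term) pairs that would be applied on this save.

    `suggestions` is a list of (vocabulary, term, relevance) tuples.
    """
    if settings.mode == SUGGEST:
        return []
    if settings.mode == APPLY_ONCE and not is_new:
        return []
    return [
        (vocab, term)
        for vocab, term, relevance in suggestions
        if vocab in settings.enabled_vocabularies
        and relevance >= settings.threshold
    ]


suggestions = [
    ("Company", "Nintendo", 0.85),
    ("Country", "Canada", 0.20),
    ("MedicalCondition", "virus", 0.60),
]
settings = ContentTypeSettings(
    mode=APPLY_ONCE,
    threshold=0.5,
    enabled_vocabularies={"Company", "Country"},
)
print(tags_to_apply(suggestions, settings, is_new=True))
# prints [('Company', 'Nintendo')]
```

Here "Canada" falls below the threshold and "virus" is dropped because its vocabulary is unchecked, which mirrors the Windows-site example above.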
Ok, let's tag some content, already!
- Assuming you went with "Story" before, go to Create content >> Story and enter some text, such as the following, which I borrowed from Wikipedia's article on Nintendo DS. You'll notice that the 200 vocabularies created by the Calais module are thankfully hidden from the form.
- Once the content is posted, you'll notice a new "Calais" tab on the node and, if you chose one of the automatic options, that the node has been auto-tagged. You can see that it caught a number of things here, such as the names of companies (Nintendo and Nintendo of America), locations (Australia, Canada, Europe, Japan, and so on), and the general topic area (Technology).
- If you click the Calais tab, you'll see an interface like the following, which shows you the various topic areas you selected earlier, with the ones that it found terms for highlighted in green.
The terms are weighted based on how certain Calais is that the article is actually about that topic. Words like "Technology" and "Nintendo" are weighted highest, whereas other terms like "Canada" and "Australia" have much lower certainty. You can remove a tag by clicking on it to exclude it, and you can also add tags of your own to correct the information and help make Calais smarter. For example, I can turn off all of the country-related items and add a new term under the "Product" vocabulary for "Nintendo DS."
Note that you might also want to install something like Pathauto module to generate nice URLs for your taxonomy terms (ex: products/nintendo-ds rather than taxonomy/term/1234). And if you find Calais routinely getting confused by something, you can use the Calais Tag Modifier module (part of the main Calais package) to set up blacklists for tags (for example, don't count "Other" as a tag) and tag replacements (Color => Colour).
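The blacklist and replacement idea is simple enough to sketch in a few lines of Python. This is only an illustration of the concept, not the Calais Tag Modifier module's implementation; the `modify_tags` function is hypothetical, and matching terms case-insensitively is an assumption on my part.

```python
def modify_tags(terms, blacklist, replacements):
    """Drop blacklisted terms, apply replacements, and de-duplicate in order."""
    blocked = {term.lower() for term in blacklist}
    lookup = {old.lower(): new for old, new in replacements.items()}
    result = []
    seen = set()
    for term in terms:
        key = term.lower()
        if key in blocked:
            continue  # e.g. don't count "Other" as a tag
        term = lookup.get(key, term)  # e.g. Color => Colour
        if term.lower() not in seen:
            seen.add(term.lower())
            result.append(term)
    return result


print(modify_tags(
    ["Other", "Color", "Nintendo", "Colour"],
    blacklist=["Other"],
    replacements={"Color": "Colour"},
))
# prints ['Colour', 'Nintendo']
```

Note that the replacement happens before de-duplication, so "Color" and "Colour" collapse into a single tag rather than producing two.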
This is cool! What else can I do with it?
- Take advantage of Calais's Views module integration and create lists of Calais terms, lists of nodes filtered by Calais terms, and so on.
- Have loads of legacy content? Use the Calais Bulk Processing feature (admin/settings/calais/bulk-process) to back-tag it all.
- Use the Calais Geo module (part of the Calais module package) to geocode content according to Calais country, city, and state/province tags and show them on a map with GMap module.
- Use the More Like This module to build lists of related content both within and outside your website based on how Calais tags it. It can also pull in YouTube videos, Flickr photos, and more.
- Topic Hubs module can also be used to automatically aggregate content based on an expression that you create (for example, "All content that is about Nintendo -or- is about video games -and- the Wii"). The content can be displayed on a map, in a list, etc., and re-arranged with the Panels module.
- Have Calais Marmoset mark up your content with RDF that search engine crawlers can understand.
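For the curious, the kind of expression that Topic Hubs evaluates can be modelled as a small boolean tree over a node's tags. The Python sketch below is a guess at the concept rather than the module's real logic: the `matches` and `hub_content` functions, the `HUB` expression, and the `NODES` data are all hypothetical, and it assumes "-and-" binds tighter than "-or-" in the example expression.

```python
def matches(expr, tags):
    """Evaluate a nested ("and" | "or", operand, ...) expression against tags."""
    if isinstance(expr, str):
        return expr in tags
    operator, *operands = expr
    if operator == "and":
        return all(matches(op, tags) for op in operands)
    if operator == "or":
        return any(matches(op, tags) for op in operands)
    raise ValueError(f"unknown operator: {operator!r}")


# "About Nintendo -or- (about video games -and- the Wii)"
HUB = ("or", "Nintendo", ("and", "video games", "Wii"))

# Hypothetical nodes and the Calais terms applied to them.
NODES = {
    "node/1": {"Nintendo", "Japan"},
    "node/2": {"video games", "Wii"},
    "node/3": {"video games", "Sony"},
}


def hub_content(expr, nodes):
    """Return the IDs of all nodes whose tags satisfy the expression."""
    return sorted(node_id for node_id, tags in nodes.items() if matches(expr, tags))


print(hub_content(HUB, NODES))
# prints ['node/1', 'node/2']
```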
Awesome! I want to learn more! More, I say!
Great! Here are some of the resources used to write this article. Go nuts!
- How does Calais work? A conceptual overview.
- Using Intelligent Web Services for Semantic Drupal Sites (Drupalcon DC video): Includes lots of screenshots and examples from the Calais suite of modules by the author himself. Highly recommended.
- Calais module demo screencast (slightly outdated): Demonstrates Calais module in action.
- OpenPublish: A Drupal installation profile with the various modules already pre-configured.
- DBPedia, an example of an external data source with structured data from Wikipedia which your site can pull from.
- Resource Description Framework (RDF) module handbook
- Calais module handbook