by Angie Byron on April 9, 2009

Introduction to Calais

Calais? I get e-mail about that at least 12,238 times a day. Buzz off.

No, not Cialis, silly... Calais (pronounced cull-AY). It's a free (as in cost) natural language processing, rich semantic metadata, web service, uh... thingy. The video on the front page is jargon-tastic for those into that sort of thing. But to cut to the chase, it basically reads in text from your site and, based on the bazillions of other sites' text from other people using the service, it figures out some sensible tags for you automatically so your editors don't have to do it.

However, rather than just being a simple flat set of free tags, the tags are instead grouped into areas analogous to Drupal's taxonomy vocabularies which are associated with "Entities" (People, Companies, Cities, etc.), "Facts" (a person's Position or relationships between entities), and "Events" (Sporting, ManagementChange, etc.). The format passed back is in an open, semantic web-compatible format (Resource Description Framework or RDF) which then allows you to form intelligent relationships between articles based on the subject matter. This can be used for things like assisting with SEO, getting better search results, creating an "Other articles like this" block, pulling in external data from other sources that speak RDF, or whatever else you can imagine doing with this kind of information.
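To make the "relationships between articles" idea a bit more concrete, here is a minimal sketch in Python (the Calais module itself is PHP, so this is purely illustrative). It assumes each article's tags are already stored as a mapping from vocabulary name to a set of terms. The article titles, the sample tags, and the `tag_pairs` and `related_articles` helpers are all made up for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data: article title -> {vocabulary name -> set of terms}.
Tags = Dict[str, Set[str]]

ARTICLES: Dict[str, Tags] = {
    "Obama signs bill": {"Person": {"Barack Obama"}, "City": {"Washington"}},
    "Bush gives speech": {"Person": {"George W. Bush"}, "City": {"Washington"}},
    "Nintendo DS launch": {"Company": {"Nintendo"}, "Country": {"Japan"}},
}


def tag_pairs(tags: Tags) -> Set[Tuple[str, str]]:
    """Flatten {vocabulary: terms} into a set of (vocabulary, term) pairs."""
    return {(vocab, term) for vocab, terms in tags.items() for term in terms}


def related_articles(title: str, articles: Dict[str, Tags]) -> List[Tuple[str, int]]:
    """Rank the other articles by how many (vocabulary, term) pairs they share."""
    own = tag_pairs(articles[title])
    scored = []
    for other, tags in articles.items():
        if other == title:
            continue
        shared = len(own & tag_pairs(tags))
        if shared:
            scored.append((other, shared))
    # Most shared tags first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


print(related_articles("Obama signs bill", ARTICLES))
```

Matching on (vocabulary, term) pairs rather than bare terms means "Washington" the City would not be confused with a Person who happens to be named Washington, which is roughly the benefit of grouped vocabularies over flat free tags.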

How does it work?

To get an idea of how it works, chuck some text at http://viewer.opencalais.com/. For example, here's a Wikinews teaser about President Obama:

Calais Example

It correctly identifies the topic of the article as "Politics," "Washington,United States" as a "City" that the article is about, "Barack Obama" and "George W. Bush" as "Person" entities, and even finds a quotation by George W. Bush (though sadly, not as entertaining as some). Pretty nifty!

So, wanna wire this up with Drupal? Let's find out how!

Stuff you will need

  • Calais module: http://drupal.org/project/opencalais. I'm using 6.x-3.1
  • Resource Description Framework (RDF) module: http://drupal.org/project/rdf. I'm using 6.x-1.0-alpha7
  • ARC 2 RDF classes for PHP: http://arc.semsol.org/download. I'm using ARC 2 from 2009-03-05.
  • A Calais API key from http://opencalais.com/ (it's a Drupal site, so you shouldn't have any trouble ;))

Installation

  1. Download and extract the Calais and RDF modules into your modules directory, as per usual (typically, sites/all/modules/calais and sites/all/modules/rdf).
  2. Download and extract the ARC 2 library to a "vendor" folder in the RDF module's directory (sites/all/modules/rdf/vendor/arc).
  3. Enable the following modules:
    • Calais package
      • Calais
      • Calais API
    • RDF package
      • RDF

Configuration

  1. Navigate to Administer >> Site configuration >> Calais Configuration (admin/settings/calais) and enter your Calais API key.
  2. Click the Calais Node Settings tab (admin/settings/calais/calais-node). There are several collapsed fieldsets here: "Global" and another one for each content type on your system.
  3. Next, you have to turn on Calais processing on one or more of your content types. Expand one of the content type fieldsets (for example, "Story") and select the type of Calais processing that should be done: whether Calais should merely suggest terms (visible on a tab), or automatically apply the terms it discovers, either once when the content is first inserted, or each time the content is updated.

    You can also play with the Relevancy Threshold setting. This dictates at what confidence level a tag will get applied to the node, from 0% (tag it with whatever tags Calais comes up with) to 100% (only the tags it's really sure about).

    Calais node settings

  4. If you expand the "Global" fieldset, you'll also see a large list of the vocabularies that Calais knows about. Each of these maps to a Drupal taxonomy vocabulary, which you can view at Administer >> Content management >> Taxonomy (admin/content/taxonomy). These vocabularies will hold the tags that Calais discovers about your content in each area.

    You can check and uncheck options here; for example, if your site has a lot of Windows-related content, you might want to uncheck "MedicalCondition" and "MedicalTreatment" so that the words "virus" and "inoculate" are not misinterpreted.

    A full list of entities, along with examples, is available at the Calais documentation.

    Calais entities settings
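If the Relevancy Threshold setting is hard to picture, this small Python sketch shows the general idea of filtering suggested tags by a confidence cutoff. It is not the module's code: the `tags_to_apply` function, the 0.0 to 1.0 score scale, the inclusive comparison, and the sample scores are all assumptions made for illustration.

```python
from typing import Dict, List


def tags_to_apply(suggestions: Dict[str, float], threshold_percent: int) -> List[str]:
    """Keep suggested tags whose relevance meets the threshold.

    `suggestions` maps a tag to a relevance score between 0.0 and 1.0
    (an assumed representation, not the module's actual data structure).
    """
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    cutoff = threshold_percent / 100
    kept = [tag for tag, score in suggestions.items() if score >= cutoff]
    # Highest relevance first, so the most certain tags lead the list.
    return sorted(kept, key=lambda tag: -suggestions[tag])


# Made-up relevance scores for a handful of suggested tags.
suggested = {"Technology": 0.95, "Nintendo": 0.90, "Canada": 0.20, "Australia": 0.15}

print(tags_to_apply(suggested, 0))   # everything Calais came up with
print(tags_to_apply(suggested, 50))  # only the more certain tags
```

A threshold of 0 keeps all four tags, while 50 keeps only "Technology" and "Nintendo" in this made-up data.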

Ok, let's tag some content, already!

  1. Assuming you went with "Story" before, go ahead and go to Create content >> Story and enter in some text, such as the following which I borrowed from Wikipedia's article on Nintendo DS. You'll notice that the 200 vocabularies created by the Calais module are thankfully hidden from the form.

    Creating a story

  2. Once the content is posted, you'll notice that there is a new "Calais" tab on the node and, if you chose one of the automatic options, that it has been auto-tagged. You can see that it caught a number of things here, such as the names of companies (Nintendo and Nintendo of America), locations (Australia, Canada, Europe, Japan, and so on), and the general topic area (Technology).

    Viewing a story's tags

  3. If you click the Calais tab, you'll see an interface like the following, which shows you the various topic areas you selected earlier, with the ones that it found terms for highlighted in green.

    The terms are weighted based on how certain Calais is that the article is actually about that topic. Words like "Technology" and "Nintendo" are weighted highest, whereas other terms like "Canada" and "Australia" have much lower certainty. You can remove a tag by clicking on it to exclude it, and you can also add tags of your own to correct the information and help make Calais smarter. For example, I can turn off all of the country-related items and add a new term under the "Product" vocabulary for "Nintendo DS."

    Calais interface for story

Note that you might also want to install something like Pathauto module to generate nice URLs for your taxonomy terms (ex: products/nintendo-ds rather than taxonomy/term/1234). And if you find Calais routinely getting confused by something, you can use the Calais Tag Modifier module (part of the main Calais package) to set up blacklists for tags (for example, don't count "Other" as a tag) and tag replacements (Color => Colour).
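The blacklist-and-replacement idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Calais Tag Modifier module is implemented; the `modify_tags` function and its arguments are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def modify_tags(tags: Iterable[str], blacklist: Set[str], replacements: Dict[str, str]) -> List[str]:
    """Drop blacklisted tags, apply replacements, and remove duplicates."""
    result: List[str] = []
    seen: Set[str] = set()
    for tag in tags:
        if tag in blacklist:
            continue  # e.g. never count "Other" as a tag
        tag = replacements.get(tag, tag)  # e.g. Color => Colour
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result


print(modify_tags(["Other", "Color", "Nintendo", "Colour"], {"Other"}, {"Color": "Colour"}))
```

In this sketch the blacklist is checked before replacements are applied, so a replacement could in principle produce a blacklisted tag; whether that ordering is right for your site is a judgment call.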

This is cool! What else can I do with it?

  • Take advantage of Calais's Views module integration and create lists of Calais terms, lists of nodes filtered by Calais terms, and so on.
  • Have loads of legacy content? Use the Calais Bulk Processing feature (admin/settings/calais/bulk-process) to back-tag it all.
  • Use the Calais Geo module (part of the Calais module package) to geocode content according to Calais country, city, and state/province tags and show them on a map with GMap module.
  • Use the More Like This module to build lists of related content both within and outside your website based on how Calais tags it. It can also pull in YouTube videos, Flickr photos, and more.
  • Topic Hubs module can also be used to automatically aggregate content based on an expression that you create (for example, "All content that is about Nintendo -or- is about video games -and- the Wii."). Content can be displayed on a map, in a list, etc., and re-arranged with the Panels module.
  • Have Calais Marmoset mark up your content with RDF that search engine crawlers can understand.
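To give a feel for the kind of expression a topic hub is built on, here is a rough Python sketch that evaluates the Nintendo example from the list above against some tagged content. The helper names (`has`, `any_of`, `all_of`, `hub_contents`), the sample nodes, and the way the -or- and -and- are grouped are all assumptions; the Topic Hubs module has its own interface for this.

```python
from typing import Callable, Dict, List, Set

# A rule takes a node's set of tags and says whether the node belongs in the hub.
Rule = Callable[[Set[str]], bool]


def has(term: str) -> Rule:
    """Match nodes tagged with the given term."""
    return lambda tags: term in tags


def any_of(*rules: Rule) -> Rule:
    """Match when at least one of the rules matches (-or-)."""
    return lambda tags: any(rule(tags) for rule in rules)


def all_of(*rules: Rule) -> Rule:
    """Match only when every rule matches (-and-)."""
    return lambda tags: all(rule(tags) for rule in rules)


# Assumed grouping: Nintendo -or- (video games -and- the Wii).
hub_rule = any_of(has("Nintendo"), all_of(has("Video games"), has("Wii")))

# Made-up nodes and tags.
NODES: Dict[str, Set[str]] = {
    "DS review": {"Nintendo", "Technology"},
    "Wii Sports tips": {"Video games", "Wii"},
    "PC gaming news": {"Video games"},
}


def hub_contents(nodes: Dict[str, Set[str]], rule: Rule) -> List[str]:
    """Return the titles of all nodes that satisfy the rule, alphabetically."""
    return sorted(title for title, tags in nodes.items() if rule(tags))


print(hub_contents(NODES, hub_rule))
```

Here "PC gaming news" is left out because it is about video games but not the Wii, and it does not mention Nintendo.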

Awesome! I want to learn more! More, I say!

Great! Here are some of the resources used to write this article. Go nuts!

Comments

Frank Febbraro

Wow

Hey Angie,

Thanks so much for the writeup. You know, I should have gotten around to writing something like this, but now that I have read yours I know I could not have done much better. Thanks for the flurry of issues today too, the code is getting better just from you kicking the tires.

One other thing to note, as of release 6.x-3.1 (one of your screen shots was from version 3.0) there is now SemanticProxy integration. This will return to you the Calais metadata for content that lives at a URL and not on your site directly. This comes in VERY useful when you have RSS feeds that only provide a sentence or two of text in the body. Calais usually needs a bit more content, so the SemanticProxy integration will take either FeedAPI's original item URL, or any CCK Text or Link field (as long as it is a valid URL), grab the content at that location, parse out the real article (removing nav, ads, etc.), submit that to Calais for processing, and return the tags to you.

Thanks again, I'm going to add a link to this on the module page. Great resource.

angie

Yay!! :D

I was praying you'd say something like that and not "DEAR LORD. HOW CLUELESS COULD YOU POSSIBLY BE?" ;)

That SemanticProxy integration sounds awesome. Thanks for your kind comments! Also, thanks again for being a good sport (and so responsive!) in the issue queues today. :)

Jacob Redding

wonderful..

Thanks for taking the time to post this. This is a great introduction to the Calais service as well as how to use it in Drupal.

Moshe Weitzman

The Calais module is great. The quality of the tags is not so hot in my experience. It probably depends a lot on your content. My client's case was molecular biology - guess Calais ain't up to speed there.

AndyW

I've been playing around with Drupal and Calais recently here:

http://rss001.com/

It's great if you are interested in the semantic web as Calais is so easy to set up - whereas other RDF-based Drupal modules are not, although Arto Bendiken is doing something about that at the moment by writing some user guides.

snufkin

need input

Is it something similar to Yahoo's auto tagging feature? What sort of language processing are you doing on the server? Can it handle multiple languages?

Matthew

Automatic Linking?

Could you use something like this to automatically parse data and create relevant hyperlinks? For example, say it finds the term "Nintendo DS" in the article, and on your site a node exists titled "Nintendo DS", automatically creating a link to that node?

Grugnog

Despite the "open" in the "opencalais" url, the service itself is free as in beer, not (as far as I can tell) speech - either the underlying database or the analysis code. The Drupal module is free as in speech of course, and it's a great free service of course but I thought this was worth pointing out.

Krista

thank you!

Hi Angie -- thanks so much from all of us on the Calais team for the time and effort you put into this helpful piece. We have been all atwitter about it.

For 'need input' - we are strongest in English, but since Calais 4.0, we also cover French and we are now rapidly adding Spanish. Some five more languages are in the pipeline for next year.

Here is more info on how the service works: http://www.opencalais.com/about

snufkin

thanks

Thanks for the answer Krista!

Anonymous

And so does your mother and your sister. And both are lame, too. Thanks for the input.

Great read Angie! Thanks for the eyeopener about this stuff.

Evan Leeson

Bravo

Still the best bridge between the unintelligible and normal humans. You rock Angie.

Sunil

What about non english languages

I wonder whether this will work for non-English languages, for example any Indian language like Hindi or Malayalam.

If the answer is "YES" I want to try this on my site. :)

Regards,
-S-

Luis

No, I don't think it's working with other languages, take the phrase: "Bush performed his final official act"

According to Calais Viewer, you have:

a verb: perform
a relationsubject: George W. Bush
a relationobject: his final official act

I don't know about Indian languages, but Spanish uses lots of irregular verbs. In this case take:

"performed"

a computer can easily know we are talking about the 'past', just because we added 'ed' at the end of 'perform'

But take that to Spanish. In the present I would say something like "hace" while in the past I would say something like "hizo", and I even have another word to say the same thing, "realizó" (yes, using accents doesn't make it easier, does it? :)

I hope somebody can explain how to use RDF for other languages, but I just don't know if that's even possible. I just can't wait for a Lullabot tutorial on how to use the RDF module with Drupal.

domineaux

Open Calais - what about it REALLY

I posted this on Drupal Forums then I thought this might be a good place to ask.

Reading the Open Calais agreement it says you must keep their logo as is, and associated links must link to their home page.

http://drupal.org/project/opencalais

Now, I thought before I enabled this thing I should make a posting to get some input from Drupalers.

Most of us are interested in content, especially if it is easy to acquire that along with wider exposure to more of the web.

SEO is a big enough pain keeping your sites showing up as well as you can in the searches.

Open Calais reminds me of Zemanta, which is nothing but a hyperlink spam tool in disguise. I never quite got to the heart of the matter with Zemanta, but I did remove it from all my sites. Zemanta pokes all kinds of links into your site content when you use their tools that facilitate easy content development on your sites. If you look at your content in the Unfiltered HTML after applying some Zemanta, there is so much junk in the code it's a wonder it runs.

I'm not knocking anyone. I just like to know the facts before I eat the cheese, it may be rotten. LOL

After reading the Open Calais agreement I thought better of applying it on the site I'm working. I honestly don't know it will make that much difference, and it sure as heck can alter meta data on my site. Maybe I would get better exposure on the web or maybe not.

There is a paid service available, but I haven't reviewed it. It may not have all the linking requirements of use. I don't have a big problem with "A link", but a myriad of links and junk to other sites is nothing but a hyperlink spam tool disguised as an SEO and easy content builder, i.e., Zemanta.

This article is good at explaining how to install it, but doesn't address pervasive issues I'm talking about.

If you have experience with Open Calais and the associated modules, I would sure appreciate reading any comments or suggestions. Like everyone else I can always use help with SEO and content, so I would enjoy reading good reviews. However, I'm not into ignoring factual information or comments, good or bad.

There are some pretty good tools to help with SEO and metadata now in Drupal modules. The outside services with hooks and API seem to have their exploits.

Everything appears to be OK. The danger is always in the details. Cautiously I approach anything that sounds too good, and more so when it's free.

Max

Drilling into topics

At first I got really excited about the Calais module for Drupal, but now I wonder how exactly I could use it for my needs. The default vocabularies that are provided are too generic for my needs. Let's say we have a specific field like real estate (which probably falls under the Business vocabulary). I need to classify content within this field under different topics, i.e. Foreclosures, Financing, New Developments etc., and then create Topic Hubs around each topic. Can Calais help me with that? Is there a way to create custom vocabularies with terms and let Calais use them as well? I couldn't see that functionality out of the box, unless I totally missed it.

Thank you in advance!

Todd

Not really applicable to specialized blogs

Well, I gave this a good ol' college try -- and I'm less than impressed.

Background:
My site is for posting reviews of beers. Primarily by myself, though it is open for folks to register and submit their own.

I already have Vocabularies (populated with dozens and dozens of Terms), for Locations (i.e. Continent --> Country --> Region --> State/Province --> Brewery, for example "North American --> USA --> Western US --> California --> Ballast Point Brewing Co.")

I also have more specialized Vocabularies, in the form of [Beer] Styles (e.g. "Ale --> Pale Ale"), a Score Vocabulary (1-9), and a generic Tags Vocab (free-tagging for any other tidbits the reviewer wants to include).

Expectations:
I was hoping Calais would be able to help me dig through my content, to pick out tags that may not be obvious when looking at one Node after another. Maybe I've used the term "hoppy" enough times that it would be worth tagging, for example.

Installation:
Not so tough, though I did feel that I was entering Dependency Hell (ala RPM) with the various modules needing to be installed and manually entered in specific directories. But no major headaches.

Setup:
Not nearly so straightforward. Even after a few days (off-and-on), I've still not figured out what I'm doing re: RDF. What are RDF Mappings? Can I fill my custom Calais Repository with tags that I want to dig for, or not? And wtf is a Schema? And URI? And CURI? And, so on. And so on. Feels like this would be the best place to customize my tagging, but I'll be damned if I know how.

Actual Results:
I experimented for a while, and was not happy with either the Tags that it set to my Nodes, nor the fact that my Vocabs page was packed full of nonsense. For example, these are the *ONLY* Calais Entities:

Anniversary
CalaisDocumentCategory
City
Company
Continent
Country
Currency
EmailAddress
EntertainmentAwardEvent
EventsFacts
Facility
FaxNumber
Holiday
IndustryTerm
MarketIndex
MedicalCondition
MedicalTreatment
Movie
MusicAlbum
MusicGroup
NaturalDisaster
NaturalFeature
OperatingSystem
Organization
Person
PhoneNumber
Position
Product
ProgrammingLanguage
ProvinceOrState
PublishedMedium
RadioProgram
RadioStation
Region
SportsEvent
SportsGame
SportsLeague
TVShow
TVStation
Technology
URL

I suppose these would be fine for a newspaper site that covers a wide range of rather typical journalistic content. (Calais is a Reuters product, no less).

But even in the few cases where Calais found something in my Content that could conceivably work, it didn't do a very good job of it. "Vienna Lager" is not a City. Nor is "Buffalo chicken wings". Under the Technology Entity, Calais tagged about 40% of my Content with "Ale", and 20% with the tag "Carbonation".

I often mention music groups and/or sporting events in my reviews, as background info ("I had this beer while watching the World Series...", or "Was listening to the new Metallica album..."). Calais failed miserably here.

Conclusions:
My content is too specialized for the overly generic tagging that Calais seems to be able to do. Even if I did not already have my content Tagged with, for example, Country and State/Province data, I'm not convinced Calais would find enough useful metadata to be worth the trouble. In another example, Calais tagging my Content with "Ale" is hardly useful.

It is not a metadata spider or crawler for key words to tag. It instead tries to apply tags the best way it can, using generic tagging that would only really work with a news-related blog.

So, if you have a journalistic site, that is in need of rather generic tagging that you'll find on any other news-related site -- that would fit into the very strict Entities listed above -- Calais is for you.

But for a much more specialized Blog (be it for beer, collectible coffee mugs, or whatever else) -- and you don't have a firm enough grasp of RDF to tweak the inner-workings to shoehorn Calais to work with your site -- I just don't see Calais being worth the trouble.

Cheers,
//TB
