Lullabot Ideas

Drupal.org's search index and Zipf's Law

Article by Robert Douglass, March 17, 2007 - 12:48pm

In preparing for my talk about Drupal's search capabilities I've been looking at all sorts of interesting things. One of them is Zipf's Law, which predicts that "the frequency of any word is roughly inversely proportional to its rank in the frequency table". Since Drupal's HTML indexer normalizes the word scores on the assumption that this is true, I thought it would be interesting to see whether the words in Drupal.org's index actually fall into a Zipf-like distribution.

For example, in the Drupal.org search index, the word the is the most common (as it is in the English language in general), occurring 120,306 times out of 11,182,265 (about 1.1%)*. Zipf's law predicts that the second most common word (and, in the case of Drupal.org) will occur half as frequently. As the data below shows, this is not the case.

* By default, the HTML indexer drops words shorter than 3 letters, which means common English words such as a, it and an will not be indexed. Furthermore, the Drupal.org search index is full of terms (such as chx, killes and rtfm) which are not English words at all. Both of these factors influence the distribution.
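As a rough worked example, assuming the simplest form of Zipf's law with an exponent of 1 (the symbols f and r are introduced here for illustration only, with f(r) the count of the word at rank r), the prediction for the second-ranked word works out as:

```latex
f(r) \approx \frac{f(1)}{r}
\qquad\Longrightarrow\qquad
f(2) \approx \frac{120306}{2} = 60153
```

The index records 100,101 occurrences of the second-ranked word, roughly 1.66 times the predicted value.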

To generate the frequency data, I obtained a copy of Drupal.org's search_index table and ran the following queries to calculate the count and rank of the words:

SELECT @m:=0;
SELECT @m:=@m+1 AS rank, count(word) AS count FROM search_index GROUP BY word ORDER BY count DESC;
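For readers without a MySQL server at hand, the same count-and-rank idea can be sketched in Python with the standard sqlite3 module. This is only an illustration, not the method used for the data in this article: the toy rows and the function name rank_counts are hypothetical, and the ranks are numbered in Python after sorting rather than with a SQL user variable, because the order in which MySQL evaluates a user variable relative to ORDER BY should not be relied upon.

```python
import sqlite3

# Hypothetical toy rows standing in for Drupal.org's search_index table;
# only the `word` column matters for this query.
ROWS = [("the",), ("the",), ("the",), ("drupal",), ("drupal",), ("modul",)]


def rank_counts(rows):
    """Return (rank, count, word) tuples, most frequent word first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE search_index (word TEXT)")
    conn.executemany("INSERT INTO search_index (word) VALUES (?)", rows)
    cursor = conn.execute(
        "SELECT word, COUNT(word) AS n FROM search_index "
        "GROUP BY word ORDER BY n DESC, word ASC"
    )
    # Number the rows after sorting, so rank 1 is always the largest count.
    ranked = [(rank, n, word) for rank, (word, n) in enumerate(cursor, start=1)]
    conn.close()
    return ranked


if __name__ == "__main__":
    for rank, n, word in rank_counts(ROWS):
        print(rank, n, word)
```

The secondary sort on word only makes ties deterministic; it is a choice made for this sketch and is not part of the original query.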

I then used Grapher.app to generate a pure Zipf curve and to plot the points from Drupal.org's index. Here is the result:

Figure: Zipf's law and the words (frequency/rank) of the Drupal.org search index.

The X and Y axes are both logarithmic, which makes both curves appear more or less as straight lines. The slope of the Drupal.org line is not -1, and in fact it changes along the way. Nonetheless, the plot shows that the search index follows some sort of inverse rank distribution, and a Zipf-like curve is likely a close fit.
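One way to put a number on the slope is a least-squares fit of log(count) against log(rank). The sketch below is an assumption-laden illustration rather than the analysis behind the graph: it uses only the 20 counts from the top-words table in this article, so it describes the head of the distribution and says nothing about the remaining words. The name loglog_slope is hypothetical.

```python
import math

# Counts of the 20 most frequent words, copied from the article's top-words table.
TOP_COUNTS = [
    120306, 100101, 91714, 87627, 84658, 73740, 73558, 72421, 67653, 67338,
    65356, 58015, 57437, 52618, 49019, 48498, 47842, 47406, 47394, 46835,
]


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), with rank starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    print("slope over the top 20 words:", loglog_slope(TOP_COUNTS))
```

A pure Zipf curve gives a slope of -1 under this fit; for the top 20 words the fitted slope is noticeably shallower than that.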

The graph could be much nicer (for example by actually interpolating the Zipf equation on top of the Drupal.org data), but this was my first use of Grapher.app, and I still have many things to learn about it.

Top words on Drupal.org

Below is a list of the top 20 words (out of 276,046) in the search index on Drupal.org and their frequency.

Rank  Count   Word
1     120306  the
2     100101  and
3     91714   for
4     87627   drupal
5     84658   thi
6     73740   that
7     73558   with
8     72421   modul
9     67653   not
10    67338   have
11    65356   but
12    58015   you
13    57437   can
14    52618   page
15    49019   from
16    48498   work
17    47842   how
18    47406   user
19    47394   site
20    46835   there

Some of the words (thi, modul) show evidence of stemming via the Porter Stemmer module. Stemming is the practice of taking words like module and modules and reducing both of them to modul, the common stem that they share.
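To see how module, modules and this end up as modul and thi, here is a minimal sketch of just two steps of the original Porter algorithm (step 1a for plural endings and step 5a for a final e). It is not the Porter Stemmer module's code, it skips every other step of the algorithm, and all function names are hypothetical.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions (the measure 'm' in Porter's description)."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def ends_cvc(stem):
    """Porter's *o condition: ends consonant-vowel-consonant, last letter not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    return (
        is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def stem_partial(word):
    """Apply only steps 1a and 5a of the Porter algorithm to a lowercase word."""
    # Step 1a: plural endings (sses -> ss, ies -> i, ss -> ss, s -> nothing).
    if word.endswith("sses") or word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("ss"):
        pass
    elif word.endswith("s"):
        word = word[:-1]
    # Step 5a: drop a final 'e' if m > 1, or if m == 1 and the stem is not CVC.
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    return word


if __name__ == "__main__":
    for w in ["module", "modules", "this", "page", "site"]:
        print(w, "->", stem_partial(w))
```

The measure rule is what keeps short words such as page and site intact while module loses its final e, which matches the forms in the table above.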

Calling all math-heads!

Here is a zip file with the raw data and a basic Grapher.app file with Zipf's law. If you have skills in scientific data analysis, can make a better graph than I have, or want to explore Zipf's law further, please take the data and work your magic. Please report back here with whatever you find.

If you want to get started with Zipf's law in Grapher.app, here is the text that you can paste in order to get the basic equation:

Here is a LaTeX expression of the same:

Attachment: zipf.zip (2.68 MB)

Comments

March 18, 2007 - 7:11am Robert Douglass

Other areas where Zipf and Drupal meet

I should point out that Dries observed a Zipf pattern in the distribution of Drupal documentation and support requests.

Amy Stephen (not verified) on March 18, 2007 - 10:48pm

And a graph!

Robert - that is *thickly* geeky - and very cool. Sometimes, that stuff is very freaky when you start to see patterns emerge. Thanks for that.

About this 'bot

Robert Douglass

Robert Douglass studied information science at the University of Massachusetts, Lowell. While working for Hype.de and ABRACON.de he learned the art of building enterprise class web applications, serving clients such as...
