Drupal.org's search index and Zipf's Law
In preparing for my talk about Drupal's search capabilities I've been looking at all sorts of interesting things. One of them is Zipf's Law, which predicts that "the frequency of any word is roughly inversely proportional to its rank in the frequency table". Since Drupal's HTML indexer normalizes the word scores on the assumption that this is true, I thought it would be interesting to see whether the words in Drupal.org's index actually fall into a Zipf-like distribution.
For example, in the Drupal.org search index, the word *the* is the most common (as it is in the English language in general), occurring 120,306 times out of 11,182,265 (about 1.1%). Zipf's law predicts that the second most common word (*and* in the case of Drupal.org) will occur half as frequently. As the data below shows, this is not the case.
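To make the comparison concrete, here is a minimal Python sketch that sets the observed counts of the five most common words (taken from the top-words table in this post) against what a pure Zipf law would predict from the rank-1 count. The helper name `zipf_predicted` and the exponent `s = 1` are assumptions made for illustration; they are not part of Drupal's indexer.

```python
# Observed counts for the five most common words in the Drupal.org index,
# copied from the top-words table in this post.
observed = [
    ("the", 120306),
    ("and", 100101),
    ("for", 91714),
    ("drupal", 87627),
    ("thi", 84658),
]


def zipf_predicted(top_count, rank, s=1.0):
    """Count a pure Zipf law predicts at `rank`, given the rank-1 count.

    Hypothetical helper: assumes count(rank) = top_count / rank**s.
    """
    return top_count / rank ** s


top_count = observed[0][1]
for rank, (word, count) in enumerate(observed, start=1):
    predicted = zipf_predicted(top_count, rank)
    print(f"{rank:>2}  {word:<7} observed={count:>7}  "
          f"predicted={predicted:>8.0f}  ratio={count / predicted:.2f}")
```

At rank 2 the pure law predicts 60,153 occurrences, well short of the 100,101 actually recorded.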
To generate the frequency data, I obtained a copy of Drupal.org's search_index table and ran the following queries to calculate the count and rank of the words:
SELECT @m:=0;
SELECT @m:=@m+1 AS rank, count(word) AS count FROM search_index GROUP BY word ORDER BY count DESC;
I then used the application Grapher.app to generate a pure Zipf curve as well as map the points from Drupal.org's index. Here is the result:

Zipf's law and the words (frequency/rank) of the Drupal.org search index.
The X and Y axes are both logarithmic, which makes both curves appear more or less as straight lines. The slope of the Drupal.org line is not -1, and in fact it changes along the way. Nonetheless, the graph shows that the search index follows some sort of inverse rank distribution, and Zipf is likely a close fit.
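One way to put a number on the slope is a least-squares fit of log(count) against log(rank). The sketch below is only an illustration: it uses just the top 20 counts listed later in this post, so the full data set in the zip file would give a different estimate, and the function name `loglog_slope` is made up for this example.

```python
import math

# Top 20 counts from the table in this post, in rank order.
counts = [120306, 100101, 91714, 87627, 84658, 73740, 73558, 72421, 67653, 67338,
          65356, 58015, 57437, 52618, 49019, 48498, 47842, 47406, 47394, 46835]


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


print(f"slope over the top 20 ranks: {loglog_slope(counts):.2f}")
```

Fed a pure Zipf sequence with exponent 1 (counts proportional to 1/rank), the same function returns -1, which gives a baseline to compare the Drupal.org figure against.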
The graph could be much nicer (for example, by actually overlaying the Zipf equation on the Drupal.org data), but this was my first use of Grapher.app, and I still have many things to learn about it.
Top words on Drupal.org
Below is a list of the top 20 words (out of 276,046) in the search index on Drupal.org and their frequency.
| Rank | Count | Word |
|---|---|---|
| 1 | 120306 | the |
| 2 | 100101 | and |
| 3 | 91714 | for |
| 4 | 87627 | drupal |
| 5 | 84658 | thi |
| 6 | 73740 | that |
| 7 | 73558 | with |
| 8 | 72421 | modul |
| 9 | 67653 | not |
| 10 | 67338 | have |
| 11 | 65356 | but |
| 12 | 58015 | you |
| 13 | 57437 | can |
| 14 | 52618 | page |
| 15 | 49019 | from |
| 16 | 48498 | work |
| 17 | 47842 | how |
| 18 | 47406 | user |
| 19 | 47394 | site |
| 20 | 46835 | there |
Some of the words show evidence of stemming (via the Porter Stemmer module), which is the practice of reducing related words such as *module* and *modules* to the common stem they share, in this case *modul*. Likewise, *thi* in the table is the stem of *this*.
Calling all math-heads!
Here is a zip file with the raw data and a basic Grapher.app file with Zipf's law. If you have skills in scientific data analysis, can make a better graph than I have, or want to explore Zipf's law further, please take the data and work your magic. Please report back here with whatever you find.
If you want to get started with Zipf's law in Grapher.app, here is the text that you can paste in to get the basic equation:
f({k;N,s})=|_frac_{{|_frac_{{1};{k^{s}}}};{|_sum_{{n=1};{N};{|_frac_{{1};{n^{s}}}}}}}
Here is a LaTeX expression of the same:
f\left( k;N,s \right)=\frac{\frac{1}{k^{s}}}{\sum_{n=1}^{N}{\frac{1}{n^{s}}}}
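For anyone who would rather experiment in code than in Grapher.app, the same expression can be sketched in a few lines of Python. The function name `zipf_pmf` is an assumption for this example; `k`, `N`, and `s` carry the same meaning as in the formula above, and `N` is set to the 276,046 words mentioned earlier.

```python
def zipf_pmf(k, N, s=1.0):
    """Relative frequency Zipf's law assigns to rank k among N ranked words."""
    if not 1 <= k <= N:
        raise ValueError("k must satisfy 1 <= k <= N")
    # Denominator of the formula: sum of 1/n^s for n = 1..N.
    normalizer = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normalizer


N = 276046  # number of words in the Drupal.org index, per this post
for k in (1, 2, 10):
    print(k, zipf_pmf(k, N))
```

Multiplying the result by the total number of word occurrences gives a predicted count that can be compared directly with the raw data.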
| Attachment | Size |
|---|---|
| zipf.zip | 2.68 MB |



Comments
Other areas where Zipf and Drupal meet
I should point out that Dries observed a Zipf pattern in the distribution of Drupal documentation and support requests.
And a graph!
Robert - that is *thickly* geeky - and very cool. Sometimes, that stuff is very freaky when you start to see patterns emerge. Thanks for that.