by Robert Douglass on March 29, 2007

Drupal's search module and scoring factors

This article applies to Drupal 5.x.

In this article I will show how the results of the search module can be fine-tuned using controls available to Drupal site administrators. The search module's configuration options include up to four extra parameters, called scoring factors, for weighting search results based on keyword relevance, recency, number of comments, and number of page views. Adjusting these values can dramatically alter, and improve, the order of search results. We will then add a theme function that displays the score of each themed search item. Finally, we will extend the advanced search form to include the scoring factor controls so that every search can be custom-tailored with regard to the scoring algorithm.

Scoring factors

Four types of scoring factors are available to Drupal administrators:

  • relevance of keyword
  • recency (created, changed, last comment)
  • number of comments (if comment module is turned on)
  • number of page views (if statistics module is turned on AND if Count content views is enabled. See admin/logs/settings)

The search module's scoring factors (admin/settings/search)

If you don't see the page views scoring factor, it means you don't have the statistics module enabled and configured properly. Enable the statistics module and make sure that Count content views is also enabled.

The statistics module needs to be enabled and Count content views turned on in order for the page view scoring factor to work.

The weights given to each scoring factor have a profound effect on the order of search results, and it is well worth your while testing different values in order to achieve the best possible search result ranking. The scoring factors can be changed at any time and take effect immediately. There is no need to re-index your site.

Four different nodes

In order to demonstrate the effect of scoring factors on scoring, I have created four nodes, each of which scores especially high with one scoring factor. The first node has the word Drupal in both the Title and the Body. Since the Title field gets extra weight (due to being wrapped in an <h1> tag), and since Drupal appears twice in the node, this node will score very high for the keyword relevance scoring factor when searching for Drupal.

The second node contains the word Drupal in the Body, and also has a comment. As it is the only node that has a comment, it will score the highest for the comment count scoring factor.

The third node has been viewed 50 times, whereas the others have each been viewed only once. Node #3 will score the highest for the page view scoring factor.

Finally, the fourth node is the newest, being created after all of the others. Thus, node #4 will score highest for the recency scoring factor.

In summary, there are four nodes, each of which is designed to have a special advantage over the others in one scoring factor.

With four nodes and the default scoring factor weights, searching for "Drupal" favors the node with a comment over the others.

Displaying the score

In order to better observe how search results are ranked, we will now override the theme_search_item function and extend it to output each search item's score. Seeing the scores of items and watching them change in response to various score factor weights will help you decide which settings are optimal for your site.

To display the score on each themed search item, add this function to the template.php file in your theme's directory. If you are using the Garland theme, for example, this function should be added to /themes/garland/template.php.

/**
 * Format a single result entry of a search query. This function is normally
 * called by theme_search_page() or hook_search_page().
 *
 * @param $item
 *   A single search result as returned by hook_search(). The result should be
 *   an array with keys "link", "title", "type", "user", "date", and "snippet".
 *   Optionally, "extra" can be an array of extra info to show along with the
 *   result.
 * @param $type
 *   The type of item found, such as "user" or "node".
 *
 * @ingroup themeable
 */
function phptemplate_search_item($item, $type) {
  $output = ' <dt class="title"><a href="'. check_url($item['link']) .'">'. check_plain($item['title']) .'</a></dt>';
  $info = array();
  if ($item['type']) {
    $info[] = $item['type'];
  }
  if ($item['user']) {
    $info[] = $item['user'];
  }
  if ($item['date']) {
    $info[] = format_date($item['date'], 'small');
  }
  if (is_array($item['extra'])) {
    $info = array_merge($info, $item['extra']);
  }

  // Add the score to the list of items displayed in search results
  $info[] = $item['score'];

  $output .= ' <dd>'. ($item['snippet'] ? '<p>'. $item['snippet'] .'</p>' : '') .'<p class="search-info">'. implode(' - ', $info) .'</p></dd>';
  return $output;
}

Two lines have been added to the theme_search_item function.

Now when you search, each search result will display its score. Here is the search results page for a search on Drupal with the four nodes I have created and default values for all of the score factors.

Overriding theme_search_item allows us to see how each node has scored in the ranking algorithm.

Boosting keyword relevancy

When looking at the search results for Drupal using the default scoring factors, it is noteworthy that node #1 ranks only second, even though it has the word Drupal in both the title and the body. While this guarantees that node #1 will score highest in the keyword relevancy factor, it seems that overall, the comment count factor (or some other aspect of the scoring algorithm) favors comments more than keywords. Let's boost the keyword relevancy scoring factor by +2 and repeat the search.

By boosting the keyword relevancy scoring factor, the node with Drupal in the title now ranks first in the results.
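To make the effect of a weight change concrete, here is a minimal sketch of how weighted factors could combine into one score. It assumes each factor value is already normalized to roughly the 0..1 range and that the final score is the weighted sum divided by the total weight; the function name `combined_score()` and the factor values for the two nodes are hypothetical and not taken from Drupal core or from a real site.

```php
<?php
// Hypothetical sketch: combine per-factor values into a single score.
// Assumes every factor value is normalized to roughly 0..1.
function combined_score(array $factors, array $weights) {
  $sum = 0.0;
  $total = 0;
  foreach ($weights as $name => $weight) {
    // A weight of 0 switches the factor off entirely.
    if ($weight > 0 && isset($factors[$name])) {
      $sum += $weight * $factors[$name];
      $total += $weight;
    }
  }
  // Divide by the total weight so scores stay comparable across settings.
  return $total > 0 ? $sum / $total : 0.0;
}

// Made-up factor values for two nodes (illustration only).
$node1 = array('relevance' => 0.30, 'comments' => 0.0); // keyword in title and body
$node2 = array('relevance' => 0.15, 'comments' => 0.2); // keyword in body, one comment

$default = array('relevance' => 5, 'comments' => 5);
$boosted = array('relevance' => 7, 'comments' => 5);

printf("default: node1=%.4f node2=%.4f\n", combined_score($node1, $default), combined_score($node2, $default));
printf("boosted: node1=%.4f node2=%.4f\n", combined_score($node1, $boosted), combined_score($node2, $boosted));
```

With these made-up values the commented node wins under equal weights, and the node with the keyword in its title wins once the relevance weight is raised from 5 to 7, which mirrors the reordering described in this section.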

Adding the scoring factor widget to advanced search

Drupal's advanced search feature lets you construct many specific and interesting search queries. You can, for example, search for all Page nodes that have the taxonomy term Politics but not the word Bush. This is one realm where Drupal consistently beats the search results delivered by external search engines such as Yahoo! or Google. Drupal simply knows more about its own content and is thus more capable of searching through it in a structured manner.

What Drupal doesn't give the person searching is any option for how to sort or score the search results. Since the score factor weights are only used during the actual searching, and not during indexing, there is nothing stopping us from applying custom factor weights to every search. We will now add the score factor weight controls currently found in the search administration section to the advanced search form so that any user can tweak the weights to get the search results they are most interested in.

The node module uses the HTML Analyzer and Indexer provided by the search module to implement Drupal content searches. The node module adds the advanced search form to the basic search form in its implementation of hook_form_alter. Thus we turn to node_form_alter to add the score factor controls to the advanced search form.

<?php
// Grab the administration form from node_search
$factors = node_search('admin');

// Get rid of the help text because it takes up too much space
unset($factors['content_ranking']['info']);

// Get rid of the fieldset
$form['advanced']['factors'] = $factors['content_ranking']['factors'];

// Wrap the form elements in a div to hold them together.
$form['advanced']['factors']['#prefix'] = '<div class="criterion">';
$form['advanced']['factors']['#suffix'] = '</div>';
?>

Code added to node_form_alter to add scoring factor controls to advanced search.

The node module handles the validation of the advanced search form in the node_search_validate function. This is where all of the various conditions, such as taxonomy terms, node types and NOT keywords are turned into a keyword query that is usable by the search module. We will extend node_search_validate to also store information about the user's scoring factor preferences in the session.

<?php
// Store each scoring factor weight submitted with the advanced search form.
if (isset($form_values['node_rank_comments'])) {
  $_SESSION['node_rank_comments'] = $form_values['node_rank_comments'];
}
if (isset($form_values['node_rank_recent'])) {
  $_SESSION['node_rank_recent'] = $form_values['node_rank_recent'];
}
if (isset($form_values['node_rank_relevance'])) {
  $_SESSION['node_rank_relevance'] = $form_values['node_rank_relevance'];
}
if (isset($form_values['node_rank_views'])) {
  $_SESSION['node_rank_views'] = $form_values['node_rank_views'];
}
?>

Code added to node_search_validate to store scoring factor preferences during searching.
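The same bookkeeping can be written as a loop over the four factor names. The following is only a sketch: the helper names `rank_factor_keys()` and `store_rank_preferences()` are hypothetical, the session store is passed in as a plain array so the code runs outside Drupal, and the integer cast is an extra precaution that the code above does not perform.

```php
<?php
// Names of the four scoring factor form elements used in this article.
function rank_factor_keys() {
  return array('node_rank_relevance', 'node_rank_recent', 'node_rank_comments', 'node_rank_views');
}

// Copy any submitted factor weights into the session store.
// $session stands in for $_SESSION and is passed by reference.
function store_rank_preferences(array $form_values, array &$session) {
  foreach (rank_factor_keys() as $key) {
    if (isset($form_values[$key])) {
      // Cast to int (an assumption of this sketch) so only numbers are stored.
      $session[$key] = (int) $form_values[$key];
    }
  }
}

// Usage: only the recognized factor key is copied; other form values are ignored.
$session = array();
store_rank_preferences(array('node_rank_relevance' => '7', 'keys' => 'Drupal'), $session);
print_r($session);
```

Keeping the key names in one list means each check and its assignment cannot drift apart, which is an easy mistake to make when the four blocks are written out by hand.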

The need to store these preferences stems from the fact that the search module accepts a POST request from the search form and then redirects, resulting in a GET request with the keyword query in the URL. It is on this second GET request that the search is actually executed. The POST-to-GET redirect enables bookmarking of searches and is one of Drupal's nice features. It means, however, that the POST values for the scoring factors are not available at the time the search query is built. The solution chosen here is to put them into the $_SESSION variable until they are used, at which point they are removed from the $_SESSION. The alternative would have been to make them actual search query terms, as is done with all of the other advanced search form elements. That option resulted in long search queries. The merits of both approaches can be discussed further, but the approach using the $_SESSION is the one being used for this article.
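For comparison, the query-term alternative mentioned above could look roughly like the following sketch. The helpers `rank_query_append()` and `rank_query_extract()` are hypothetical and written only for illustration; they are not functions from Drupal core, and the `name:value` syntax is an assumption modeled on how other advanced search options appear in the keyword string.

```php
<?php
// Hypothetical helper: append "name:value" options to a keyword string.
function rank_query_append($keys, array $weights) {
  foreach ($weights as $name => $weight) {
    $keys .= ' ' . $name . ':' . (int) $weight;
  }
  return trim($keys);
}

// Hypothetical helper: pull the options back out of the keyword string.
// Returns array(remaining keywords, weights found).
function rank_query_extract($keys, array $names) {
  $weights = array();
  foreach ($names as $name) {
    $pattern = '/(^| )' . preg_quote($name, '/') . ':(\d+)(?= |$)/';
    if (preg_match($pattern, $keys, $matches)) {
      $weights[$name] = (int) $matches[2];
      $keys = preg_replace($pattern, '', $keys, 1);
    }
  }
  return array(trim($keys), $weights);
}

// Usage: round-trip two weights through the keyword string.
$query = rank_query_append('Drupal', array('node_rank_relevance' => 7, 'node_rank_recent' => 0));
echo $query, "\n"; // Drupal node_rank_relevance:7 node_rank_recent:0
list($keywords, $weights) = rank_query_extract($query, array('node_rank_relevance', 'node_rank_recent'));
```

The trade-off is visible in the output: the weights survive bookmarking because they travel in the URL, but every factor adds another term to the query.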

Upon the GET redirect, the node module builds a specific search query in node_search. Here is a sample of the code from that function that makes use of the scoring factor values stored in the $_SESSION.

<?php
$weight = $_SESSION['node_rank_relevance'];
unset($_SESSION['node_rank_relevance']);
$weight = empty($weight) ? (int)variable_get('node_rank_relevance', 5) : $weight;
if ($weight) {
  // Average relevance values hover around 0.15
  $ranking[] = '%d * i.relevance';
  $arguments2[] = $weight;
  $total += $weight;
}
?>

Code from node_search which takes $weight first from the $_SESSION, and otherwise from the default variable_get().

In the code above, $weight is the scoring factor. It is first taken from the session variable. If that has not been set, then the traditional value is taken from variable_get(). The weight is then used to construct a SQL snippet which is used in the final search query.
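One detail of this pattern is worth a closer look: PHP's empty() treats 0 as empty, so a weight of 0 stored in the session falls through to the default. The sketch below shows a read-once helper that checks with isset() instead, so an explicit 0 is honored. The function name `take_session_weight()` is hypothetical, and the session store and the default are passed in as parameters (standing in for $_SESSION and variable_get()) so that the code runs outside Drupal.

```php
<?php
// Read a weight from the session store once, then fall back to a default.
// $session stands in for $_SESSION; $default stands in for the value that
// variable_get() would return.
function take_session_weight(array &$session, $key, $default) {
  if (isset($session[$key]) && is_numeric($session[$key])) {
    $weight = (int) $session[$key];
    // Remove the value so it only applies to this one search.
    unset($session[$key]);
    return $weight;
  }
  return (int) $default;
}

// Usage: an explicit 0 is returned as 0, and the second call sees the default.
$session = array('node_rank_relevance' => 0);
$first = take_session_weight($session, 'node_rank_relevance', 5);
$second = take_session_weight($session, 'node_rank_relevance', 5);
```

Here `$first` is 0 because the user's choice is respected, and `$second` is 5 because the stored value was consumed by the first call.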

The patch containing all of the code for this feature is attached. It applies to Drupal 5.1.

The advanced search form with the scoring factor controls added.

One goal of this article is to encourage Drupal administrators to experiment with the scoring factor controls. It would be interesting to hear from others which combination of values works best. Another goal of the article is to introduce the idea of having the scoring factor controls present in the advanced search form. Feedback on this idea, its implementation, and the results is very welcome. Drupal's built-in search module has a lot of potential, but some configuration may be needed before it returns optimal results.

Robert Douglass

Comments

Budda

Expanding search weighting

I've had a request regarding the search system a couple of times now and wondered how best to implement it within Drupal.

How would you go about giving more weight to a content type which is marked as more important, say an intranet's department home page?

Would using taxonomy suffice? Say a content type is assigned the term "home page" and this term gets a priority weight value in the search. If so, is there any way to shoehorn this into the existing Drupal node search code as it stands?

Maybe I'm asking in the wrong place...

Reply

robert

That would be a great feature

The way to do it would be to look at node_search ($op = 'search') in node.module and see how there is a series of scoring factor adjustments that are made based on the things I discussed in this article. The quick-and-dirty way would be to add another one of those blocks and add weight based on content type. The better way would be to rip that whole block of code out and build a hook or plugin system for it so that scoring factors could be contributed or modified outside of core. Here's an example of what I mean:

<?php
if (module_exists('comment')) {
  $weight = $_SESSION['node_rank_comments'];
  unset($_SESSION['node_rank_comments']);
  $weight = empty($weight) ? (int)variable_get('node_rank_comments', 5) : $weight;
  if ($weight) {
    // Inverse law that maps the highest reply count on the site to 1 and 0 to 0.
    $scale = variable_get('node_cron_comments_scale', 0.0);
    $ranking[] = '%d * (2.0 - 2.0 / (1.0 + c.comment_count * %f))';
    $arguments2[] = $weight;
    $arguments2[] = $scale;
    if (!$stats_join) {
      $join2 .= ' LEFT JOIN {node_comment_statistics} c ON c.nid = i.sid';
    }
    $total += $weight;
  }
}
?>

The first indication of something fishy here is that we have to use if (module_exists()). This is already a sign that a hook system might be a better deal. The real problem is the SQL. What's being built is some complicated SQL that will proceed to build two temporary tables, the second a subset of the first, and then make the final select from the second. Knowing how to build the SQL for these scoring factors is very complicated and not well documented. Finding a way to make this whole sub-system intuitive for developers and easier to extend would be a huge win for Drupal.
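To see what the SQL expression in that snippet computes, the same formula can be evaluated in plain PHP. This is a sketch under one assumption: that the scale is the reciprocal of the highest comment count on the site, which is what the "maps the highest reply count to 1" comment implies. The function name `comment_factor()` and the maximum of 40 comments are hypothetical.

```php
<?php
// Value of the comment factor for one node, following the expression
// 2.0 - 2.0 / (1.0 + comment_count * scale) from the snippet above.
function comment_factor($comment_count, $scale) {
  return 2.0 - 2.0 / (1.0 + $comment_count * $scale);
}

// Assumption: the scale is 1 / (highest comment count on the site).
$max_comments = 40; // hypothetical site maximum
$scale = 1.0 / $max_comments;

foreach (array(0, 10, 40) as $count) {
  printf("%2d comments -> %.2f\n", $count, comment_factor($count, $scale));
}
```

Under that assumption a node with no comments gets 0 and the most-commented node gets 1, while a node with a quarter of the maximum already reaches 0.4, so the first few comments count for more than later ones.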

Reply

tanoshimi

I had this same question and got round it by installing the 'weight' module. Applying a different weight to each of my content types allowed me to choose the order in which they would appear in the search results page (and in other views, such as via the taxonomy menu), which was exactly the behaviour I wanted.... maybe this would help you too?

Thanks for the great article, by the way, and love the podcasts!

Reply

Pixelstyle

I installed the weight module, assigned weight to a node-type and some of its nodes, but in the search results the nodes still turn up lower than other nodes.

Should I disable the weighting options on the search settings page? Or are there other things I should change for this to work?

Reply

robert

No easy solution

Unfortunately, the core Drupal search doesn't yet support adding your own custom scoring factors. Doug Green's views_fastsearch module does, and you can emulate core Drupal search with that by making a view of all nodes and exposing the fastsearch filter. Then, either the weight module has views integration that you can use to affect ordering, or you can write a very simple custom scoring factor for the fastsearch (or get someone like Doug Green to do it for you). The new fuzzysearch module also supports custom scoring factors. You might try it.

Reply

dami

Great article, thanks!
I am wondering whether there is a way to control what (or which part) of the content is indexed. It looks like Drupal's search.module always does a full-text index on all content types. I'd like to be able to, e.g.:
1. Prevent indexing on certain content types.
2. Index only node title or teaser, not full text
....

Reply

Anonymous

I'd like "most recent posts" to show up first, but I can't get it!

I want my search to do something fairly simple - show the most recent posts first. Sounds easy, right?

So I set the scoring Keyword relevance = 5 and Recently posted = 10 - everything else gets zero because comments and hits have no relevance when I want the most recently posted first.

Still, the results mix entries from 2002 with entries from 2006 and 2004 quite randomly, and it seems it doesn't find anything from 2007.

Anyone have any ideas as to why the search does this or could do a tutorial on how to get the most recent first?

I've asked in drupal forums, they were of no help there, though they tried. :/ They just suggested I use views, but I'm using the Category module - not taxonomy so views is out of the question.

You'd think a "most recent posts first" in the search results would be pretty simple to get but it's impossible. Shame.

Reply

Anonymous

ps - interestingly enough, the most recent posts end up last in the search results. Is this a known bug?

Reply

robert

Have you filed a bug report or support request in the issue queue on Drupal.org? That would be the appropriate thing to do at this point. Then you can post the link to the issue here so that we can track the issue with you.

Reply

Olle

Partial word search

doesn't seem to work. Try a search for "drup" and you won't get any matches for "drupal". That's a serious limitation.

Reply

Kuba

re

Thanks very much man! I'm just beginning my adventure as an amateur site admin and these here tips of yours are like hot man! :D I'm sure it'll come in handy in my future business with site administration. I'll be glad if you post some more info :)

Reply

yan

What I was looking for

Great! That's what I was looking for. I think it should be committed to core. An even greater feature would be having an option to change the sort order after having done a search (without entering the options again).

Solidarity

Reply

yan

"updated" vs. "created" date

As I already said, this is a great feature. I'm experiencing one problem though: If the results are ordered by date, the "updated" date seems to be used, not the "created" date. Or am I doing something wrong?

Reply

Anonymous

Drupal search is broken, and nobody wants to fix it

I have the same problem. Drupal Search is driving me insane. I find that people in the Drupal forums have the same issues, but nobody seems to know how to solve them, and if there are replies they don't seem to understand the original poster's problem. The main things that I find really wrong with Drupal search:

"Update" date is returned in search not "Created" date. Why? Can this be changed?

Search results return OLDEST POSTS FIRST not most recently posted. That's just wrong.

Are there other options, standalone search engines that you Lullabot folks might recommend instead of the clearly broken Drupal search?

Reply

robert

Solr

There is a Solr project on drupal.org. I haven't tested it, but this could be a solution. Blake Lucchesi's Summer of Code project will also be of interest. It is called fuzzysearch and it is a complete reimplementation of the search index. It isn't finished yet, but is far enough along for early adopters to start poking it.

Reply

nieruchomosci

Hmmm

I think that Drupal uses too much CPU. Some hosting providers block accounts because of the CPU usage.

krakow

Reply

John Blue

search, advanced search, and CCK created nodes/fields

Any thoughts, references, or leads on search/advanced search with respect to searching fields created via CCK?

I would like an advanced search page that allows searching fields created via CCK. Specifically, I have a custom content type with numeric value fields and date fields. I'd like the Advanced Search to search on ranges of values / dates and return only items from that content type that are within the ranges specified.

Thanks in advance,
John Blue

Reply

Anonymous

To fix the partial word problem in Drupal 4.7 you can add this module: http://drupal.org/project/porterstemmer It reduces each word in the index to its basic root or stem (e.g. 'blogging' to 'blog') so that variations on a word ('blogs', 'blogger', 'blogging', 'blog') are considered equivalent when searching. Which frankly, should be in the search by default.

Drupal search is broken from the start.

Reply

Dirk Gebhardt

Drupal search is not working properly

Dear Robert,

I hope you remember me. We met at FrOSCon in Bonn, I believe. I gave a lecture about the website www.freelens.com, which was built with Drupal. Now I have a problem with the Drupal search. The system is running on 5.3, MySQL 5 and PHP 5.

The problem is that Drupal does not seem to index any words from the node body. It works only for titles, and that's it. If I try to find any word in a node body which is not mentioned in the title, I get no search results.

So in my case that means at least 3 million words are not indexed. I have read some of the threads on drupal.org about this, but I found no convenient solution.

Do you have any idea?

Warm regards from Cologne.

Dirk

Reply

robert

Don't know without being able to look

Dirk,

I'd have to do some exploration before I could analyze the problem. I'd need to look at the index and watch what happens when cron runs. Are there any errors in your PHP logs from when cron runs?

Reply

klimatyzator

Thanks

Thanks for this article. I searched for many weeks, and now I have found this information here.
Thanks for help!

Reply

sign

Thanks for this nice article,
I was playing with that code for a while, and it wasn't working for me.

The problem is that if your search settings have, say, keyword relevance = 5 and you then change it to 0 in the advanced search, the default search settings are always used.

Because $_SESSION will hold 0, the code below calls variable_get() instead of using zero, since empty(0) is TRUE.

<?php
$weight = empty($weight) ? (int)variable_get('node_rank_relevance', 5) : $weight;
?>

So it's worth saying that it's good to set all factors to zero in your search settings. I hope it wasn't said elsewhere.

Reply

PeterZ

Managing search order of display

This is great functionality to help manage the search order of display. Some additional factors available include:

In Drupal 6, the Search Ranking module adds additional search factors:
- Relevance (keyword relevancy score)
- Sticky
- Promoted
- Recency (time posted)
- Comment (number comments)
- Statistics (number visits)
- Incoming Links (number of other nodes linking to a node increases score)

In Drupal 5, the Views Fast Search Module with its Views Fast Search Node Type Ranking adds an additional scoring factor (which is not available for Drupal 6) for:
- node type

Reply

Anjali

I recently used Drupal for one of my sites but got stuck with the search module.

I am facing the same problem as Dirk:

>>The problem is that Drupal does not seem to index any words from the node body. It works only for titles, and that's it. If I try to find any word in a node body which is not mentioned in the title, I get no search results.

Please help me out if there is any solution for this.

Reply

Miereneuker

please correct and delete this comment

Code added to node_search_validate to store scoring factor preferences during searhing.

should be

Code added to node_search_validate to store scoring factor preferences during searching.

Reply