Drupal's search module and scoring factors

Ranking search results

This article applies to Drupal 5.x.

In this article I will show how the results of the search module can be fine-tuned using controls available to Drupal site administrators. The search module's configuration options include up to four extra parameters, called scoring factors, for weighting search results based on keyword relevance, recency, number of comments, and number of page views. I will show that adjusting these values can dramatically alter and improve the order of search results. We will then add a theme function that displays the score of each themed search item. Finally, we will extend the advanced search form to include the scoring factor controls so that the scoring algorithm can be tailored for every search.

Scoring factors

Four types of scoring factors are available to Drupal administrators:

  • relevance of keyword
  • recency (created, changed, last comment)
  • number of comments (if the comment module is turned on)
  • number of page views (if the statistics module is turned on and Count content views is enabled; see admin/logs/settings)








Drupal's search administration interface has controls for the scoring factors

The search module's scoring factors (admin/settings/search)

If you don't see the page views scoring factor, it means you don't have the statistics module enabled and configured properly. Enable the statistics module and make sure that Count content views is also enabled.








The Drupal statistics module can influence search results

The statistics module needs to be enabled and Count content views turned on in order for the page view scoring factor to work.

The weights given to each scoring factor have a profound effect on the order of search results, and it is well worth your while testing different values in order to achieve the best possible search result ranking. The scoring factors can be changed at any time and take effect immediately. There is no need to re-index your site.

Four different nodes

In order to demonstrate the effect of scoring factors on scoring, I have created four nodes, each of which scores especially high on one scoring factor. The first node has the word Drupal in both the Title and the Body. Since the Title field gets extra weight (it is wrapped in an <h1> tag), and since Drupal appears twice in the node, this node will score very high on the keyword relevance scoring factor when searching for Drupal.

The second node contains the word Drupal in the Body, and also has a comment. As it is the only node that has a comment, it will score the highest for the comment count scoring factor.

The third node has been viewed 50 times, whereas the others have each been viewed only once. Node #3 will score the highest for the page view scoring factor.

Finally, the fourth node is the newest, being created after all of the others. Thus, node #4 will score highest for the recency scoring factor.

In summary, there are four nodes, each of which is designed to have a special advantage over the others in one scoring factor.
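
To see why the weights can reorder results, it helps to think of the final score as a weighted average of the individual factor values. The sketch below is a simplification under that assumption. The factor values for the four nodes are invented for illustration and are not what Drupal computes, and sketch_combined_score() and sketch_rank() are hypothetical names.

```php
<?php
// Invented factor values in the range 0..1 for the four test nodes.
// They only mimic "each node is strongest in one factor".
$nodes = array(
  'node 1' => array('relevance' => 1.0, 'recent' => 0.5, 'comments' => 0.0, 'views' => 0.1),
  'node 2' => array('relevance' => 0.5, 'recent' => 0.5, 'comments' => 0.8, 'views' => 0.1),
  'node 3' => array('relevance' => 0.5, 'recent' => 0.5, 'comments' => 0.0, 'views' => 0.5),
  'node 4' => array('relevance' => 0.5, 'recent' => 0.8, 'comments' => 0.0, 'views' => 0.1),
);

/**
 * Weighted average of the factor values (assumed scoring model).
 */
function sketch_combined_score($factors, $weights) {
  $sum = 0;
  $total = 0;
  foreach ($weights as $name => $weight) {
    $sum += $weight * $factors[$name];
    $total += $weight;
  }
  // With every weight at zero there is nothing to rank by.
  return $total > 0 ? $sum / $total : 0;
}

/**
 * Return the node labels ordered from highest to lowest score.
 */
function sketch_rank($nodes, $weights) {
  $scores = array();
  foreach ($nodes as $label => $factors) {
    $scores[$label] = sketch_combined_score($factors, $weights);
  }
  arsort($scores);
  return array_keys($scores);
}

// Equal weights, then the same search with keyword relevance boosted.
$equal = array('relevance' => 5, 'recent' => 5, 'comments' => 5, 'views' => 5);
$boosted = array('relevance' => 10, 'recent' => 5, 'comments' => 5, 'views' => 5);

print implode(', ', sketch_rank($nodes, $equal)) . "\n";
print implode(', ', sketch_rank($nodes, $boosted)) . "\n";
```

With these made-up numbers the commented node leads under equal weights, and the node with the keyword in its title takes over once relevance is weighted more heavily. The real values depend on your content and on how Drupal normalizes each factor.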








Drupal search results with default scoring factors

With four nodes and the default scoring factor weights, searching for "Drupal" favors the node with a comment over the others.

Displaying the score

In order to better observe how search results are ranked, we will now override the theme_search_item function and extend it to output each search item's score. Seeing the scores of items and watching them change in response to various score factor weights will help you decide which settings are optimal for your site.

To display the score on each themed search item, add this function to the template.php file in your theme's directory. If you are using the Garland theme, for example, this function should be added to /themes/garland/template.php.

/**
 * Format a single result entry of a search query. This function is normally
 * called by theme_search_page() or hook_search_page().
 *
 * @param $item
 *   A single search result as returned by hook_search(). The result should be
 *   an array with keys "link", "title", "type", "user", "date", and "snippet".
 *   Optionally, "extra" can be an array of extra info to show along with the
 *   result.
 * @param $type
 *   The type of item found, such as "user" or "node".
 *
 * @ingroup themeable
 */
function phptemplate_search_item($item, $type) {
  // The markup below follows Drupal 5's default theme_search_item().
  $output = ' <dt class="title"><a href="'. check_url($item['link']) .'">'. check_plain($item['title']) .'</a></dt>';
  $info = array();
  if ($item['type']) {
    $info[] = $item['type'];
  }
  if ($item['user']) {
    $info[] = $item['user'];
  }
  if ($item['date']) {
    $info[] = format_date($item['date'], 'small');
  }
  if (is_array($item['extra'])) {
    $info = array_merge($info, $item['extra']);
  }

  // Add the score to the list of items displayed in search results
  $info[] = $item['score'];

  $output .= ' <dd>'. ($item['snippet'] ? '<p>'. $item['snippet'] .'</p>' : '') .'<p class="search-info">'. implode(' - ', $info) .'</p></dd>';
  return $output;
}








Code added to theme_search_item to show score

Two lines have been added to the theme_search_item function.

Now when you search, each search result will display its score. Here is the search results page for a search on Drupal with the four nodes I have created and default values for all of the score factors.








Search results that show the ranking score

Overriding theme_search_item allows us to see how each node has scored in the ranking algorithm.

Boosting keyword relevancy

When looking at the search results for Drupal using the default scoring factors, it is noteworthy that node #1 ranks only second. It has the word Drupal in both the title and the body, which guarantees that it scores highest on the keyword relevancy factor. Overall, however, it seems that the comment count factor (or some other aspect of the scoring algorithm) outweighs keywords. Let's boost the keyword relevancy scoring factor by +2 and repeat the search.








Search results with the keyword scoring factor increased

By boosting the keyword relevancy scoring factor, the node with Drupal in the title now ranks first in the results.

Adding the scoring factor widget to advanced search

Drupal's advanced search feature lets you construct many specific and interesting search queries. You can, for example, search for all Page nodes that have the taxonomy term Politics but not the word Bush. This is one realm where Drupal consistently beats the search results delivered by external search engines such as Yahoo! or Google. Drupal simply knows more about its own content and is thus more capable of searching through it in a structured manner.

What the advanced search form doesn't give you is any option for how to sort or score the search results. Since the score factor weights are only used during the actual searching, and not during indexing, there is nothing stopping us from applying custom factor weights to every search. We will now add the score factor weight controls currently found in the search administration section to the advanced search form so that any user can tweak the weights to get the search results they are most interested in.

The node module uses the HTML Analyzer and Indexer provided by the search module to implement Drupal content searches. The node module adds the advanced search form to the basic search form in its implementation of hook_form_alter. Thus we turn to node_form_alter to add the score factor controls to the advanced search form.

  
// Grab the administration form from node_search
$factors = node_search('admin');

// Get rid of the help text because it takes up too much space
unset($factors['content_ranking']['info']);

// Get rid of the fieldset
$form['advanced']['factors'] = $factors['content_ranking']['factors'];

// Wrap the form elements in a div to hold them together.
$form['advanced']['factors']['#prefix'] = '<div>';
$form['advanced']['factors']['#suffix'] = '</div>';
  

Code added to node_form_alter to add scoring factor controls to advanced search.

The node module handles the validation of the advanced search form in the node_search_validate function. This is where all of the various conditions, such as taxonomy terms, node types and NOT keywords are turned into a keyword query that is usable by the search module. We will extend node_search_validate to also store information about the user's scoring factor preferences in the session.

  

// Store each submitted scoring factor under its own session key.
if (isset($form_values['node_rank_comments'])) {
  $_SESSION['node_rank_comments'] = $form_values['node_rank_comments'];
}
if (isset($form_values['node_rank_recent'])) {
  $_SESSION['node_rank_recent'] = $form_values['node_rank_recent'];
}
if (isset($form_values['node_rank_relevance'])) {
  $_SESSION['node_rank_relevance'] = $form_values['node_rank_relevance'];
}
if (isset($form_values['node_rank_views'])) {
  $_SESSION['node_rank_views'] = $form_values['node_rank_views'];
}
  

Code added to node_search_validate to store scoring factor preferences during searching.

These preferences have to be stored because of the way the search module handles the form. It accepts a POST request from the search form and then redirects, which results in a GET request with the keyword query in the URL. The search is actually executed on this second GET request, by which time the initial POST values are no longer available. The POST-to-GET redirect makes searches bookmarkable and is one of Drupal's nice features, but it means that the POST values for the scoring factors are gone by the time the search query is built. The solution chosen here is to put them into the $_SESSION variable until they are used, at which point they are removed from the $_SESSION. The alternative would have been to make them actual search query terms, as is done with all of the other advanced search form elements, but this resulted in long search queries. The merits of both approaches can be discussed further; the $_SESSION approach is the one used in this article.
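
For comparison, the query-term alternative could look roughly like the sketch below, which packs each weight into the keyword string as an option:value pair. The helper names and the rank_* option names are hypothetical, and this is a simplified standalone imitation rather than Drupal's own query handling. It mainly illustrates how quickly the query string grows.

```php
<?php
/**
 * Add or replace an "option:value" pair in a keyword string (hypothetical helper).
 */
function sketch_query_insert($keys, $option, $value) {
  // Drop any existing pair for this option before appending the new one.
  $pattern = '/(^| )' . preg_quote($option, '/') . ':[^ ]*/';
  $keys = trim(preg_replace($pattern, '', $keys));
  return trim($keys . ' ' . $option . ':' . $value);
}

/**
 * Read the value of an "option:value" pair, or NULL if it is absent.
 */
function sketch_query_extract($keys, $option) {
  $pattern = '/(^| )' . preg_quote($option, '/') . ':([^ ]*)( |$)/';
  if (preg_match($pattern, $keys, $matches)) {
    return $matches[2];
  }
  return NULL;
}

// Encode all four weights into the query for a search on "Drupal".
$keys = 'Drupal';
$weights = array(
  'rank_relevance' => 7,
  'rank_recent' => 5,
  'rank_comments' => 5,
  'rank_views' => 5,
);
foreach ($weights as $option => $value) {
  $keys = sketch_query_insert($keys, $option, $value);
}
print $keys . "\n";
```

Even a one-word search ends up carrying four extra terms in its URL, which is the trade-off weighed against the session approach.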

Upon the GET redirect, the node module builds a specific search query in node_search. Here is a sample of the code from that function which makes use of the scoring factor values stored in the $_SESSION.

  
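if (isset($_SESSION['node_rank_relevance'])) {
  // A per-search preference was stored by node_search_validate; use it once.
  $weight = (int)$_SESSION['node_rank_relevance'];
  unset($_SESSION['node_rank_relevance']);
}
else {
  // Otherwise fall back to the site-wide setting.
  $weight = (int)variable_get('node_rank_relevance', 5);
}
if ($weight) {
  // Average relevance values hover around 0.15
  $ranking[] = '%d * i.relevance';
  $arguments2[] = $weight;
  $total += $weight;
}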
  

Code from node_search which takes $weight first from the $_SESSION, and otherwise from the default variable_get().

In the code above, $weight is the scoring factor. It is first taken from the session variable. If that has not been set, then the traditional value is taken from variable_get(). The weight is then used to construct a SQL snippet which is used in the final search query.
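
The same pattern repeats for each scoring factor. As a rough sketch of how the pieces fit together, the standalone code below reads each weight once from a preference store, falls back to a site default, and collects the weighted SQL fragments. The function name sketch_build_ranking(), the $defaults array, and the {recency}, {comments} and {views} placeholder expressions are hypothetical stand-ins. Only '%d * i.relevance' comes from the excerpt above, and the real node_search code may differ.

```php
<?php
/**
 * Collect weighted ranking fragments for a search query (hypothetical helper).
 *
 * $store stands in for $_SESSION. It is passed by reference so that each
 * preference can be removed once it has been used.
 */
function sketch_build_ranking(&$store, $defaults, $expressions) {
  $ranking = array();
  $arguments = array();
  $total = 0;
  foreach ($expressions as $name => $expression) {
    if (isset($store[$name])) {
      // A per-search preference wins, and is consumed so it applies only once.
      $weight = (int) $store[$name];
      unset($store[$name]);
    }
    else {
      $weight = (int) $defaults[$name];
    }
    if ($weight > 0) {
      // A weight of zero leaves the factor out of the query entirely.
      $ranking[] = '%d * ' . $expression;
      $arguments[] = $weight;
      $total += $weight;
    }
  }
  return array(
    'select' => implode(' + ', $ranking),
    'arguments' => $arguments,
    'total' => $total,
  );
}

// Placeholder expressions; only i.relevance is taken from the article.
$expressions = array(
  'node_rank_relevance' => 'i.relevance',
  'node_rank_recent' => '{recency}',
  'node_rank_comments' => '{comments}',
  'node_rank_views' => '{views}',
);
$defaults = array(
  'node_rank_relevance' => 5,
  'node_rank_recent' => 5,
  'node_rank_comments' => 5,
  'node_rank_views' => 5,
);

// The user boosted relevance and switched page views off for this search.
$store = array('node_rank_relevance' => '7', 'node_rank_views' => '0');
$result = sketch_build_ranking($store, $defaults, $expressions);
print $result['select'] . "\n";
```

Keeping a running total of the weights alongside the fragments is what would allow the summed expression to be scaled back into a comparable score, whatever combination of weights a user picks.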

The patch containing all of the code for this feature is attached. It applies to Drupal 5.1.








The Drupal advanced search form with scoring factor widgets

The advanced search form with the scoring factor controls added.

One goal of this article is to encourage Drupal administrators to experiment with the scoring factor controls. It would be interesting to hear from others which combination of values works best. Another goal of the article is to introduce the idea of having the scoring factor controls present in the advanced search form. Feedback on this idea, its implementation, and the results is very welcome. Drupal's built-in search module has a lot of potential, but some configuration may be needed before it returns optimal results.
