by Juampy NR

Finding related content faster with Apache Solr

We recently fixed a performance issue at the MSNBC project: the More Like This list of content related to the current page was stressing our database servers with slow, complicated MySQL queries. Here is a screenshot of the block in question:

Original More Like This Block

The above block was being generated by Views using Similar By References module. The view was embedded into a Panels Content Type Plugin, similar to a Drupal block but with extra context about where it is being run. Similar By References module is easy to set up: while configuring a Views display, you add a relationship between the current node and the rest of the available content based in similarity of one or more Entity Reference fields.

Here is a screenshot of how the View was set up:

View using Similar by References module

This block worked perfectly for the first few months after launch. Then, as the amount of content grew, we started getting alerts from New Relic due to slow performance on certain queries. The Slow Query Log at New Relic showed us just how slow they were: Extremely!

New Relic's slow query log

Analyzing SQL queries

We started by analyzing these SQL queries to see if we could optimize them with indexes or similar tweaks. Drupal fields are stored in the database in a highly normalized way: every field has its own table. Given this fact, there was no easy way to make that query faster without radically changing the site's construction. That core problem — Drupal's heavily normalized database structure — pointed us in the direction of the solution. We needed a flattened, pre-calculated version of the content that could be sorted and filtered without complex joins. Fortunately, we already had one: the Apache Solr index that powered the site's search features. We were already indexing nodes along with their list of Topics and Issues, and we could query the same Apache Solr index to gather a list of related content for a given node.

Topics and Issues are Drupal taxonomy fields that we use to classify content. They can be consistent site-wide issues like Education, Equality, and Health; or time-sensitive topics like Election 2014 or Dream Act. Because we had them indexed in our Apache Solr server, we thought that we could simply convert the current view into a custom Solr Query that would mimic the same logic to obtain a list of related node nids. Then, we would load these nodes and finally pass them to the theming layer to render the HTML.

Going Solr

Our first attempt was using Apache Solr Views module to build a view with the same conditions that we had in the original view. It looked very promising. but we saw that the resulting Solr query was not what we expected it to be and would require too much work to alter. Moreover, the resulting data needed a considerable amount of preprocessing before being passed to the theming layer.

Next, we tested the More Like This block that ships with Apache Solr Search module. Unfortunately, it wasn't easy to customize the Solr query that it generated so we discarded that option as well.

Search API module was the next solution we tried. We found out that the Acquia Stack (where MSNBC is hosted) and Search API module are not very good friends. Even though there is a module that integrates the two, Apache Solr Search module is recommended instead of Search API because of scalability and performance reasons.

Writing our own Solr query

Given the above findings, we started studying the API of the Apache Solr Search module to generate our own Solr query mimicking our current view:

Given a node, find related nodes in these content types, whose Issues or Topics match. Sort the result by Publishing Date in descending order and just return the latest 4 nids.

Here is a simplified version of the PHP code we used to build that custom Solr query:


// Given the $node variable already contains the currently-viewed node...

// Build the content types filter.
$content_types = array('page', 'article');
$content_type_filter = new SolrFilterSubQuery('OR');
foreach ($content_types as $content_type) {
  $content_type_filter->addFilter('bundle', $content_type);
}

// Build the Issues filter.
$issues = array();
if ($items = field_get_items('node', $node, 'field_issues')) {
  foreach ($items as $issue) {
    $issues[] = 'node:' . $issue['target_id'];
  }
}
$issues_filter = new SolrFilterSubQuery('OR');
foreach ($issues as $issue) {
  $issues_filter->addFilter('sm_field_issues', $issue);
}

// Build the Topics filter.
$topics = array();
if ($items = field_get_items('node', $node, 'field_topics')) {
  foreach ($items as $topic) {
    $topics[] = 'node:' . $topic['target_id'];
  }
}
$topics_filter = new SolrFilterSubQuery('OR');
foreach ($topics as $topic) {
  $topics_filter->addFilter('sm_field_topics', $topic);
}

// Group conditions together.
$main_filter = new SolrFilterSubQuery('AND');
$main_filter->addFilterSubQuery($content_type_filter);
$issues_or_topics = new SolrFilterSubQuery('OR');
$issues_or_topics->addFilterSubQuery($issues_filter);
$issues_or_topics->addFilterSubQuery($topics_filter);
$main_filter->addFilterSubQuery($issues_or_topics);
  

That snippet was the trickiest bit: it's adding filters and grouping commands to the search to mirror the way the View was set up. Since some of the Apache Solr Search API functions can throw PHP exceptions (for example, if Solr Server is down) we wrapped the actual call to the search index in a Try/Catch statement:


try {
  // Create the query and then configure it.
  $query = apachesolr_drupal_query('apachesolr');

  // Specify that we only want nids in the result.
  $query->addParam('fl', 'entity_id');
  // Only return 4 matches.
  $query->addParam('rows', 4);
  // Add the above filters.
  $query->addFilterSubQuery($main_filter);
  // Sort by publish date in descending order.
  $sort_field = 'ds_field_publish_date_sort';
  $sort_direction = 'desc';
  $query->setAvailableSort($sort_field, $sort_direction);
  $query->setSolrsort($sort_field, $sort_direction);

  // Run query and render results if matches are found.
  list($final_query, $response) = apachesolr_do_query($query);
  if ($response->code == '200' && $response->response->numFound > 0) {
    // Extract nids from the response and load them.
    $nids = array();
    foreach ($response->response->docs as $result) {
      $nids[] = $result->entity_id;
    }
    $nodes = node_load_multiple($nids);
    if (count($nodes)) {
      // Build and return the list of teasers.
      return theme('my_theme_callback', array('nodes' => $nodes));
    }
  }
}
catch (Exception $e) {
  watchdog('msnbc_search', 'There was an error while processing More Like This block: %error', array('%error' => $e->getMessage()), WATCHDOG_ERROR);
}
  

You can find and test the full script yourself at this Gist.

The above script will generate the following Solr query for a node with Issues and Topics.


/solr/SOME_SERVER_ID/select?start=0
&rows=4
&fq=((bundle:article OR page)
  AND ((sm_field_issues:"node:453" OR sm_field_issues:"node:457")
    OR (sm_field_topics:"node:13672" OR sm_field_topics:"node:176716")))
&fl=entity_id
&sort=ds_field_publish_date_sort desc
&q=
&wt=json
&json.nl=map
  

We could say that the Solr query has some similarities with an SQL query:

  • It supports SQL style conditions at the fq parameter.
  • It can restrict the fields to be returned at fl.
  • It can sort results by a field at sort.
  • It can limit the amount of results at fq.

You can find the full list of available parameters at the Solr Wiki or by reading the API of the different classes that the Apache Solr Search module implements.

Retrieving and theming results

Our search works! Now, if we load a node which contains Issues and Topics, and we run the above script, we get the following JSON response from Solr:


{"response":{
  "numFound":4812,
  "start":0,
  "docs":[
    {"entity_id":386621},
    {"entity_id":386561},
    {"entity_id":386056},
    {"entity_id":385996}
  ]
}}
  

That data that's contained inside of $data->response->docs is an array of node IDs. Sweet. Just what we needed! Now we can load these nodes with node_load_multiple() and pass them to a theme function that will take care of rendering a list of teasers.

Moving the request to the front end

Although everything was working, we saw that the block was not only rendered on article pages, but also on any pages with a slider — and it had to reload through AJAX whenever a user scrolled to a new item.

Following an approach that we took with other blocks on the MSNBC project, we decided to sidestep Drupal entirely and use AngularJS to obtain the list of related teasers and render them. This was a double bonus: AngularJS would speed up the initial page load by building the list of teasers after the main page content was sent, and we'd be able to reuse the self-contained "related content" plugin elsewhere on the site. You can find a sample of the plugin implementation at this repository on GitHub.

Conclusion

When we deployed the new version of the More Like This block to production, we saw an immediate drop in server load. The following New Relic graphs show a brief spike during the deployment (when caches are cleared), and an immediate improvement in response time:

New relic request load stats after deploying the new code

Building dynamic lists of related content easily result in expensive queries. We hope that this article helps you when the problem occurs: it's easy to optimize them by moving them to Solr. We've also included a list of related links to help you dive deeper. Good luck!

newsletter-bot