Processing expensive back-end operations

During the life cycle of a Drupal project there are many situations when you need to do expensive operations. Examples of these are: populating a newly created field for thousands of entities, calling a webservice multiple times, downloading all of the images referenced by a certain text field, etc.

In this article, I will explain how you can organize these operations in order to avoid the pitfalls related to them. I created a GitHub repository with the code of every step in this article.

You will see how to use update hooks, drush commands, or Drupal queues to solve this, and when each one is the better choice.

Scenario

The UX team at the Great Shows TV channel has come up with an idea to improve the user experience on their Drupal website. They are partnering with Nuts For TV, an online database with lots of reviews of TV show episodes, fan art, etc. The idea is that whenever an episode is created –or updated– in the Great Shows website, all the information available will be downloaded from a specific URL stored in Drupal fields. Also, they have gone ahead and manually updated all of the existing episode nodes to add the URL to the new field_third_party_uri field.

Your job as the lead back-end developer at Great Shows is to import the episode information from Nuts For TV. After writing the requisite hook_entity_presave implementation that calls _bg_process_perform_expensive_task, you end up with hundreds of old episode nodes that still need to be processed. Your first approach may be to write an update hook to loop through the episode content and run _bg_process_perform_expensive_task.

The example repo focuses on the strategies to deal with massive operations. All the code samples are written for educational purposes, and not for their direct use.

Operations that are expensive in time are often expensive in memory as well. You want to keep the update hook from failing because the available memory has been exhausted.

Do not run out of memory

With an update hook you can have the code deployed to every environment and run database updates as part of your deploy process. This is the approach taken in the first step in the example repo. You will take the entities that have the field_third_party_uri attached to them and process them with _bg_process_perform_expensive_task.

/**
 * Update from a remote 3rd party web service.
 */
function bg_process_update_7100() {
  // All of the entities that need to be updated contain the field.
  $field_info = field_info_field(FIELD_URI);
  // $field_info['bundles'] contains information about the entities and bundles
  // that have this particular field attached to them.
  $entity_list = array();

  // Populate $entity_list
  // Something like:
  // $entity_list = array(
  //   array('entity_type' => 'node', 'entity_id' => 123),
  //   array('entity_type' => 'node', 'entity_id' => 45),
  // );

  // Here is where we process all of the items:
  $succeeded = $errored = 0;
  foreach ($entity_list as $entity_item) {
    $success = _bg_process_perform_expensive_task($entity_item['entity_type'], $entity_item['entity_id']);
    $success ? $succeeded++ : $errored++;
  }
  // There is no $sandbox in this version, so report the local counters.
  return t('@succeeded entities were processed correctly. @errored entities failed.', array(
    '@succeeded' => $succeeded,
    '@errored' => $errored,
  ));
}

This is when you realize that the update hook never completes due to memory issues. Even if it completes on your local machine, there is no way to guarantee that it will finish in all of the environments in which it needs to be deployed. You can solve this using batch update hooks, so that is what we are going to do in Step 2.

Running updates in batches

There is no exact way of telling when you will need to perform your updates in batches, but if you answer any of these questions with a yes, then you should do batches:

  • Did the single update run out of memory on your local machine?
  • Did you wonder whether the update had died while it ran as a single pass?
  • Are you loading/updating more than 20 entities at a time?

While these provide a good rule of thumb, every situation deserves to be evaluated separately.

When using batches, your episodes update hook will transform into:

/**
 * Update from a remote 3rd party web service.
 * 
 * Take all the entities that have FIELD_URI attached to
 * them and perform the expensive operation on them.
 */
function bg_process_update_7100(&$sandbox) {
  // Generate the list of entities to update only once.
  if (empty($sandbox['entity_list'])) {
    // Size of the batch to process.
    $batch_size = 10;
    // All of the entities that need to be updated contain the field.
    $field_info = field_info_field(FIELD_URI);
    // $field_info['bundles'] contains information about the entities and bundles
    // that have this particular field attached to them.
    $entity_list = array();
    foreach ($field_info['bundles'] as $entity_type => $bundles) {
      $query = new \EntityFieldQuery();
      $results = $query
        ->entityCondition('entity_type', $entity_type)
        ->entityCondition('bundle', $bundles, 'IN')
        ->execute();
      if (empty($results[$entity_type])) {
        continue;
      }
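      // Add the ids with the entity type to the $entity_list array, that will be
      // processed later. Use array_merge() instead of +=, since += would drop
      // any new item whose numeric key already exists in $entity_list.
      $ids = array_keys($results[$entity_type]);
      $entity_list = array_merge($entity_list, array_map(function ($id) use ($entity_type) {
        return array(
          'entity_type' => $entity_type,
          'entity_id' => $id,
        );
      }, $ids));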
    }
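    if (empty($entity_list)) {
      // Nothing to process: finish right away and avoid a division by zero.
      $sandbox['#finished'] = 1;
      return t('No entities needed to be processed.');
    }
    $sandbox['total'] = count($entity_list);
    $sandbox['entity_list'] = array_chunk($entity_list, $batch_size);
    $sandbox['succeeded'] = $sandbox['errored'] = $sandbox['processed_chunks'] = 0;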
  }
  // At this point we have the $sandbox['entity_list'] array populated:
  // $entity_list = array(
  //   array(
  //     array('entity_type' => 'node', 'entity_id' => 123),
  //     array('entity_type' => 'node', 'entity_id' => 45),
  //   ),
  //   array(
  //     array('entity_type' => 'file', 'entity_id' => 98),
  //     array('entity_type' => 'file', 'entity_id' => 640),
  //     array('entity_type' => 'taxonomy_term', 'entity_id' => 74),
  //   ),
  // );

  // Here is where we process all of the items:
  $current_chunk = $sandbox['entity_list'][$sandbox['processed_chunks']];
  foreach ($current_chunk as $entity_item) {
    $success = _bg_process_perform_expensive_task($entity_item['entity_type'], $entity_item['entity_id']);
    $success ? $sandbox['succeeded']++ : $sandbox['errored']++;
  }
  // Increment the number of processed chunks to see if we finished.
  $sandbox['processed_chunks']++;

  // When we have processed all of the chunks $sandbox['#finished'] will be 1.
  // Then the update runner will consider the job finished.
  $sandbox['#finished'] = $sandbox['processed_chunks'] / count($sandbox['entity_list']);

  return t('@succeeded entities were processed correctly. @errored entities failed.', array(
    '@succeeded' => $sandbox['succeeded'],
    '@errored' => $sandbox['errored'],
  ));
}

Note how the $sandbox array is shared among all the batch iterations. That is how you can detect that this is the first iteration –by doing empty($sandbox['entity_list'])– and how you signal to the Batch API that the update is done, by letting $sandbox['#finished'] reach 1. The $sandbox is also used to keep track of the chunks that have been processed already.
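
To make the $sandbox contract more concrete, here is a minimal, self-contained sketch that imitates what the update runner does with a batched update. The functions example_batched_update and example_run_update are hypothetical and not part of Drupal. The real runner does more, such as persisting the sandbox between requests, so treat this only as an illustration of the calling pattern.

```php
<?php

/**
 * Hypothetical batched update: processes one chunk of items per call.
 */
function example_batched_update(array &$sandbox) {
  // First call only: build the work list and initialize the counters.
  if (empty($sandbox['chunks'])) {
    $sandbox['chunks'] = array_chunk(range(1, 25), 10);
    $sandbox['processed_chunks'] = 0;
    $sandbox['processed_items'] = 0;
  }
  // Process only the current chunk.
  $chunk = $sandbox['chunks'][$sandbox['processed_chunks']];
  $sandbox['processed_items'] += count($chunk);
  $sandbox['processed_chunks']++;
  // A value below 1 asks the runner to call this function again.
  $sandbox['#finished'] = $sandbox['processed_chunks'] / count($sandbox['chunks']);
}

/**
 * Hypothetical stand-in for the update runner.
 *
 * Keeps calling the update function with the same sandbox until the function
 * reports that it has finished.
 */
function example_run_update($update_function) {
  $sandbox = array();
  $calls = 0;
  do {
    $update_function($sandbox);
    $calls++;
  } while ($sandbox['#finished'] < 1);
  return array('calls' => $calls, 'items' => $sandbox['processed_items']);
}

$result = example_run_update('example_batched_update');
print $result['calls'] . ' calls, ' . $result['items'] . ' items';
```

With 25 items and a chunk size of 10, the update function is called three times, and each call only touches the items in its own chunk.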

By running your episode updates in batches your next release will be safer, since you will have decreased the chances of memory issues. At this point, you observe that this next release will take two extra hours because of these operations running as part of the deploy process. You decide to write a drush command that takes care of updating all your episodes, which decouples the data import from the deploy process.

Writing a custom drush command

With a custom drush command you can run your operations in every environment, and you can do it at any time and as many times as you need. You have decided to create this drush command so Matt (the release manager at Great Shows) can run it as part of the production release. That way he can create a release plan that is not blocked by a two-hour update hook.

Drush runs in your terminal, and that means that it will be running under PHP CLI. This allows you to have different configurations to run your drush commands, without affecting your web server. Thus, you can set a very high memory limit for PHP CLI to run your expensive operations. Check out Karen Stevenson’s article to test your custom drush commands with different drush versions.

To create a drush command from our original update hook in Step 1 we just need to create the drush file and implement the following functions:

  • hook_drush_command declares the command name and options passed to it.
  • drush_{MODULE}_{COMMANDNAME}. This is the main callback function, the action will happen here.

This results in:

/**
 * Main command callback.
 *
 * @param string $field_name
 *   The name of the field in the entities to process.
 */
function drush_bg_process_background_process($field_name = NULL) {
  if (!$field_name) {
    return;
  }
  // All of the entities that need to be updated contain the field.
  $field_info = field_info_field($field_name);
  $entity_list = array();
  // Populate $entity_list by looping over $field_info['bundles']. That code
  // has been omitted for brevity’s sake. See the example repo for the
  // complete code.

  // At this point we have the $entity_list array populated.
  // Something like:
  // $entity_list = array(
  //   array('entity_type' => 'node', 'entity_id' => 123),
  //   array('entity_type' => 'file', 'entity_id' => 98),
  // );
  // Here is where we process all of the items:
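  $succeeded = $errored = 0;
  foreach ($entity_list as $entity_item) {
    $success = _bg_process_perform_expensive_task($entity_item['entity_type'], $entity_item['entity_id']);
    $success ? $succeeded++ : $errored++;
  }
  // Report the outcome in the terminal.
  drush_log(dt('@succeeded entities were processed correctly. @errored entities failed.', array(
    '@succeeded' => $succeeded,
    '@errored' => $errored,
  )), 'ok');
}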

Some of the code above has been omitted for brevity’s sake. Please look at the complete example.
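
The command declaration is not shown above. As a rough sketch, and assuming the drush file is named bg_process.drush.inc, the hook_drush_command implementation could look like the following. The description, argument help, and example text are illustrative and may differ from the example repo.

```php
<?php

/**
 * Implements hook_drush_command().
 *
 * Sketch only: assumed to live in bg_process.drush.inc.
 */
function bg_process_drush_command() {
  $items = array();
  // Drush maps this command name to drush_bg_process_background_process().
  $items['background-process'] = array(
    'description' => 'Runs the expensive task on every entity that has the given field.',
    'arguments' => array(
      'field_name' => 'The name of the field in the entities to process.',
    ),
    // Fail early when the field name is missing.
    'required-arguments' => TRUE,
    'examples' => array(
      'drush background-process field_third_party_uri' => 'Processes every entity that has field_third_party_uri attached.',
    ),
  );
  return $items;
}
```

Drush builds the callback name from the command file name and the command name, replacing dashes with underscores. That is why the main callback above is called drush_bg_process_background_process.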

After declaring the drush command there is almost no difference between the update hook in Step 1 and this drush command.

With this code in place, you will have to run drush background-process field_third_party_uri in an environment to be able to QA the updated episodes. Drush also introduces some additional flexibility.

As the dev lead for Great Shows, you know that even though you can configure PHP CLI separately, you still want to run your drush command in batches. That will save some resources and will not rely on the PHP memory_limit setting.

A batch drush command

The transition to a batch drush command is also straightforward. The callback for the command will be responsible for preparing the batches. A new function will be written to deal with every batch, which will be very similar to our old command callback.

Looking at the source code for the batch command you can see how drush_bg_process_background_process is responsible for getting the array of pairs containing entity types and entity IDs for all of the entities that need to be updated. That array is then chunked, so every batch will only process one of the chunks.

The last step is creating the operations array. Every item in the array will describe what needs to be done for every batch. With the operations array populated we can set some extra properties to the batch, like a callback that runs after all batches, and a progress message.

The drush command to add the extra data to the episodes uses two helper functions in order to have more readable code. _drush_bg_callback_get_entity_list is a helper function that will find all of the episodes that need to be updated, and return the entity type and entity ID pairs. _drush_bg_callback_process_entity_batch will update the episodes in the batch.
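
The following is a minimal sketch of how those pieces might fit together, not the code from the example repo. It assumes that _drush_bg_callback_get_entity_list takes the field name and that _drush_bg_callback_process_entity_batch takes one chunk of entity pairs; both signatures are guesses. The helper _bg_process_example_build_batch and the finished callback _bg_process_example_batch_finished are hypothetical names introduced for illustration.

```php
<?php

/**
 * Hypothetical helper: builds the batch definition, one operation per chunk.
 */
function _bg_process_example_build_batch(array $entity_list, $batch_size = 10) {
  $operations = array();
  foreach (array_chunk($entity_list, $batch_size) as $chunk) {
    // Each operation is a callback name plus the arguments passed to it.
    // The chunk signature of this callback is an assumption.
    $operations[] = array('_drush_bg_callback_process_entity_batch', array($chunk));
  }
  return array(
    'title' => 'Processing entities',
    'operations' => $operations,
    // Hypothetical callback that runs once after all the operations.
    'finished' => '_bg_process_example_batch_finished',
    'progress_message' => 'Processed @current of @total batches.',
  );
}

/**
 * Main command callback (sketch).
 */
function drush_bg_process_background_process($field_name = NULL) {
  if (!$field_name) {
    return;
  }
  // Assumed signature: returns the entity type and entity ID pairs.
  $entity_list = _drush_bg_callback_get_entity_list($field_name);
  batch_set(_bg_process_example_build_batch($entity_list));
  // Hand the batch over to drush so it is processed from the command line.
  drush_backend_batch_process();
}
```

Keeping the batch definition in its own function separates the chunking logic from the drush bootstrap, so it can be checked on its own.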

It is common to need to run a callback on a list of entities in a batch drush command. Entity Process Callback is a generic drush command that lets you select the entities to be updated and apply a specified callback function to them. With that you only need to write a function that takes an entity type and an entity object, and pass the name of that function to drush epc node _my_custom_callback_function. For our example, all the drush code is simplified to:

/**
 * Helper function that performs an expensive operation for EPC.
 */
function _my_custom_callback_function($entity_type, $entity) {
  list($entity_id,,) = entity_extract_ids($entity_type, $entity);
  _bg_process_perform_expensive_task($entity_type, $entity_id);
}

Running drush batch commands is a very powerful and flexible way of executing your expensive back-end operations. However, it will run all of the operations sequentially in a single run. If that becomes a problem you can leverage Drupal’s built-in queue system.

Drupal Queues

Sometimes you don’t care if your operations are executed immediately; you only need to execute them at some point in the near future. In those cases, you may use Drupal queues.

Instead of updating the episodes immediately, there will be an operation per episode waiting in the queue to be executed. Each one of those operations will update an episode. All of the episodes will be updated only when all of the queue items –the episode update operations– have been processed.

You will only need an update hook that inserts one item per episode into the queue, with all the information needed to update that episode later. First, create the new queue that will hold the episode information. Then, insert the entity type and entity ID in the queue.
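
As a sketch of that update hook, assuming Drupal 7's DrupalQueue API: the queue name bg_process_example_queue, the update number 7101, and the helper _bg_process_example_enqueue are hypothetical, and the code that builds $entity_list is left out in the same way as in the earlier listings.

```php
<?php

/**
 * Hypothetical helper: adds one queue item per entity.
 *
 * $queue is expected to be a Drupal 7 queue object, that is, an object that
 * implements DrupalQueueInterface.
 */
function _bg_process_example_enqueue($queue, array $entity_list) {
  $queued = 0;
  foreach ($entity_list as $entity_item) {
    // Each item carries only what the worker needs to find the entity later.
    $created = $queue->createItem(array(
      'entity_type' => $entity_item['entity_type'],
      'entity_id' => $entity_item['entity_id'],
    ));
    if ($created) {
      $queued++;
    }
  }
  return $queued;
}

/**
 * Queue every entity for later processing (sketch).
 */
function bg_process_update_7101() {
  // Populate $entity_list in the same way as in the previous update hooks.
  // That code has been omitted here, so this sketch queues nothing as is.
  $entity_list = array();

  $queue = DrupalQueue::get('bg_process_example_queue');
  $queue->createQueue();
  $queued = _bg_process_example_enqueue($queue, $entity_list);
  return t('@count entities were queued for processing.', array('@count' => $queued));
}
```

Inserting items is cheap compared to processing them, so this update hook should finish quickly even with many episodes.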

At this point you have created a queue and inserted a bunch of entity type and ID pairs, but there is nothing that is processing those items. To fix that you need to implement hook_cron_queue_info so queue elements get processed during cron runs. The 'worker callback' key holds the function that is executed for every queue item. Since we have been inserting an array for the queue item, that is what _bg_process_execute_queue_item –your worker callback– will receive as an argument. All that your worker needs to do is to execute the expensive operation.
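
A minimal sketch of that hook and its worker could look like this. The queue name bg_process_example_queue is a hypothetical placeholder that has to match whatever name was used when the items were inserted, and the 60 second limit is an arbitrary choice for illustration.

```php
<?php

/**
 * Implements hook_cron_queue_info().
 */
function bg_process_cron_queue_info() {
  $queues = array();
  // The key must match the name of the queue that holds the items.
  $queues['bg_process_example_queue'] = array(
    'worker callback' => '_bg_process_execute_queue_item',
    // Maximum number of seconds per cron run spent on this queue.
    'time' => 60,
  );
  return $queues;
}

/**
 * Worker callback: receives the array that was stored in the queue item.
 */
function _bg_process_execute_queue_item(array $item) {
  _bg_process_perform_expensive_task($item['entity_type'], $item['entity_id']);
}
```

With this in place, every cron run takes items from the queue and passes them to the worker until the queue is empty or the time limit for that run is reached.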

There are several ways to process your queue items.

  • Drupal core ships with the aforementioned cron processing. This is the basic method, and the one used by Great Shows for their episode updates.
  • Similar to that, drush comes with drush queue-list and drush queue-run {queue name} to trigger your cron queues manually.
  • Fellow Lullabot Dave Reid wrote Concurrent Queue to process your queue operations in parallel and decrease the execution time.
  • The Advanced Queue module will give you extra niceties for your queues.
  • Another alternative is Queue Runner. This daemon will be monitoring your queue to process the items as soon as possible.

There are probably even more ways to deal with the queue items that are not listed here.

Conclusion

In this article, we started with a very naive update hook to execute all of our expensive operations. Resource limitations made us turn that into a batch update hook. If you need to detach these operations from the release process, you can turn your update hooks into a drush command or a batch drush command. A good alternative to that is to use Drupal’s queue system to prepare your operations and execute them asynchronously in the (near) future.

Some tasks will be better suited for one approach than others. Do you use other strategies when dealing with this? Share them in the comments!
