A couple of weeks ago, Twitter started circulating news about the upcoming Twitpocalypse.
The easy 'default' storage format numbers in many programming languages and databases is the 'signed integer.' It's usually capable of representing values from -2,147,483,647 to +2,147,483,647. As fate would have it, the number of Twitter messages in existence was nearing that limit, and any developers who'd built software that stored tweets would encounter errors unless they started using larger number formats to store Twitter IDs.
Drupal's Twitter module (which James Walker and I co-maintain) had that problem: it archived Twitter statuses in the database, and it saved the Twitter IDs as signed integers. We released an update several weeks ago that changed the database column to an "unsigned bigint," capable of holding numbers as high as 18,446,744,073,709,551,615. Disaster averted!
When the big day arrived and Twitter Status ID 2,147,483,647 was finally posted, we started getting sporadic bug reports from users despite the fix we'd put in place. Even Sony Music, one of Lullabot's Twitter-using clients, got reports from their artists. Chris Daughtry's tweets weren't updating on his web site, and social media starvation was starting to set in. Time for some debugging!
After soliciting some feedback from affected users
, it became clear that Twitter module was only affected on 32-bit servers, or where PHP was running in 32-bit mode. On those machines, somewhere between retrieving the value from Twitter's API and saving it to the bigint column in the database, it was being squished back into a signed int and corrupted. Setting up on another server, I was able to reproduce the problem and track it down to its source -- Drupal core.
The Terrible Truth
Turns out, PHP is
smart enough to handle those large numbers without immediately choking. If you pull in a large value from Twitter's API, for example, it will treat it as a double (not as large as MySQL's 'bigint' data type, but certainly better than a normal int) without any special instructions. And if you build a SQL query to insert that value into the database, everything works fine. The trouble comes when those numbers are passed through Drupal's _db_query_callback()
function. That function uses placeholders like %d and %s to sanitize numbers and strings, preventing SQL injection attacks. Unfortunately, the commonly used %d placeholder represents an integer
On 32-bit hardware, PHP is only able to handle big integers by treating them as doubles -- but passing them through any filtering methods with a %d placeholder converts them back to integers, overflowing the maximum size and discarding the "real" original value. The easiest way around this if you're building your own queries and calling db_query() is to use the undocumented %n
placeholder when dealing with large numbers -- it just ensures that a value is a number, but doesn't change its fundamental value type.
In Drupal 6, two great pieces of code were added to the project's bag of tricks. First, SchemaAPI. It allowed developers to defined their module's database tables using descriptive arrays, defining columns and indexes in a database-agnostic format. While SQL is pretty clean when it comes to SELECT, INSERT, and UPDATE syntax, table creation and support for various data types is all over the place -- inconsistent from one database system to another. SchemaAPI lets people say, "This column should be an integer, and it should be a BIG UNSIGNED one, at that" -- then generates the appropriate SQL to create the column for each database system.
Now, if you're listening to that and thinking it sounds suspiciously close to the integer-handling issue we've been talking about, you're right. SchemaAPI collides messily with the issue when the drupal_write_record() function is used. That function takes any PHP object or array, along with the name of the table it should be saved to, and uses SchemaAPI information to build a safe, sanitized insert or update query for the object. It's extremely useful, and it's used all over the place in Drupal core and contrib. The node module, the user module, and the taxonomy module all save and update their data using the function.
Unfortunately, SchemaAPI treats 'big unsigned integers' as just plain 'integers' when it comes time to construct one of those insert or update queries. It assumes that a %d placeholder would be appropriate, and ... voila. Numbers larger than the plain-old-integer size limit are lost, even if PHP is smart enough to deal with them and the database columns is large enough to hold them. This is the situation that Twitter module ran into. It's the situation that your entire Drupal installation could run into, too, if
you ever hit more than a few billion node revisions and you're running on 32 bit hardware.
What's the solution?
If you're writing your own queries by hand and passing them through db_query(), the easiest workaround is to use the %n placeholder instead of %d when manipulating big or unsigned integers. If you're using db_write_record(), it's more complicated. SchemaAPI will always try to use %d for integers, regardless of their real size: there's no safe way
to insert or update bigints or unsigned ints as long as that is the case. There is an issue in the Drupal bug queue highlighting the problem
, but cross-database compatibility are making things tricky.
For the time being, the best option when using drupal_write_record() is to change the database schema itself. Instead of the SchemaAPI data type 'integer' with a size of 'big', use a 'numeric' column with a precision of 20 and a scale of 0. It's not as efficient for storage as a normal bigint (Versions of MySQL older than 5.0.2 will actually store the numbers as strings
, for example), but it will be able to store the values accurately. More important, SchemaAPI will automatically use the %n placeholder instead of %d, allowing the large numbers to pass through without problems.