by Jeff Eaton on June 22, 2009 // Short URL

Drupocalypse Now (or, dangerous integer handling in drupal_write_record)

A couple of weeks ago, Twitter started circulating news about the upcoming Twitpocalypse. The easy 'default' storage format numbers in many programming languages and databases is the 'signed integer.' It's usually capable of representing values from -2,147,483,647 to +2,147,483,647. As fate would have it, the number of Twitter messages in existence was nearing that limit, and any developers who'd built software that stored tweets would encounter errors unless they started using larger number formats to store Twitter IDs.

Drupal's Twitter module (which James Walker and I co-maintain) had that problem: it archived Twitter statuses in the database, and it saved the Twitter IDs as signed integers. We released an update several weeks ago that changed the database column to an "unsigned bigint," capable of holding numbers as high as 18,446,744,073,709,551,615. Disaster averted!

Not Quite

When the big day arrived and Twitter Status ID 2,147,483,647 was finally posted, we started getting sporadic bug reports from users despite the fix we'd put in place. Even Sony Music, one of Lullabot's Twitter-using clients, got reports from their artists. Chris Daughtry's tweets weren't updating on his web site, and social media starvation was starting to set in. Time for some debugging!

After soliciting some feedback from affected users, it became clear that Twitter module was only affected on 32-bit servers, or where PHP was running in 32-bit mode. On those machines, somewhere between retrieving the value from Twitter's API and saving it to the bigint column in the database, it was being squished back into a signed int and corrupted. Setting up on another server, I was able to reproduce the problem and track it down to its source -- Drupal core.

The Terrible Truth

Turns out, PHP is smart enough to handle those large numbers without immediately choking. If you pull in a large value from Twitter's API, for example, it will treat it as a double (not as large as MySQL's 'bigint' data type, but certainly better than a normal int) without any special instructions. And if you build a SQL query to insert that value into the database, everything works fine. The trouble comes when those numbers are passed through Drupal's _db_query_callback() function. That function uses placeholders like %d and %s to sanitize numbers and strings, preventing SQL injection attacks. Unfortunately, the commonly used %d placeholder represents an integer.

On 32-bit hardware, PHP is only able to handle big integers by treating them as doubles -- but passing them through any filtering methods with a %d placeholder converts them back to integers, overflowing the maximum size and discarding the "real" original value. The easiest way around this if you're building your own queries and calling db_query() is to use the undocumented %n placeholder when dealing with large numbers -- it just ensures that a value is a number, but doesn't change its fundamental value type.

Oh, drupal_write_record()...

In Drupal 6, two great pieces of code were added to the project's bag of tricks. First, SchemaAPI. It allowed developers to defined their module's database tables using descriptive arrays, defining columns and indexes in a database-agnostic format. While SQL is pretty clean when it comes to SELECT, INSERT, and UPDATE syntax, table creation and support for various data types is all over the place -- inconsistent from one database system to another. SchemaAPI lets people say, "This column should be an integer, and it should be a BIG UNSIGNED one, at that" -- then generates the appropriate SQL to create the column for each database system.

Now, if you're listening to that and thinking it sounds suspiciously close to the integer-handling issue we've been talking about, you're right. SchemaAPI collides messily with the issue when the drupal_write_record() function is used. That function takes any PHP object or array, along with the name of the table it should be saved to, and uses SchemaAPI information to build a safe, sanitized insert or update query for the object. It's extremely useful, and it's used all over the place in Drupal core and contrib. The node module, the user module, and the taxonomy module all save and update their data using the function.

Unfortunately, SchemaAPI treats 'big unsigned integers' as just plain 'integers' when it comes time to construct one of those insert or update queries. It assumes that a %d placeholder would be appropriate, and ... voila. Numbers larger than the plain-old-integer size limit are lost, even if PHP is smart enough to deal with them and the database columns is large enough to hold them. This is the situation that Twitter module ran into. It's the situation that your entire Drupal installation could run into, too, if you ever hit more than a few billion node revisions and you're running on 32 bit hardware.

What's the solution?

If you're writing your own queries by hand and passing them through db_query(), the easiest workaround is to use the %n placeholder instead of %d when manipulating big or unsigned integers. If you're using db_write_record(), it's more complicated. SchemaAPI will always try to use %d for integers, regardless of their real size: there's no safe way to insert or update bigints or unsigned ints as long as that is the case. There is an issue in the Drupal bug queue highlighting the problem, but cross-database compatibility are making things tricky.

For the time being, the best option when using drupal_write_record() is to change the database schema itself. Instead of the SchemaAPI data type 'integer' with a size of 'big', use a 'numeric' column with a precision of 20 and a scale of 0. It's not as efficient for storage as a normal bigint (Versions of MySQL older than 5.0.2 will actually store the numbers as strings, for example), but it will be able to store the values accurately. More important, SchemaAPI will automatically use the %n placeholder instead of %d, allowing the large numbers to pass through without problems.

Jeff Eaton

Senior Digital Strategist

Want Jeff Eaton to speak at your event? Contact us with the details and we’ll be in touch soon.

Comments

chx

Double is NOT bigint

PHP on 32 bit overflows integers to floats which are stored as 64 bit IEEE 784 floating point numbers aka. doubles. These are NOT the equivalents of MySQL bigint which are indeed 64 bit signed integers. PHP can store ints on 64 bits only on 64 bit architectures. That said, it is an incredible pain to handle anything more than 32 bits because PHP does not have a unified 64 bit int.

You will, sooner or later, run into weird and weirder problems with using floats instead of ints. (Come to Paris, I will tell you more. Or eventually it will find its way to phpwtf.org) To avoid, I heartily recommend the bcmath extension (which James used for OpenID).

Reply

eaton

You're correct

PHP's doubles aren't the equivalent of a bigint, but they are at least larger than signed ints. I've updated the article with clearer wording.

The really serious problem at the moment is Drupal's insistence on explicitly calling int() on numbers that could be PHP doubles; that makes it impossible to safely insert or update an unsigned int or bigint using SchemaAPI generated queries when running in 32-bit mode.

There are no particular happy solutions, but at least using the 'number' SchemaAPI type allows us to do things like reading in foreign keys from external APIs without blowing up.

Reply

chx

Its not the size

Once again, the problem is not that bigints and double have different sizes -- they are both stored on 64 bits, actually. The problem is that double is a float and not an integer.

I still think that bcmath and %n is your best bet.

Reply

eaton

Agreed.

The 'size' issue when dealing with doubles is about 'accurate size' -- a double can store up to 999,999,999,999 before PHP starts spitting it out in scientific notation and going wonky; as long as we're not doing any math with it, that value is a "safe" maximum to pass into a NUMERIC field using db_write_record(). And it does give us more breathing room than a signed int.

As you note, though, bcmath is the only real option if you need to actually DO stuff with those values. After digging around and researching it, and writing the article, I still feel there aren't any really good solutions. Do you have any thoughts on what those who can't slap bcmath onto their server might be able to do?

Reply

eaton

Sorting and filtering

Though it's not terribly common, things like sorting by status id are impossible if you store as strings. This problem can also affect foreign keys for things like node ids and node revision ids, if a site ever accumulates enough content to run into the signed int limit.

Reply

bec

Is sort important enough to

Is sort important enough to deal with the issues around integers? My understanding is that Twitter status ids aren't explicitly sequential, anyway--they should be ordered by date.

Reply

eaton

Strings and Twitter

In the case of Twitter specifically, you're right -- the advantages of bigints vs strings is negligible beyond the small increase in data storage requirements. (A few bytes here, a few byes there, pretty soon you have some real K...?)

However, the underlying problem that the Twitpocalypse highlighted is the real issue: in cases where we do legitimately need bigints or unsigned ints, drupal_write_record() will sabotage us. Databases support that type for a reason!

Reply