When Regular Expressions Go Too Far

Taming Filthy, Greedy Regular Expressionses With the Ungreedy Flag

This week, I finally learned how to fix a regular expression problem that has long vexed me and stumped my Googling skills. The details are tricky to explain, but the simplest example is easy to describe. In many cases, a regular expression would get the last instance of a match in a string instead of the first instance. This has caused me all sorts of headaches when using preg_replace() to touch up some HTML in Drupal.

Of course, it's rarely necessary (or even a good idea) to try parsing HTML with regular expressions, and QueryPath is often a better solution. That said, I ignored my own advice and decided to use a regular expression in an input filter anyway. Let's say I want to wrap the text of any header tag (h1 through h6) in a span tag, so that I can better style it with CSS. Here's a basic preg_replace() call that you might expect would do the trick:

  
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$regex = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/i";
$spanned = preg_replace($regex, '$1<span>$2</span>$3', $html);

  1. The regular expression looks for a header tag (h1 through h6): (<h[1-6]>). This will become $1 in the replacement string.
  2. Next, it grabs any text within the header tag: that's (.*), which corresponds to $2.
  3. Third, it finds the closing header tag (again, h1 through h6): (<\/h[1-6]>), which corresponds to $3.
  4. And finally, I included the i at the end for case-insensitivity, in case the HTML contains <H2> instead of <h2>.

The replacement string is pretty simple: it just pieces the header back together. $1 is the opening tag, followed by an opening <span>; $2 is the text of the header; then comes the closing </span>; and $3 is the closing tag.

Now, with all that in mind, you might expect the output to look like this:

<h2><span>First Header</span></h2> Some text. <h3><span>Second Header</span></h3>

But you would be wrong, just like I was wrong. What you would actually get is this:

<h2><span>First Header</h2> Some text. <h3>Second Header</span></h3>

The opening and closing span get split up across the string. Every time I needed to use a regular expression for something, this would happen, and I would curse under my breath a little bit.

The problem here is that the middle match, the (.*), is "greedy." It just keeps matching characters up until the last place that the third part, (<\/h[1-6]>), will match. Because, remember, that will match on </h2> and </h3>, and it's not smart enough to make sure that the number in the closing tag matches the number in the opening tag (if there's a way to do that, I haven't found it yet). So, the regular expression matches the first opening tag, and the last closing tag, and helpfully wraps everything in between in a span tag. It sees our HTML string as containing only a single match.
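On that aside about matching the number in the closing tag: as far as I can tell, a PCRE backreference is one way to do it. The sketch below is my own variation, not the pattern from the example. It adds an extra capture group for the header level, so the groups in the replacement string are renumbered. It assumes the same one-line sample HTML as above.

```php
<?php
// Sample HTML from the example above, kept on a single line.
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";

// Group 2 captures the header level; \2 in the closing tag must repeat it.
// Single quotes matter here: in a double-quoted PHP string, "\2" would be
// read as an octal character escape rather than a backreference.
$regex = '/(<h([1-6])>)(.*)(<\/h\2>)/i';

// Groups are renumbered: $1 opening tag, $3 header text, $4 closing tag.
$spanned = preg_replace($regex, '$1<span>$3</span>$4', $html);

echo $spanned, "\n";
// <h2><span>First Header</span></h2> Some text. <h3><span>Second Header</span></h3>
```

This only sidesteps the problem for this sample. The (.*) is still greedy, so two headers of the same level in the string would be swallowed as a single match again.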

The good news is that this is easy to fix. Like the i I tacked on there for case insensitivity, I can also tack on a U to make the regular expression "ungreedy," like so:

  
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$regex = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/iU";
$spanned = preg_replace($regex, '$1<span>$2</span>$3', $html);

The only change here is the addition of the U at the end of the $regex variable. With that in place, the regular expression will find two matches in the HTML, and I finally get what I wanted all along:

<h2><span>First Header</span></h2> Some text. <h3><span>Second Header</span></h3>
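If you want to check the "one match versus two" claim for yourself, preg_match_all() will report how many matches each pattern finds. This is a minimal sketch; the variable names ($greedy, $ungreedy, and so on) are just ones I picked for the illustration.

```php
<?php
// Same one-line sample HTML as in the examples above.
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";

$greedy   = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/i";
$ungreedy = "/(<h[1-6]>)(.*)(<\/h[1-6]>)/iU";

// preg_match_all() returns the number of full matches it found.
$greedy_count   = preg_match_all($greedy, $html, $greedy_matches);
$ungreedy_count = preg_match_all($ungreedy, $html, $ungreedy_matches);

echo $greedy_count, "\n";   // 1
echo $ungreedy_count, "\n"; // 2

// Index 2 holds what the middle (.*) group captured in each match.
print_r($greedy_matches[2]);   // one entry spanning both headers
print_r($ungreedy_matches[2]); // "First Header" and "Second Header"
```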

The U modifier applies to the entire regular expression, so it's good to use if you want every quantifier in the expression to be ungreedy. Just today, I learned from esteemed Lullabot James Sansbury that you can also be more specific about greediness by adding a ? after a * or + to make just that quantifier ungreedy. In our example, it would look like this:

  
$regex = "/(<h[1-6]>)(.*?)(<\/h[1-6]>)/i";
  

Placing the ? after the .*, I get the same result as I did when using the U modifier at the end. In this case, I'm only using a single * in my regular expression; if I had more than that one, I might want to use this method instead of the global modifier.
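One wrinkle worth knowing before mixing the two approaches: as I read the Pattern Modifiers documentation, U inverts greediness rather than switching it off, so a .*? under the U modifier becomes greedy again. Here's a small sketch comparing all three forms; $replacement, $lazy, $flagged, and $both are names I made up for the comparison.

```php
<?php
// Same one-line sample HTML and replacement string as in the examples above.
$html = "<h2>First Header</h2> Some text. <h3>Second Header</h3>";
$replacement = '$1<span>$2</span>$3';

// Lazy quantifier, no U modifier.
$lazy = preg_replace("/(<h[1-6]>)(.*?)(<\/h[1-6]>)/i", $replacement, $html);

// Plain quantifier with the U modifier.
$flagged = preg_replace("/(<h[1-6]>)(.*)(<\/h[1-6]>)/iU", $replacement, $html);

// Both at once: U inverts the meaning of ?, so .*? is greedy again.
$both = preg_replace("/(<h[1-6]>)(.*?)(<\/h[1-6]>)/iU", $replacement, $html);

var_dump($lazy === $flagged); // bool(true)
echo $both, "\n";
// <h2><span>First Header</h2> Some text. <h3>Second Header</span></h3>
```

So it seems safest to pick one style per expression: either the U modifier or per-quantifier question marks, not both.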

You can learn more about how i, U, and other regex modifiers work in the Pattern Modifiers documentation on php.net. There's also a handy tool called RegExr that will visualize the string as it's matched by a regular expression. Check out the original, greedy regex as compared to the revised, non-greedy alternative.
