Beware Regular Expressions

by Charles Miller on August 18, 2003

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. (Jamie Zawinski, in comp.lang.emacs)

Regular expressions are a very powerful tool. They're also very easy to misuse. The biggest problem with regexps occurs when you use a series of regular expressions as a substitute for an integrated parser.

I recently upgraded Movable Type, and in the process I installed Brad Choate's excellent MT-Textile plugin. MT-Textile gives you a vaguely wiki-like syntax for blog entries that rescues you from writing a mess of angle-brackets every time you want to write a post.

I love MT-Textile, but sadly the more comfortable I get with it, the more I realise its limitations. MT-Textile is built on top of a series of regular expressions, and as such, the more you try to combine different Textile markups, the more likely you are to confuse the parser and end up with something different to what you intended. Any parser built on top of multiple regular expressions gets confused very easily, depending on the order the regexps are applied in.
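To make the ordering problem concrete, here is a minimal Python sketch. The two rules (`BOLD` and `ESCAPE_LT`) and the `render` helper are hypothetical and heavily simplified; they are not MT-Textile's actual patterns, only an illustration of how the same rules give different output depending on the order they run in.

```python
import re

# Hypothetical, simplified rules for illustration; not MT-Textile's real patterns.
BOLD = (re.compile(r'\*([^*]+)\*'), r'<strong>\1</strong>')
ESCAPE_LT = (re.compile(r'<'), '&lt;')


def render(text, rules):
    """Apply each (pattern, replacement) pair to the whole text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


source = '*x < y*'

# Escaping first leaves the generated tags alone.
escape_first = render(source, [ESCAPE_LT, BOLD])

# Bolding first lets the escape rule mangle the tags the bold rule just wrote.
bold_first = render(source, [BOLD, ESCAPE_LT])

print(escape_first)  # <strong>x &lt; y</strong>
print(bold_first)    # &lt;strong>x &lt; y&lt;/strong>
```

Neither rule is wrong on its own. The trouble is that the second pass cannot tell the original text apart from markup the first pass produced.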

I ran into the same problem when I was running my own wiki. I started with a Perl wiki, which (like all Perl code) was highly dependent on regular expressions. I quickly found that the effort required to add new markup to the wiki, keeping in mind the way each regexp would interact with the previous and subsequent expressions, increased exponentially with the complexity of the expression language.
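A small Python sketch of that kind of interaction, under the assumption of two made-up rules (`LINK` and `EMPHASIS`, loosely modelled on Textile-style syntax, not taken from any real wiki engine): a link rule works fine on its own, and adding an emphasis rule breaks it no matter which of the two runs first.

```python
import re

# Hypothetical, simplified rules for illustration only.
LINK = (re.compile(r'"([^"]+)":(\S+)'), r'<a href="\2">\1</a>')
EMPHASIS = (re.compile(r'_([^_]+)_'), r'<em>\1</em>')


def render(text, rules):
    """Apply each (pattern, replacement) pair to the whole text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


source = '"my notes":http://example.com/my_long_page'

# With only the link rule, the output is what you would expect.
before = render(source, [LINK])

# Once the emphasis rule exists, the underscores in the URL are treated as
# markup, whichever order the two rules are applied in.
links_first = render(source, [LINK, EMPHASIS])
emphasis_first = render(source, [EMPHASIS, LINK])

print(before)          # <a href="http://example.com/my_long_page">my notes</a>
print(links_first)     # <a href="http://example.com/my<em>long</em>page">my notes</a>
print(emphasis_first)  # <a href="http://example.com/my<em>long</em>page">my notes</a>
```

Fixing this within the same approach means teaching the emphasis pattern about URLs, and then about every other construct that may contain an underscore, which is where the effort starts to compound.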

After a certain point, diminishing returns will kill you.

I'd like to propose the following rule:

Every regexp that you apply to a particular block of text reduces the applicability of regular expressions by an order of magnitude.

I won't pretend to be a great expert in writing parsers—I dropped out of University before we got to the compiler-design course—but after a point, multiple regular expressions will hurt you, and you're much better off rolling your own parser.
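As a rough sketch of the alternative, and not a full parser, the Python below scans the text once, left to right, and decides at each position which construct it is looking at. The syntax it handles (a Textile-like link and underscore emphasis) and the names `TOKEN` and `render` are assumptions made up for the example. It still uses a regular expression, but only one, as a tokenizer, so text already consumed as a link is never rescanned for emphasis.

```python
import html
import re

# One tokenizer pattern (hypothetical syntax): either a "label":url link
# or an _emphasised_ span. Everything between matches is plain text.
TOKEN = re.compile(r'"(?P<label>[^"]+)":(?P<url>\S+)|_(?P<em>[^_]+)_')


def render(text):
    """Render text in a single left-to-right pass over the input."""
    out = []
    pos = 0
    for match in TOKEN.finditer(text):
        # Plain text before this token is escaped exactly once.
        out.append(html.escape(text[pos:match.start()]))
        if match.group('url') is not None:
            out.append('<a href="%s">%s</a>' % (
                html.escape(match.group('url')),
                html.escape(match.group('label')),
            ))
        else:
            out.append('<em>%s</em>' % html.escape(match.group('em')))
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return ''.join(out)


print(render('see "my notes":http://example.com/my_long_page and _this_'))
# see <a href="http://example.com/my_long_page">my notes</a> and <em>this</em>
```

Nested markup (emphasis inside a link label, say) would need a recursive step that this sketch leaves out, but the output of one rule can no longer be mistaken for input to another.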
