tisdag, juni 17, 2008

Testing Regular Expressions

Something has been worrying me a bit lately. Being test infected and all, and working for ThoughtWorks, where testing is part of the life blood, I think more and more about these issues. And one thing I've started noticing is that regular expressions seems to be a total blind spot in many cases. I first started thinking about it when I changed a quite complicated regular expression in RSpec. Now RSpec has coverage tests as part of their build, and if the test coverage is less than a 100%, the build will fail. Now, since I had changed something to add new functionality, but hadn't added any tests for it, I instinctively assumed that it would be caught be the coverage tool.

Guess what? It wasn't. Of course, if I had changed the regexp to do something that the surrounding code couldn't support, one of the tests for surrounding lines of code would have caught it, but I got no mention from the coverage tool that I needed more tests to fully handle the regular expressions. This is logical if you think about it. There is no way that a coverage tool could find all the regular expressions in your source code, and then make sure that all branches and alternatives of that particular regular expression was exercised. So that means that the coverage tool doesn't do anything with them at all.

OK, I can live with that, but it's still one of those points that would be very good to keep in mind. Every time you write a regular expression in your code, you need to take special care to actually exercise that part of the code with many inputs. What is many in this case? That's another part of the problem - it depends on the regular expression. It depends on how complicated it is, how long it is, how many special operators are used, and so on. There is no real way around it. To test a regular expression, you really need to understand how they work. The corollary is obvious - to use a regular expression in your code, you need to know how to test it. Conclusion - you need to understand regular expressions.

In many code bases I haven't seen any tests for regular expressions at all. In most cases these have been crafted by writing them outside the code, testing them by hand, and then putting them in the code. This is brittle to say the least. In the cases where there are tests, it's much more common that they only test positives, and not negatives. And I've seldom heard of code bases with enough tests for regular expressions. One of the problems is that in a language like Ruby, they are so easy to use, so you stick them in all over the place. A standard refactoring could help here, by extracting all literal regular expressions to constants. But then the problem becomes another - as soon as you use regular expressions to extract values from a string, it's a pain to not have the regular expression at the same place as the extracted groups are used. Example:
PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/
# 200 lines of code
if phone_number =~ PhoneRegexp
puts "phone number is: #$1-#$2-#$3"
If the regular expression had been at the same place as the usage of the $1, $2 and $3 it would have been easy to tie them to the parts of the string. In this case it would be easy anyway, but in more complicated cases it's more complicated. The solution to this is easy - the dollar numbers are evil: don't use them. Instead use an idiom like this:
area, number, extension = PhoneRegexp.match(phone_number).captures
In Ruby 1.9 you will be able to use named captures, and that will make it even easier to make readable usage of the extracted parts of a string. But fact is, the difference between the usage point and the definition point can still cause trouble. A way of getting around this would be to take any complicated regular expression and putting it inside of a specific class for only that purpose. The class would then encapsulate the usage, and would also allow you to test the regular expression more or less in isolation. In the example above, maybe creating a PhoneNumberParser would be a good idea.

At the end of the day, regular expressions are an extremely complicated feature, and in general we don't test the usage of them enough. So you should start. Begin by first creating both positive and negative tests for them. Figure out the boundaries, and see where they can go wrong. Know regular expressions well enough to know what happens in these strange circumstances. Think about unicode characters. Think about whitespace. Think about greedy and lazy matching. As an example of something that took a long time to cause trouble; what's wrong with this regexp that tries to discern if a string is a select statement or not?
And this example actually covers most of the ground, already. It checks case insensitive. It checks for white space before any optional parenthesis, and for any white space after. It makes sure that the word SELECT isn't continued by checking for at least one non word character. So what's wrong with it? Well... It's the caret. Imagine if we had a string like this:
"INSERT INTO foo(a,b,c)\nSELECT * FROM bar"
The regular expression will in fact match this, even though it's not a select statement. Why? Well, it just so happens that the caret matches the beginning of lines, not the beginning of strings. The dollar sign works the same way, matching the end of lines. How do you solve it? Change the caret to \A and the dollar sign to \Z and it will work as expected. A similar problem can show up with the "." to match any character. Depending on which language you are using, the dot might or might not match a newline. Always make sure you know which one you want, and what you don't want.

Finally, these are just some thoughts I had while writing it. There is much more advice to give, but it can be condensed to this: understand regular expressions, and test them. The dot isn't as simple as it seem. Regular expressions are a full blown language, even though it's not turing complete (in most implementations). That means that you can't test it completely, in the general case. This doesn't mean you shouldn't try to cover all eventualities.

How are you testing your regular expressions? How much?

11 kommentarer:

Anonym sa...

Thanks for writing this nice post - its always good to think more about our blind spots. As to testing regexps, I would think it would be a simple matter to encapsulate it into a single method and then test the hell out of it - a full-blown class seems a bit too much.

(adding semi-colons in the expectation that the code will be destroyed in the process of posting it)
def parse_phone example;
PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/;
if phone_number =~ PhoneRegexp;
return {:area => $1, :number => $2, :extension => $3};
return nil;

This would seem fairly simple to test with whatever you could come up with (sorry, it requires some imagination to come up with good tests).

Anonym sa...

It really is a bit of a problem testing regular expressions. And it seems the best solution is brute force tests with all the input you can think of.

One thing that might help a bit is, specially if you choose to encapsulate the regular expression inside a method, you can test if the result variables are what you hoped for. Check if the string isn't to big, if it really is a number, if it doesn't contain spaces, that sort of validaions. It can give you a bit more security...

Ben Mabey sa...

I have seen a similar pattern in my projects where common regular expressions are extracted into constants. This allows you to write specific examples against each regex outside of any implementation context. Your PhoneNumberParser is a good idea, however for most general regexs I like to keep them grouped in a single Formats module. So I'll have a Formats::EMAIL, Format::PHONE, etc, and my formats_spec will have example groups for each regex constant in the module.

Aslak Hellesøy sa...

Sure - even though RCov reports 100% line coverage and all self-specs pass, we don't have 100% branch coverage (far from it).

What this means in practice is that RSpec *probably* works, but that there are plenty of changes that can be made without having any specs pick it up.

You need a mutation testing tool to do that - such as Heckle, and I'm afraid we're not using that yet. (But RSpec supports it!)

Anonym sa...

I'd be careful with generalisations about people not testing regular expressions. I know lots of people who test regular expressions.

Here's a few guidelines I like to use:

* Understand that you won't be able to test all scenarios
* Use a good sample of real examples for both positive and negative cases. Add any terms that fail in production to the tests
* Like any good tests, leave behind the intent in whatever form works best (this may mean creating separate test cases for iterating through the samples). Tell people that whitespace is/isn't important, etc.
* Run through your samples in a smart way. You might group similar variations and iterate through them with a meaningful error message if it fails (related to points 2 and 3)


Anonym sa...

I use TDD, so I build the regular expression as I write the tests for it. Good thing, too, else I'd never get them right!

Just as in all testing, you can't test all possibilities - you test the obvious cases plus the edge cases.

I don't really think regular expressions are significantly different from other code as far as testing is concerned.


Anonym sa...

Interesting that you point out that the coverage tool didn't cover the branches of the regular expression. Which is, like you said, logical when you think of it.
Actually if you look at Regular Exp ression as a DSL for 'string manipulation', it becomes clear that branches won't be checked if the coverage-tool doesn't support the language.
So the same counts for SQL-statements in most java applications. (Or xml based DSL's for that matter). If you change the SQL code and create a new branch, I think most coverage tools wont detect the new branch.

I guess you just allways have to evaluate where your application uses DSL't that are not 'read' by your coverage tool. And check the coverage by hand.

Daniel sa...

That regular expression won't catch spaces between parenthesis: "( ( SELECT ..." won't be recognized.

It also won't catch a select which is at the end of line, because it requires one more character. Now, that might be expected behavior, but if select can continue in a following line without any kind of escape, it won't work either.

Ola Bini sa...

Daniel, true about the space between parenthesis, so you need to have something like /(?:\s*\()*\s*/ to loop it, but I didn't want to make it more complicated than it was.

With regard to your other points you're wrong. \W will actually match end of line, but not end of string.

Dave Kirby sa...

The /^\s*\(*\s*SELECT\W+/i regex has another problem that has not been mentioned - it is highly inefficient. If it tries to match a string that starts with N spaces not followed by SELECT, then it will make N**2 matches before it gives up. This may not be noticeable for single matches on small strings, but if you were searching through every line in a large text file then it could slow things down by orders of magnitude.

If you don't see why it takes N**2 matches before it fails then I recommend reading the chapter on regex performance in the book "Mastering Regular Expressions" by Jeffrey Friedl.

Here is a more efficient version of the regex:

Demon sa...

Regular expression is really wonderful to parsing HTML or matching pattern. I use this a lot when i code. Actually when I learn any new langauge, first of all I first try whether it supports regex or not. I feel ezee when I found that.


Here is about ruby regex. This was posted by me when I first learn ruby regex. So it will be helpfull for New coders.