But the discovery I made was when I looked at the performance of the regular expressions used in Rails. There are exactly 50 of them for each request, so I did a script that checked the performance of each of them against MRI. And I found that there was one in particular that had really interesting performance when comparing MRI to JRuby. In fact, it was between 200 and a 1000 times slower. What's worse, the performance wasn't linear.
So which regular expression was the culprit? Well, /.*?\n/m. That doesn't look to bad. And in fact, this expression displayed not one, but two problems with JRuby. The first one is that any regular expression engine should be able to fail fast on something like this, simply because there is a string that always needs to be part of a string for this expression to match. In MRI, that part of the engine is called bm_search, and is a very fast way to fail. JRuby doesn't have that. Marcin is working on a port of Oniguruma though, so that will fix that part of the problem.
But wait, if you grep for this regexp in the Rails sources you won't find it. So where was it actually used? Here is the kicker: it was used in JRuby's implementation of String#each_line. So, let's take some time to look at a quick benchmark for each_line:
require 'benchmark'As you can see, we simple measure the performance of doing a 100 000 each_line calls on three different strings. The first one is a short string with several newlines, the second is a short string with no newlines, and the last is a long string with no newlines. How does MRI run this benchmark?
str = "Content-Type: text/html; charset=utf-8\r\nSet-Cookie: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa "
TIMES=100_000
puts "each_line on small string with several lines"
10.times do
puts(Benchmark.measure{TIMES.times { str.each_line{} }})
end
str = "abc" * 15
puts "each_line on short string with no line divisions"
10.times do
puts(Benchmark.measure{TIMES.times { str.each_line{} }})
end
str = "abc" * 4000
puts "each_line on large string with no line divisions"
10.times do
puts(Benchmark.measure{TIMES.times { str.each_line{} }})
end
each_line on small string with several linesOK, this looks reasonable. The large string is obviously taking more time to search, but not incredibly much time. What about trunk JRuby?
0.160000 0.000000 0.160000 ( 0.157664)
0.150000 0.000000 0.150000 ( 0.160450)
0.160000 0.000000 0.160000 ( 0.171563)
0.150000 0.000000 0.150000 ( 0.157854)
0.150000 0.000000 0.150000 ( 0.154578)
0.150000 0.000000 0.150000 ( 0.154547)
0.160000 0.000000 0.160000 ( 0.158894)
0.150000 0.000000 0.150000 ( 0.158064)
0.150000 0.010000 0.160000 ( 0.156975)
0.160000 0.000000 0.160000 ( 0.156857)
each_line on short string with no line divisions
0.080000 0.000000 0.080000 ( 0.086789)
0.090000 0.000000 0.090000 ( 0.084559)
0.080000 0.000000 0.080000 ( 0.093477)
0.090000 0.000000 0.090000 ( 0.084700)
0.080000 0.000000 0.080000 ( 0.089917)
0.090000 0.000000 0.090000 ( 0.084176)
0.080000 0.000000 0.080000 ( 0.086735)
0.090000 0.000000 0.090000 ( 0.085536)
0.080000 0.000000 0.080000 ( 0.084668)
0.090000 0.000000 0.090000 ( 0.090176)
each_line on large string with no line divisions
3.350000 0.020000 3.370000 ( 3.404514)
3.330000 0.020000 3.350000 ( 3.690576)
3.320000 0.040000 3.360000 ( 3.851804)
3.320000 0.020000 3.340000 ( 3.651748)
3.340000 0.020000 3.360000 ( 3.478186)
3.340000 0.020000 3.360000 ( 3.447704)
3.330000 0.020000 3.350000 ( 3.448651)
3.350000 0.010000 3.360000 ( 3.489842)
3.350000 0.020000 3.370000 ( 3.429135)
3.350000 0.010000 3.360000 ( 3.372925)
each_line on small string with several linesOoops. That doesn't look so good... And also, where is the last ten lines? Eh... It's still running. It's been running for two hours to produce the first line. That means that it's taking at least 7200 seconds which is more than 2400 times slower than MRI. But in fact, since the matching of the regular expression above is not linear, but exponential in performance, I don't expect this to ever finish.
32.668000 0.000000 32.668000 ( 32.668000)
30.785000 0.000000 30.785000 ( 30.785000)
30.824000 0.000000 30.824000 ( 30.824000)
30.878000 0.000000 30.878000 ( 30.877000)
30.904000 0.000000 30.904000 ( 30.904000)
30.826000 0.000000 30.826000 ( 30.826000)
30.550000 0.000000 30.550000 ( 30.550000)
32.331000 0.000000 32.331000 ( 32.331000)
30.971000 0.000000 30.971000 ( 30.971000)
30.537000 0.000000 30.537000 ( 30.537000)
each_line on short string with no line divisions
7.472000 0.000000 7.472000 ( 7.472000)
7.350000 0.000000 7.350000 ( 7.350000)
7.516000 0.000000 7.516000 ( 7.516000)
7.252000 0.000000 7.252000 ( 7.252000)
7.313000 0.000000 7.313000 ( 7.313000)
7.262000 0.000000 7.262000 ( 7.262000)
7.383000 0.000000 7.383000 ( 7.383000)
7.786000 0.000000 7.786000 ( 7.786000)
7.583000 0.000000 7.583000 ( 7.583000)
7.529000 0.000000 7.529000 ( 7.529000)
each_line on large string with no line divisions
There are a few interesting lessons to take away from this exercise:
- There may still be implementation problems like this in many parts of JRuby - performance will improve by quite much every time we find something like this. I haven't measured Rails performance after this is fixed, and I don't expect it to actually fix the whole problem, but I think I'll see better numbers.
- Understand regular expressions. Why is /.*?\n/ so incredibly bad for strings over a certain length? In this case it's the combination of .* and ?. What would be a better implementation in almost all cases? /[^\n]\n/. Notice that there is no backtracking in this implementation, and because of that, this regexp will have performance O(n) while the earlier one was O(n^2). Learn and know these things. They are the difference between usage and expertise.
 
 

11 kommentarer:
Ola,
Thanks for this post; this made me go back and look at our regexp usage in places. One question: can you recommend a good resource (book/paper/website/etc.) on regexp (or pcre?) syntax/semantics/implementation/performance? I was not aware that the combination of ".*" and "?" was meaningful at all: I thought the "?" would be redundant, given the "0 or more" semantics of "*".
Regarding resources, the canonical resource is Mastering Regular Expressions by Jeffrey Friedl. It's the best one out there, but I'm not all fond of it. One of it's shortcomings is the severe mixing of terms (for example DFA and NFA regexps). Anyway, its basically the only one, expect various papers on the net.
Appending a question mark to a valid repetition (*, +, ?, or {n}) is valid, and makes the repetition non-greedy. You can search for non-greedy regular expressions and get a better explanation.
Great post. This episode powerfully reinforces the "measure, don't guess" rule of optimization. Would you have ever guessed this was the first bottleneck?
I think you mean O(2^n) and not O(n^2) in the end.
Back in the 1980s, Henry Spencer implemented and released into the public domain a reg-exp library which used backtracking. Unfortunately, nearly every regexp implementation since then (Perl, Ruby, Python, PHP, PCRE, etc.) has reused this approach.
In the absence of back-references (\1,\2, etc.), there is an older and much faster implementation based on finite automata. awk, grep, and company use the faster algorithm.
Wouldn't it be nice if we could replace all these slow regexp implementations with one that used Ken Thompson's classic, fast NFA algorithm, and fell back to the slower backtracking algorithm only when needed to accommodate back-reference extensions?
For more info, see Russ Cox's article, Regular Expression Matching Can Be Simple And Fast
qerub: no, I meant O(n^2), that is an exponential growth, based on the square of n. 2^n is a different class.
gnikdrazil: well, the problem is that implementing all of the features you need to be compatible with MRI is quite hard with a DFA-based implementation. That said, a hybrid solution would have been nice in this case.
I'm confused. Can we agree on that O(n^2) is quadratic growth and O(2^n) is exponential growth?
qerub: yes, you're right. I confused the two.
http://en.wikipedia.org/wiki/Exponential_time
Hi!
Maybe I got something wrong, but I think that
/[^\n]*\n/ and /.*\n/ (that is, without /m) are strictly equivalent.
And why would you need non-greedy matching, if /./ won't match \n anyway?
I know this is an old article, but I have to ask: Why have they implemented String#each_line using regular expressions, when they know the performance is so bad? After all, it shouldn't be hard to make an alternative implementation.
Skicka en kommentar