fredag, oktober 26, 2007

Current state of Regular Expressions

As I've made clear earlier, the current regular expression situation has once again become impractical. To reiterate the history: We began with regular Java regex support. This started to cave in when we found out that the algorithm used is actually recursive, and fails for some common regexps used inside Rails among others. To fix that, we integrated JRegex instead. That's the engine 1.0 was released with and is still the engine in use. It works fairly well, and is fast for a Java engine. But not fast enough. In particular, there is no support for searching for exact strings and failing fast, and the engine requires us to transform our byte[]-strings to char[] or String. Not exactly optimal. Another problem is that compatibility with MRI suffers, especially in the multi byte support.

There are two solutions currently on the way. Core developer Marcin are working on a port of the 1.9 regexp engine Oniguruma. This port still has some way to go, and is not integrated with JRuby. The other effort is called REJ, and is a port of the MRI engine I did a few months back. I've freshened up the work and integrated it with JRuby in a branch. At the moment this work actually seems to go quite well, but there are some snags.

First of all, let me point out that this approach gives us more or less total multibyte compatibility for 1.8, which is quite nice.

When doing benchmarking, I'm generally using Rails as the bar. I have a series of regular expressions that Petstore uses for each requst, and I'm using these to check performance. As a first datapoint, JRuby+REJ is faster at parsing regexps than JRuby trunk for basically all regexps. This ranges from slightly faster to twice as fast.

Most of the Rails regexen are actually faster in REJ than in JRuby+trunk, but the problem is that some of them are actually quite a bit slower. 4 of the 22 Rails regexps are slower, by between 20 and 250% percent. There are also this one: /.*_f/ =~ "_fxxxxxxxxxxxxxxxxxxxxxxx" which basically runs about 10x slower than JRuby trunk. Not nice at all.

In the end, the problem is backtracking. Since REJ is a straight port of the MRI code, the backtracking is also ported. But it seems that Java is unusually bad at handling that specific algorithm, and it performs quite badly. At the moment I'm continuing to look at it and trying to improve performance in all ways possible, so we'll see what happens. Charles Nutter have also started to look at it.

But what's really interesting is that I reran my Petstore benchmarks with the current REJ code. To rehash, my last results with JRuby trunk looked like this:
controller :   1.804000   0.000000   1.804000 (  1.804000)
view : 5.510000 0.000000 5.510000 ( 5.510000)
full action: 13.876000 0.000000 13.876000 ( 13.876000)
But the results from rerunning with REJ was interesting, to say the least. I expected bad results because of the bad backtracking performance, but it seems the other speed improvements weigh up:
controller :   1.782000   0.000000   1.782000 (  1.782000)
view : 4.735000 0.000000 4.735000 ( 4.735000)
full action: 12.727000 0.000000 12.727000 ( 12.727000)
As you can see, the improvement is quite large in the view numbers. It is also almost there compared to MRI which had 4.57. Finally, the full action is better by a full second too. Again, MRI is 9.57s and JRuby 12.72. It's getting closer. I am quite optimistic right now, provided that we manage to fix the remaining problems with backtracking, our regexp engine might well be a great boon to performance.

2 kommentarer:

Daniel Berger sa...

Instead of trying to port Oniguruma, what about just using JNA to write bindings for it?

That could save you a lot of time and effort, and kick you guys into gear as far as getting a JNA backend interface going that the rest of us can then consume and refine.

lopex sa...

Daniel:

We've been thinking about it already. There are few reasons:

Threading: Oniguruma uses global locks when initializing code range tables or managing shared AST nodes (like Character Class hashtable).
Oniguruma bytecode interpreter also uses thread locks (it can be turned off but we get it for free in java land, and it'd be a hack to mix foreign threading with java one).

Exceptions: it would be hard to recover from segfaults. Converting Oniguruma errors to Ruby exceptions would also be an ugly hack.

JNI: it requires data separation, so all strings/bytes would have to be copied.

Additional binary distribution: good luck compiling it one Mainframe :D