Tuesday, July 17, 2007

Back to JRuby regular expressions

It seems that this issue comes up every third month. After all the work we have done, we realize that regular expressions need some real work again. Our current solution works quite well. We have imported JRegex into JRuby and made a whole slew of modifications to it. It runs well and has no issues with regular expressions that nest too deeply (Java's engine uses a recursive algorithm, which makes it overflow the stack for certain inputs. Certain very common inputs, in say ... Rails. *sigh*).

But JRegex is good. It's not perfect though. It's slightly slower than the Java engine, it doesn't support everything in the Java engine, and conversely, it supports some things that Java doesn't. The major problem is that we don't have MRI-compliant multibyte support, and the implementation of our engine is wildly different from both MRI's engine and Oniguruma.

At some point we will probably just bite the bullet and do a real port of Oniguruma. But until that time comes, I have extracted our current regular expression stuff and put everything behind a common interface. What that means is that with the current trunk, you can actually choose which regular expression engine you want to use. You can even write your own and plug it in. The interface is really small right now. At the moment we only have JRegex and Java, and the Java engine doesn't pass all tests (I think; I haven't tried, since that wasn't the point of this exercise). Anyway, it means you can have Java regular expressions if you want them, right in your JRuby code. But only where you want them. So, you can control which engine is used globally by doing one of these two:
jruby -J-Djruby.regexp=java your_ruby_script.rb
jruby -J-Djruby.regexp=jregex your_ruby_script.rb
The last is currently the default, so it's not needed. In the future JRegex may not be the default, but this option should still be there. The nicer thing about this is that you can also use Java regexps inline, even if you want to use JRegex for most expressions:
begin
  p(/\p{javaLowerCase}*/ =~ "abc")
  p $&
rescue => e
  p e
end

p(/\p{javaLowerCase}*/j =~ "abc")
p $&
Now, the first example will actually raise a RegexpError, because javaLowerCase is not a valid character class in JRegex. But note the small "j" I've added to the second Regexp literal! That expression works and will match exactly as you would expect.
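
To make the idea of "one global default engine, plus a per-expression override" a bit more concrete, here is a minimal sketch in plain Ruby. This is not the actual JRuby interface (that one lives in Java, and I'm not reproducing it here). The names RegexpEngines, BuiltinEngine and LiteralEngine are made up purely for illustration, and the "literal" engine is just a toy standing in for a second, differently behaving engine.

```ruby
# Hypothetical sketch only: none of these names exist in JRuby.
module RegexpEngines
  # Engine backed by Ruby's own Regexp; plays the role of the default engine.
  class BuiltinEngine
    def compile(source)
      Regexp.new(source)
    end
  end

  # Toy engine that treats the pattern as a plain literal string; plays the
  # role of an alternative engine with different syntax support.
  class LiteralEngine
    Matcher = Struct.new(:literal) do
      # Mirror Regexp#=~: return the offset of the match, or nil.
      def =~(input)
        input.index(literal)
      end
    end

    def compile(source)
      Matcher.new(source)
    end
  end

  # Engines are looked up by name, much like a -Djruby.regexp=... setting.
  REGISTRY = {
    "builtin" => BuiltinEngine.new,
    "literal" => LiteralEngine.new
  }

  @default_name = "builtin"

  class << self
    attr_reader :default_name

    # Change the engine used when no per-expression override is given.
    def default=(name)
      raise ArgumentError, "unknown regexp engine: #{name}" unless REGISTRY.key?(name)
      @default_name = name
    end

    # Plug in your own engine: any object that responds to compile(source).
    def register(name, engine)
      REGISTRY[name] = engine
    end

    # Compile with the global default, or with an explicit engine
    # (the analogue of the trailing "j" flag on a literal).
    def compile(source, engine: nil)
      name = engine || default_name
      REGISTRY.fetch(name) { raise ArgumentError, "unknown regexp engine: #{name}" }
              .compile(source)
    end
  end
end

p(RegexpEngines.compile("a.c") =~ "xxabc")                     # 2
p(RegexpEngines.compile("a.c", engine: "literal") =~ "xxabc")  # nil
```

The same pattern source gives different results depending on the engine, which is exactly why the override is per expression: most code keeps using the default, and only the expressions that need the other engine's behavior ask for it.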
