Saturday, February 24, 2007

The world is spinning

If you didn't know that, the title tells it all. Of course, that's old news.

I haven't really been able to blog as much as I would have liked lately. There are reasons for this, of course. Two very exciting reasons, in fact. If everything pans out, I will be able to write about them in two to three weeks' time.

In other news, we are gearing up for another JRuby release. This one will be a biggie. Many nice things will be in place, and it will set the record for both new features and bug fixes. I think no one will be disappointed by it, actually.

I have a few presentations lined up too. The nearest one is in exactly two weeks: I will speak at the Academic Computer Science Festival in Kraków, Poland. If you're somewhere close, by all means offer to show me the city. =) I will land March 9th and fly out again March 11th. My presentation will be at 17:00 CET on March 10th. If you would like more information, it can be found at the festival's homepage, here. The presentation will be in English (since I don't speak Polish, obviously), and it will be slightly more technical than the usual JRuby presentations. There will probably be some detail about our runtime, interpreter, parser and lexer, and hopefully I'll get in some info about our YARV and Java bytecode compiler efforts. This will be very exciting to talk about; I'm kinda salivating just thinking about it. =)

Another presentation, further off, has just been decided. I will attend TheServerSide Java Symposium Europe in Barcelona, from June 27th to June 29th, and talk about JRuby from the perspective of a Java developer, and what it can do for you. Hopefully I will have time to see the city too.

I will update with more information when possible.

Wednesday, February 14, 2007

The new base

I have argued for some time that the JVM should become more like an operating system, with Java as the language for OS development. Other languages should run on top of the JVM for running applications. It seems I'm not alone in this line of thought. Robert Varttinen wrote a bit about it here. To me it seems like a compelling future. But it seems the next logical step for enterprise applications would be further virtualization. I would, for example, like my beans implemented in Ruby. But not only that, I would want all the business logic to be OS agnostic. I would like my Ruby logic to live on top of a JVM J2EE server, but that logic should be able to move, transparently, to a .NET server and provide the same business logic there. What would be even better is if I didn't have to deploy it manually everywhere, and the logic would just move to the places where it's needed. Will we see that anytime soon?

Tuesday, February 13, 2007

RailsConf galore

So, it's decided. Two colleagues from Karolinska Institutet and I will attend RailsConf in Portland, OR. I'm very much looking forward to it, especially since we are also going to JavaOne, and we plan to drive from San Francisco to Portland.

Hopefully I'll get to meet many of the people on the Ruby scene that I've only had mail contact with until now. See you there!

Sunday, February 11, 2007

Ragel performance

I did some performance testing on the old and new Resolver implementations. The test set includes some nasty cases that exercise the bad parts of both implementations (like longest match, where it can't be decided what type something is until we have to backtrack about 20 characters). I placed these 24 strings in an array and pounded on it with an instance of the ResolverImpl, used in exactly the same way as on all scalar values in a YAML document. The objective is to find out whether the value has an implicit type or not. So basically, we give it a String and get back a tag URI. It's not like I'm parsing a language or anything; I'm just doing some recognizing here.
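A timing loop of the kind described could look roughly like the sketch below. The stand-in resolver, the three sample strings and the reduced iteration count are all assumptions made for illustration; the real test used 24 strings, 100 000 iterations and the actual ResolverImpl.

```ruby
require "benchmark"

# Stand-in resolver (an assumption): tags integers and returns nil otherwise.
resolve = lambda do |value|
  value.match?(/\A[-+]?[0-9]+\z/) ? "tag:yaml.org,2002:int" : nil
end

# Three illustrative samples; the real test used 24 strings.
samples = ["42", "-7", "plain text"]
iterations = 1_000 # the real test used 100 000

resolves = 0
elapsed = Benchmark.realtime do
  iterations.times do
    samples.each do |sample|
      resolve.call(sample)
      resolves += 1
    end
  end
end

puts format("%d resolves in %.1f ms", resolves, elapsed * 1000)
```

The total number of resolves is simply samples times iterations, which is how 24 strings and 100 000 iterations become 2 400 000 resolves.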

The old implementation was based on a Map of pattern lists, where the first letter of the string to resolve was used as the key to find a list of patterns to try sequentially. This worked fine, and made it extensible. But not very fast. This is the baseline. For 24 different strings, iterated 100 000 times for a total of 2 400 000 resolves, it takes 7879ms. That's OK, but not great.
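To make the old design concrete, here is a minimal sketch of a first-character index in Ruby. The class name IndexedResolver, its methods and the three patterns are hypothetical and chosen only for illustration; the real implementation is Java and uses its own pattern set.

```ruby
# Hypothetical sketch of a first-character index; names and patterns are illustrative.
class IndexedResolver
  def initialize
    # Maps a first character to the [pattern, tag] pairs worth trying for it.
    @index = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a pattern under every first character it may start with.
  def add(first_chars, pattern, tag)
    first_chars.each_char { |c| @index[c] << [pattern, tag] }
  end

  # Try the candidate patterns in order; return the first matching tag, or nil.
  def resolve(value)
    return nil if value.empty?
    candidates = @index.fetch(value[0], [])
    match = candidates.find { |pattern, _tag| pattern.match?(value) }
    match && match[1]
  end
end

resolver = IndexedResolver.new
resolver.add("-+0123456789", /\A[-+]?[0-9]+\z/, "tag:yaml.org,2002:int")
resolver.add("-+0123456789.", /\A[-+]?[0-9]*\.[0-9]+\z/, "tag:yaml.org,2002:float")
resolver.add("tf", /\A(?:true|false)\z/, "tag:yaml.org,2002:bool")

puts resolver.resolve("3.14")
```

The index narrows the candidates cheaply, but every candidate is still a full regular expression match, so a value that fails several patterns pays for each attempt.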

Now, the new Ragel implementation is dead simple. It's just a translation of the regexps in the aforementioned Patterns into a state machine. In the EOF actions (%/ for people in the Ragel know), I execute an action that sets a local variable to a tag, and at the end the resolve method returns that tag. Dead simple, and not exercising the full strength of Ragel, of course.

So, for the same number of resolves, this ResolverImpl takes 1288ms. That's about 6.1 times as fast (7879 / 1288 ≈ 6.1), so the new version needs roughly 16% of the original time. Ain't it nice to have a friend such as Ragel? And the best part is, for harder tasks, these improvements would be even larger.

Finite State Machines are your friends. All your base are belong to us.
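The sketch below shows the general idea with a hand-written state machine that tells integers from floats in a single pass over the bytes. It is only an illustration: the states and the method name resolve_number are made up here, and Ragel generates its machines from the regular expressions rather than from hand-written case statements.

```ruby
# Hand-written state machine, only an illustration of the idea.
def digit?(byte)
  byte >= 48 && byte <= 57 # ASCII '0'..'9'
end

# Takes an array of byte values and returns a tag URI, or nil if nothing matched.
def resolve_number(bytes)
  state = :start
  bytes.each do |b|
    state =
      case state
      when :start, :sign
        if state == :start && (b == 43 || b == 45) then :sign # '+' or '-'
        elsif digit?(b) then :int
        elsif b == 46 then :dot                               # '.'
        else :fail
        end
      when :int
        if digit?(b) then :int
        elsif b == 46 then :dot
        else :fail
        end
      when :dot, :fraction
        digit?(b) ? :fraction : :fail
      else
        :fail
      end
    return nil if state == :fail
  end

  # The equivalent of an EOF action: the state we ended in decides the tag.
  case state
  when :int then "tag:yaml.org,2002:int"
  when :fraction then "tag:yaml.org,2002:float"
  end
end

puts resolve_number("-3.14".bytes)
```

Each byte is inspected exactly once and nothing is ever retried, which is where the advantage over trying a list of patterns one after another comes from.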

Results of jvYAMLb

Well, the YAML-based loading is in JRuby trunk. On the way, some parts of the codebase got seriously simplified. Very nice. The final result, with regard to performance, is a speed improvement of about 20-30%. But the important gain is in memory usage. The new implementation takes only about one fourth of the memory the original used. So that's great.

Regarding the Resolver, as I mentioned in the last post, it required a different approach, since regular JvYAML uses regular expressions to recognize implicit tags. Since that approach isn't good with byte arrays, I decided to use Ragel to generate a recognizer. That approach was very successful. As soon as I got that working it was the obvious approach. Ragel is good. Ragel is great. Ragel is wonderful. I will use the same approach for regular JvYAML to get away from all those Java regexps.

So, next step will be to do the same conversion of the emitter. Of course, at that point performance isn't that important. It's more about memory usage and the need to get away from another external dependency in JRuby.

Saturday, February 10, 2007

Faster YAML with byte processing

As noted in my last post, I have started work on converting JvYAML into JvYAMLb. Right now I have finished the work on the Scanner and the Parser, and it's looking quite good. The numbers I reported in the last post for regular JvYAML performance were wrong, though. We're looking at about 7.8s to 10.0s for scanning that 3.5MB gemspec file. (And that's only the scanning, not file IO.) But with the Scanner converted to use bytes and ByteList, the same processing takes 2.8s. That's a substantial difference. But it doesn't end with that.

As I said I also converted the Parser. It doesn't do any String processing at all, so I didn't expect either a speedup or slowdown except for that from the Scanner. But... Before, parsing the gemspec took 18.515s, but after, it runs in 4s. That's a dramatic speedup, and I don't really know where it comes from. Unless the earlier implementation generated so much more garbage, and used more memory, that it was noticeable in speed. Anyway, this looks good for JRuby YAML processing, since I expect big reductions in complexity in the callpath and generation of objects after the YAML processor is byted all the way through.

But tomorrow it's time to work on the Resolver, and that's going to be hard. Ideally, it would be nice to have a byte-based Regexp engine. And maybe that would be something for JRuby too, you know? Our Regular Expressions must be dead slow now that they have to convert to strings all the time.

Friday, February 09, 2007

Announcing JvYAMLb, a fork

The conversion to using byte-arrays as the basis of our String work in JRuby has led me to realize that JvYAML just doesn't cut it anymore. The performance wasn't good to begin with, and it's even worse having to convert EVERY SINGLE STRING read into bytes. That's no good. As an example of why something needs to be done, I'm going to describe the transformations that happen to data in JRuby when executing this code:
YAML.load_file "gems.yml"
First, the file is opened, and wrapped inside a RandomAccessFile. Then data is read from it by YAML. Reading will proceed like this:
1. Bytes are read through the RAF, hopefully in chunks.
2. Those bytes are wrapped in a RubyString so they can be returned from the IO#read method.
3. An IOReader wraps that RubyIO object, gets the RubyString and converts it from bytes into a String, and this String gets converted into a char array.
4. That char array is returned to the YAML Scanner.
5. The chars from the char array are collected in a StringBuffer, and saved in various Strings as token values.
6. The parser, resolver and constructor work on these Strings in various ways.
7. The JRubyConstructor takes these Strings and creates RubyString objects from them, in the process converting each String back to a byte array.

Is there any doubt that this process is slow? Well, it hasn't been that big of a problem until now, since we are doing so well on performance in other parts of the system.
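The round trip in the steps above can be mimicked in a few lines of plain Ruby. This is only an analogy, assuming a small made-up input; the real path goes through Java Strings, char arrays and StringBuffers inside JRuby.

```ruby
conversions = 0

raw = "name: jruby\n".b                 # steps 1-2: bytes as read from the file
text = raw.dup.force_encoding("UTF-8")  # step 3: bytes reinterpreted as text
conversions += 1
chars = text.chars                      # step 3: text split into characters
conversions += 1
token = chars.join                      # step 5: characters collected into a token string
conversions += 1
result = token.b                        # step 7: text turned back into bytes
conversions += 1

puts "#{conversions} conversions, same bytes at the end: #{result == raw}"
```

The bytes that come out are the same bytes that went in, so every intermediate copy is pure overhead.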

So, the radical decision is to rewrite JvYAML, making it more SYCK-compliant, working with InputStreams and byte-arrays, and in the process get away from several of the steps above. So that's what I'm going to do. I hereby create JvYAMLb. It will only be a part of the JRuby codebase, but it will be reasonably separate, so it can be extracted for other purposes. I will not stop work on regular JvYAML, but will maintain both projects.

Since the objective of this new project is blazing speed, I will post some numbers on this now and again. But first I will show you the speed of the regular system. JvYAML's Scanner can scan an old gem source index (about 3.5MB) of 435654 tokens in about 1654ms. This is the baseline I'm going to use to test performance, and I'll post more on this as soon as the byte-based Scanner is ready to try out.

Bytes bites. Or maybe not.

Well, the byte arrays are in, for good and evil. We had to wrap them in a counterpart to StringBuffer, but backed by byte[] instead, since all that explicit allocation and deallocation was way unperformant.
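As a rough illustration of what such a wrapper does, here is a minimal growable byte buffer. The name GrowableBytes and its methods are hypothetical, not the class used in JRuby. Ruby arrays already grow on their own, so the explicit capacity handling here only mirrors what a fixed-size Java byte[] would require.

```ruby
# Hypothetical sketch of a StringBuffer-like container for bytes.
class GrowableBytes
  attr_reader :length

  def initialize(capacity = 16)
    @bytes = Array.new(capacity, 0) # stands in for a fixed-size byte[]
    @length = 0
  end

  def capacity
    @bytes.length
  end

  # Append one byte, doubling the backing storage when it is full.
  def append(byte)
    grow if @length == capacity
    @bytes[@length] = byte & 0xFF
    @length += 1
    self
  end

  def append_all(bytes)
    bytes.each { |b| append(b) }
    self
  end

  # Only the used part of the backing storage.
  def to_a
    @bytes[0, @length]
  end

  private

  def grow
    extra = [capacity, 1].max
    @bytes.concat(Array.new(extra, 0))
  end
end

buffer = GrowableBytes.new(2)
buffer.append_all("JRuby".bytes)
puts buffer.to_a.inspect
```

Doubling the capacity keeps the number of reallocations small, which is the point of wrapping the raw array instead of allocating a new one for every change.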

Of course, we aren't seeing any performance benefits from this right now. The problem is that there are still many places that use IRubyObject#toString to get at the contents. That operation is very expensive right now, so gem installs are slower, for example. But we have good hopes of improving the situation, and many parts of the codebase have become much clearer without the need to do String-to-byte[] and byte[]-to-String all over the place.

Tuesday, February 06, 2007

Fractured blogging

My blogging in the future will be a little bit fractured (or more fractured, some might say), since I have been invited to write at Inside Java for Apress. The address is http://java.apress.com and I do recommend that you subscribe to the feed if you're interested in Java or associated technologies. My first posting was about the two closure proposals for Java 7, and I will try to keep my posting there more Java specific, mostly in article format. But if I write anything I deem to be exceptionally good, I promise to link from here to there.

Go subscribe now!

Serial JRuby

Things are really moving along faster than ever in JRuby land. It's so fun! As my last entry told you, Hpricot is now available for JRuby (and Java) people. I need to share a few lines from the logs of yesterday evening's conversation at #jruby:

<headius> seeya ola!
* shellac does some xsl-ing, plays on the wii, then finds ola got HPRICOT working in that time
<shellac> I'm wasting my life

Some would say that what I do with JRuby is a waste of life... Well, we'll see about that.

Anyway, what's happened in JRuby world since last week? First, and most important, Charles has changed our RubyString implementation. It used to be backed by either a Java String or a StringBuffer. The problem with both of these is that Ruby has a tendency to use Strings as byte buckets. And our code was riddled with encoding and decoding into and out of byte arrays. So Charles took the big step, converted RubyString to use a byte-array instead, and fixed all the bugs that he found by doing that. The result is a happier codebase, less encoding and possibly faster Zlib and IO operations. That's big.
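What "byte bucket" means is easy to show in a current Ruby, where a String can hold arbitrary bytes that are not valid text in any character encoding. The values below are arbitrary and chosen only for illustration.

```ruby
# Pack four raw byte values into a String.
bucket = [1, 255, 0, 128].pack("C*")

puts bucket.encoding  # a binary (byte-oriented) encoding
puts bucket.bytesize  # number of bytes held

# The same bytes are not valid UTF-8 text.
as_text = bucket.dup.force_encoding("UTF-8")
puts as_text.valid_encoding?

# The byte values survive untouched.
puts bucket.bytes.inspect
```

A string like this cannot pass through a character-based representation without being encoded and decoded at every boundary, which is the kind of code a byte-array-backed RubyString removes.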

Tom is working on removing visibility and refactoring scopes. That could have huge impact too.

This Sunday I merged and fixed some code that allows Ruby code to inherit from Java classes and override methods there, and the overridden methods will be used when an instance is sent back to Java. I'm planning on using this for some interesting tricks with Java ContentHandlers, and this functionality is really, really, really important. But it's also complex, since it requires generating bytecode at runtime. Fun, but hard. But now it's in trunk, and it's time to find the bugs in it and fix them.

I also need you to go read what Jonas Bonér has done with JRuby and OpenTerracotta. I could describe it here, but Jonas does a good job of it himself. So go there: http://jonasboner.com/2007/02/05/clustering-jruby-with-open-terracotta/. Very cool stuff, indeed!

So, the future is coming faster each day. JRuby will still conquer the world!

Hpricot goodness

This is just so cool, I cannot contain it. For those of you who haven't heard about Hpricot, it is one of why the lucky stiff's incredibly cool tools (which he probably will use to take over the world any day now...). It's HTML parsing goodness, very flexible, with the goal of being able to parse (and fix) everything that Firefox handles.

"So what?" you're probably asking... Well, Hpricot uses Ragel and some C code to achieve blinding speed. This means JRuby can't run it. Or I should say couldn't run it:

orpheus:~/workspace/jruby> jruby bin/gem install hpricot --source http://code.whytheluckystiff.net
Bulk updating Gem source index for: http://code.whytheluckystiff.net
Select which gem to install for your platform (java)
1. hpricot 0.5.110 (jruby)
2. hpricot 0.5.110 (mswin32)
3. hpricot 0.5.110 (ruby)
4. hpricot 0.5 (ruby)
5. hpricot 0.5 (mswin32)
6. hpricot 0.5.0 (ruby)
7. hpricot 0.5.0 (mswin32)
8. hpricot 0.4.99 (ruby)
9. hpricot 0.4.99 (mswin32)
10. hpricot 0.4.92 (ruby)
11. hpricot 0.4.92 (mswin32)
12. Skip this gem
13. Cancel installation
> 1
Successfully installed hpricot-0.5.110-jruby
Installing ri documentation for hpricot-0.5.110-jruby...
Installing RDoc documentation for hpricot-0.5.110-jruby...

That's right, Hpricot is now more promiscuous than any other gem with native parts.

What can you do with it? Well, I'm just going to point you to _why's own description of it. All he says at http://code.whytheluckystiff.net/hpricot/ will work fine in JRuby!

How did this come to be? Well, _why and I did some joint hacking, which was helped along by the fact that Adrian Thurston (the genius behind Ragel) recently added Java support to it. So, basically, most of the Ragel definition is exactly the same for both the C and the Java versions. The native code has been factored out, and both versions are buildable with rake from _why's code repository.

This is important. Don't think anything else. This strategy will, and can, be used for other gems with native parts. It's just a question of time.