A few days ago, two colleagues and I tried to track down a very tricky bug. After several hours of looking, we finally found it, and it turned out to be due to a misconception I had about the workings of HashSet and HashMap. I'm not sure if I'm the only one who didn't know this, but it's quite logical once you've figured it out. You see, if you put an object in a HashSet, and then change the object in such a way that its hashCode changes, you won't find that object in the set anymore. It will still be there, and you will still iterate over it, but if you ask, for example, set.contains(obj), it will return false. If you iterate over the set and call Iterator#remove, this will silently fail to remove anything, since the HashSet can't find the object you want to remove. So, if you store things in a HashSet or use them as keys in a HashMap, make sure the objects are immutable, otherwise you'll get extremely hard-to-find bugs.
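To make the failure mode concrete, here is a minimal sketch, using a hypothetical mutable Point class of my own, that loses an object inside a HashSet the moment its hashCode changes:

import java.util.HashSet;
import java.util.Set;

// Hypothetical mutable class whose hashCode depends on its state.
class Point {
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    public boolean equals(Object o) {
        return (o instanceof Point) && ((Point) o).x == x && ((Point) o).y == y;
    }
    public int hashCode() { return 31 * x + y; }
}

public class MutableKeyDemo {
    public static void main(String[] args) {
        Set<Point> set = new HashSet<Point>();
        Point p = new Point(1, 2);
        set.add(p);

        p.x = 42; // hashCode changes while p sits inside the set

        System.out.println(set.contains(p)); // false - the lookup probes the wrong bucket
        System.out.println(set.size());      // 1 - the object is still in there
    }
}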
Incidentally, one of the best newsletters about Java programming wrote about this issue ages ago. Regardless of whether you work with Java professionally or just for fun, I implore you to subscribe to JavaSpecialists by Dr Heinz Max Kabutz. It can be found here.
Sunday, June 18, 2006
Saturday, June 17, 2006
JRuby developments
I've spent another Saturday at work, mostly working on KIMKAT, but taking some time to look at different JRuby issues before the 0.9 release. The first - and only major - one was that Zlib didn't seem to work for Charlie and Tom. I couldn't reproduce this error locally, which was very strange. But finally I found the error in my code, and also the reason I couldn't reproduce it in my own JRuby.
I'd managed to not remove a zlib.rb from one of the directories, so my test cases worked (albeit slowly) because they were using the old Zlib code, and not my newly written one. After I found this problem, it went fairly fast to isolate the real error, in my GzipWriter. The problem was that I didn't call finish on the GZIPOutputStream, which means the gzip stream doesn't end with a gzip trailer. Once this was added, everything seemed to work.
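For illustration, here is a minimal standalone sketch of the fix: java.util.zip.GZIPOutputStream only writes the gzip trailer (the CRC-32 and uncompressed size) when finish() or close() is called, so forgetting it leaves a truncated stream.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipFinishDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bytes);
        gz.write("hello, gzip".getBytes("ISO-8859-1"));
        gz.finish(); // flushes the remaining deflate output and appends the trailer
        System.out.println(bytes.size() + " bytes, ending in a valid gzip trailer");
    }
}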
Problem number two seemed harder initially, because I needed to muck with the innards of JRuby threads to get it working. The problem concerns signal handling. In Ruby you can trap operating system signals and execute blocks of code when they happen. This is obviously very powerful and useful for lots of things. My first solution was to just save the block and call it when the signal occurred. This didn't play well with the rest of JRuby, and Tom got some other strange errors from it. So, after some thinking, I came to the realization that I had to create a new RubyThread and start it when the signal occurred, so that JRuby's internal thread scheduling would ensure that nothing untoward happens. This worked really well, but I just realized one drawback with it, so I'm going back in to fix it now: the current approach can only execute the block once. Ouch.
So, now I've implemented a new approach. This is much more involved, though. I had to get really deep into the internals of the thread architecture so that I could create a new JRuby thread from an existing RubyProc, Frame and Block object. So now I create a new RubyThread each time the signal fires and set the block information manually. This is needed so we can run the block multiple times.
Anyway, the real point of this post is the last problem: the speed of RDoc. RDoc does some fairly heavy lifting, and it isn't especially fast on C Ruby either:
$ time ruby d:/programming/ruby/bin/gem install rake
Attempting local installation of 'rake'
Successfully installed rake, version 0.7.1
Installing RDoc documentation for rake-0.7.1...
real 0m13.915s
user 0m0.000s
sys 0m0.000s
But now, JRuby isn't anywhere near this speed:
$ time bin/jruby.bat bin/gem install rake
Attempting local installation of 'rake'
Successfully installed rake, version 0.7.1
Installing RDoc documentation for rake-0.7.1...
real 1m42.985s
user 0m0.000s
sys 0m0.010s
Yes, it's about an order of magnitude slower. Yesterday I found a big reason for this. JRuby does a lot of hashing and spends a great deal of time in HashMap#get. So I was curious which hashCode algorithm was used internally. What I found was this:
public final int hashCode() {
    return RubyNumeric.fix2int(callMethod("hash"));
}
This actually calls the Ruby-level hash method every time someone wants a plain hashCode, which explained some of the bad performance. I added better hashCode implementations to common Ruby objects, like RubyString, RubyHash, RubyFixnum and others. After this small change I got this:
$ time bin/jruby.bat bin/gem install rake
Attempting local installation of 'rake'
Successfully installed rake, version 0.7.1
Installing RDoc documentation for rake-0.7.1...
real 0m59.946s
user 0m0.000s
sys 0m0.010s
That is almost double the speed for general JRuby performance, just by adding a few simple hashCode methods in different places. So this is the lesson from all this: if you have performance problems, look at your hashCode(); it may be easy to fix!
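As an illustration of what such an override can look like, here is a simplified sketch, not the actual JRuby source: a Fixnum-like value object computes its hash directly from its own state instead of dispatching back into the interpreter.

// Simplified sketch with hypothetical names, not the real RubyFixnum.
public class FixnumLike {
    private final long value;

    public FixnumLike(long value) { this.value = value; }

    public boolean equals(Object o) {
        return (o instanceof FixnumLike) && ((FixnumLike) o).value == value;
    }

    public int hashCode() {
        // Same folding trick as java.lang.Long: cheap and well distributed,
        // and no trip through the Ruby-level hash method.
        return (int) (value ^ (value >>> 32));
    }
}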
Thursday, June 15, 2006
Working towards 0.9
These last few days I've been focusing on fixing issues with RubyGems and Rails, so that JRuby 0.9 will be a really great release. It already is, of course, but anything we can do to make it better is worthwhile. These are the things I've done since Sunday.
StringIO
I finally took the time to rewrite all of StringIO in Java, and also to test properly that it works correctly. I managed to make the more common usage patterns between 8 and 10 times faster, which I feel is sufficient for now.
Signal and Kernel#trap
Ruby has close ties to C, which means you can trap low-level POSIX signals quite easily. This is hard to support in Java, though. One common thing most Ruby programs do is trap INT so they can exit gracefully; Rails and WEBrick both do this. The existing Kernel#trap and Signal were just stubbed out. I found a way to support signal handling on Sun JVMs at least, through the undocumented class sun.misc.Signal. JRuby checks whether this class is available; if so it uses it, otherwise trapping signals doesn't work. The implementation is really easy: I just grab the block provided and save it in a Runnable that will be executed when the trap happens.
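A minimal standalone sketch of the mechanism (not the actual JRuby code; sun.misc.Signal is undocumented and Sun-JVM-specific, so real code should probe for it rather than link against it directly):

import sun.misc.Signal;
import sun.misc.SignalHandler;

public class TrapSketch {
    public static void main(String[] args) throws Exception {
        // The saved "block": in JRuby this would wrap the block given to Kernel#trap.
        final Runnable block = new Runnable() {
            public void run() { System.out.println("trapped INT"); }
        };
        // Note: the signal name is given without the "SIG" prefix.
        Signal.handle(new Signal("INT"), new SignalHandler() {
            public void handle(Signal signal) {
                block.run();
            }
        });
        Thread.sleep(60000); // press Ctrl-C to trigger the handler
    }
}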
Zlib, IOInputStream and IOOutputStream
When working with Ruby IO-like objects from Java, it is often very convenient to wrap them in an InputStream or OutputStream. How to do this isn't totally obvious, though. My first implementation worked, but was intolerably slow. My last post, about plaincharset, tells the tale of how these things can go wrong. Suffice to say, I managed to get the streams working pretty well, and then started the real work: reimplementing Zlib::GzipReader and Zlib::GzipWriter in Java. This wasn't as hard as I thought, once I got the IO streams working as they should.
There was one bug that was really hard to find, though, and it was caused by a minor difference between Java's read(n) and Ruby's read(n) methods. In Java, if we ask to read n bytes, we don't necessarily get all the bytes we asked for, even if that many bytes are available; that is why those read methods return how many bytes were actually read. But Ruby's read(n) doesn't act this way: it reads n bytes, or to end of stream, whichever comes first. That one took a while to find.
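The standard remedy is a small loop. Here is a sketch (the helper name is mine) of how Ruby's read(n) semantics can be layered on top of Java's InputStream:

import java.io.IOException;
import java.io.InputStream;

public class ReadFully {
    // Read exactly len bytes unless the stream ends first - Ruby's
    // read(n) contract, built on Java's "may return fewer bytes" read.
    public static int readFully(InputStream in, byte[] buf, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, total, len - total);
            if (n == -1) break; // end of stream came first
            total += n;
        }
        return total; // equals len unless we hit end of stream
    }
}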
The reimplementation of Zlib isn't complete yet, but the important classes are done. The deflater and inflater classes already use a backing Java class for performance, and the checksum classes don't need that kind of speed. The performance improvement from reimplementing GzipReader in Java was great, though: it seems to be between 15 and 20 times faster, most often. RubyGems is really usable now.
What's left?
Now, these things are quite minor. They improve different parts of Ruby that are used quite often. RubyGems seems to work more or less perfectly. Rails generation works too. The server script for Rails almost works, but there is still something strange going on with WEBrick. I'm wondering if this has something to do with our socket code, which I'm not totally into yet. But these are the issues I'll be looking at before 0.9.
In the longer term, there are two main points that interest me. First, I want to complete JvYAML and integrate it with JRuby. And second, I'm thinking about ways to byte-compile parts of the AST (to some Ruby bytecode), much in the same way Charles has been toying with compiling parts of it to Java bytecode. These are mostly just ideas in my head for now, but that's probably something I'll write more about quite soon here.
Java Charsets
Today I've spent some time pounding java.lang.String to give me a byte array I can use as a data format. This turned out to be harder than I thought it should be. So, why can't I use getBytes() or getBytes("ascii") or getBytes("iso-8859-1") or getBytes("utf-8")? Those are fine for certain tasks, but I'm looking for a very specific translation from chars to bytes. The application I was working on is Zlib in Java, for JRuby. Since Ruby has the somewhat funny custom of using Strings as byte buffers, the output I get from a Ruby IO operation is a RubyString.
The reason I started trying different paths was that Zlib didn't work as it should. Not at all. I knew it worked when I did it one char at a time, because then I cast the char to an int instead (since InputStream#read() returns an int). So, I created this small program:
public class CharsetRoundTrip {
    public static void main(String[] args) {
        // Fill a buffer with every possible byte value, 0..255.
        final byte[] chrs = new byte[256];
        for (int i = 0, j = chrs.length; i < j; i++) {
            chrs[i] = (byte) i;
        }
        // Round-trip through String using the platform default charset.
        final String str = new String(chrs);
        final byte[] bts = str.getBytes();
        for (int i = 0, j = chrs.length; i < j; i++) {
            System.out.println("[" + i + "]= " + (int) chrs[i] + ", " + bts[i] + " ... should be: " + (byte) chrs[i]);
        }
    }
}
to see what happened here. Now, I won't bore you with the complete printout from this. But there are a few specific portions that I'd like to share:
[127]= 127, 127 ... should be: 127
and
[128]= -128, -128 ... should be: -128
[129]= -127, 63 ... should be: -127
[130]= -126, -126 ... should be: -126
[140]= -116, -116 ... should be: -116
[141]= -115, 63 ... should be: -115
[142]= -114, -114 ... should be: -114
[143]= -113, 63 ... should be: -113
[144]= -112, 63 ... should be: -112
[145]= -111, -111 ... should be: -111
[156]= -100, -100 ... should be: -100
[157]= -99, 63 ... should be: -99
[158]= -98, -98 ... should be: -98
Those 63 values keep showing up and destroying everything (63 is '?', the replacement character the encoder substitutes for characters it can't map). If I try another encoding in the getBytes method, it actually gets worse. I couldn't find any way to get this to write the expected output. So, I embarked on a quest: a quest to solve this small trouble, forever and always. The result is plaincharset, a small project consisting of 4 classes. Nothing spectacular, but if you add the jar file to your classpath you can now use the charset name "PLAIN" to get every byte correctly from getBytes and new String. If you have characters that are not within 0..255 I cannot guarantee anything at all. I hereby release the project into the public domain. The source can be found here, and if you just want the jar-file, download it here.
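In other words, usage looks something like this (a sketch, assuming the plaincharset jar is on the classpath so the provider gets discovered):

public class PlainUsage {
    public static void main(String[] args) throws Exception {
        final byte[] chrs = new byte[256];
        for (int i = 0; i < chrs.length; i++) {
            chrs[i] = (byte) i;
        }
        // Both directions go through the PLAIN charset: one char per byte.
        final String str = new String(chrs, "PLAIN");
        final byte[] back = str.getBytes("PLAIN");
        System.out.println(back[129]); // -127, not 63
    }
}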
So, what is the secret behind this marvel? In one word: NIO. The jar file contains a subclass of CharsetProvider, a subclass of Charset, one CharsetDecoder and one CharsetEncoder. The only classes with anything in them are the decoder and encoder, which get an input NIO buffer and an output NIO buffer. I just read from the input and write to the output, casting where necessary. There is also a service-provider file in the META-INF directory of the jar, which says to use com.ologix.charset.PlainCharsetProvider as a provider for charsets.
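The core of such an encoder is just a cast in a loop. A minimal sketch of the idea, with my own simplified names rather than the actual plaincharset source:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Sketch of the "PLAIN" encoder: every char becomes the byte with the
// same low 8 bits - no translation table, no replacement characters.
class PlainEncoderSketch extends CharsetEncoder {
    PlainEncoderSketch(Charset cs) {
        super(cs, 1.0f, 1.0f); // exactly one byte per char
    }

    protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
        while (in.hasRemaining()) {
            if (!out.hasRemaining()) {
                return CoderResult.OVERFLOW; // caller drains the output buffer
            }
            out.put((byte) in.get()); // the cast is the whole "encoding"
        }
        return CoderResult.UNDERFLOW; // all input consumed
    }
}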
And did this work for my Zlib implementation? I'm happy to say that it did. It works very well and is both shorter in code and much, much faster. I'm happy.
The architecture of a Meta Directory system
One of the major projects at Karolinska Institutet over the last 3 years is called KIMKAT. It has gotten the moniker of a meta directory system, but in many ways that's not entirely correct. Since I've been one of the lead architects and developers on this project, I wanted to write about the technical architecture of the system, which choices panned out well, and what I would change if I could.
The problem
Karolinska Institutet is a fair-sized university (quite big for Sweden). We have about 20,000 students and somewhere between 5,000 and 7,000 employees. Universities have a tendency to become decentralized, with local solutions for every problem, and KI is no exception. Information about people at KI exists in more than 10 disparate systems in the administration alone. This information costs a lot of money to keep fresh, and it's also very inefficient. The vision of KIMKAT is to have one central source for all data about persons, organizations and resources, where external systems can find up-to-date and current data. It should also be possible for these systems to contribute domain-specific data into KIMKAT. (For example, our phone directory system should probably be the source for phone numbers.)
Different parts of the solution
We have worked on this problem in a project that has oscillated between 10 and 20 project members over 2½ years. Of course, there are many ways to solve this kind of problem, and the main differentiation between them is how much of a hack you want it to be. KI specifically wanted to avoid yet another hack solution (other universities in Sweden have gone down this route, and it seems both costly and becomes unmaintainable in only a year or two), so the focus of the project group was to find and implement a solution that would hold for many years to come.
The first problem was that we had no definitive source for organization information, nor anything for our affiliates. All that information was stored ad hoc in the mail system, and on paper. So, we instigated a regime where each datum of information has one primary source that is responsible for maintaining and updating it. For employees this is our HR system (called Primula). For students, it's our student database, LADOK. For organizations we created a new primary source called KOrg, and for affiliates, KAff. We also created a database for all KIMKAT information that we couldn't write back to the other primary sources. This effectively became its own primary source.
Our meta directory solution is based around a meta engine: a scripted system that reads and collects data from our primary sources and writes it, in a unified data structure, to the central KIMKAT database, OmniaS. For this task we evaluated several different products, and also considered writing our own, but in the end we chose IBM Tivoli Directory Integrator, which is Java-based and very easy to use for simple situations. It uses BSF to allow scripting, which is almost always needed for more intricate solutions. Suffice to say, our final system contains lots of ITDI scripting.
The OmniaS database is surrounded by an EJB tier with Hibernate, and accessors for reading the data in a well-defined ontology implemented with JavaBeans. There are no Entity EJBs in KIMKAT. (Nor anywhere else at KI, as far as I know - and hope.)
Right now the primary user of the OmniaS information is our web interface, KKAWeb, which is used to update information in our primary sources. It's also used to establish new affiliations and organizations. There is some information that we deemed necessary for KKAWeb and other applications that would eventually display KIMKAT data in some way, but that wasn't really business data. Because of this we created an external data source called KDis for this data. KKAWeb and another application called KIKAT use this data extensively, but mostly for things like I18N and sorting.
Since our primary sources are of a very diverse nature, KKAWeb doesn't write to them directly. Instead, the updated value objects are sent to an update topic with JMS, and every primary source has one or more message-driven beans listening for just those messages that pertain to it. That bean then updates its database.
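A hedged sketch of one such listener, with hypothetical names (the real beans are standard EJB message-driven beans wired to the update topic):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.ObjectMessage;

// One listener per primary source: it picks out only the value objects
// that pertain to its own system and writes them to that database.
public class PrimulaUpdateListener implements MessageListener {
    public void onMessage(Message message) {
        try {
            Object valueObject = ((ObjectMessage) message).getObject();
            // e.g. only employee updates concern the HR system Primula;
            // apply the change to the underlying database here.
            System.out.println("applying update: " + valueObject);
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }
}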
The final piece of the puzzle is another topic, to which ITDI sends all updates as soon as something changes in one of the primary sources. This allows external systems to get hold of change events in whatever way they need. The change information is sent over JMS using S-expressions to represent the data, since not all consumers will be Java-based.
Lessons learned
The KIMKAT project is late, and the deadline has been pushed back several times. There are several reasons for this, but the main one is probably our inability to give good time estimates. There are a few architectural and design decisions that I would have made differently if we were to start over. The biggest one is the data source called KDis. That was a big mistake for several reasons. First of all, having joins between different databases is a pain, and having constraints between databases is also really cumbersome.
If we did this again, I would probably drop one of the tiers between the session EJBs and the OmniaS database. Right now, a lot of time goes to serializing and deserializing between different kinds of value objects.
And third, our updating service is a really good idea, actually, but I think the implementation could have been done in a better way. There is something nagging me about it, but I can't put my finger on it right now.
But all in all, our architecture has been really successful, and it feels like something that will stand the test of time. Now we just have to integrate all the other systems with it, so everyone gets the real benefits of this solution.
Monday, June 12, 2006
A YAML dumper in Java
Today I've begun work on porting the RbYAML dumper to Java. The work will be done in three separate layers: the emitter, the serializer and the representer. I've just decided on the structure for the emitter; the basic premise is really simple, since emitting is just low-level IO stuff.
The emitter will be a stack-based finite state machine, and it actually closely resembles the ParserImpl part of the loader. There are 17 states, and they are represented as anonymous implementations of a simple interface. If this were a Java 5 project, I would've implemented it with an enum, like this:
public enum EmitterState {
    StreamStart {
        public void expect(final EmitterEnvironment env) {
            env.expectStreamStart();
        }
    }; // ... and 16 more states ...

    // Each state supplies its own behavior.
    public abstract void expect(EmitterEnvironment env);
}
with anonymous Enum subclasses for each state. A very fine solution, since I can just fetch the next state and execute it directly, instead of doing something like this:
STATES[STREAM_START].expect();
Alas, that's not the way it will be.
One of the more interesting parts of this implementation will be finding a nice way to represent the options hash from Syck and RbYAML 0.2. With standard Ruby keyword arguments, the options hash is very practical to use, instead of having to supply 17 different arguments - or even worse, fact(17) different constructors to allow for default arguments. The idiomatic way in Ruby is to have a defaults Hash, merge it with the arguments provided, and look up the configuration options there. Obviously, I could do this with Maps, but it's very cumbersome and requires more than one line to provide an option; something I would like to avoid.
The solution to this dilemma came from a part of Joshua Bloch's Effective Java technical session at JavaOne. (I've written a little about it in my entry on JavaOne, day 2.) Anyway, the solution is to provide a YAMLOptions class with a build factory method. I'm not totally finished with the syntax yet, but I'm thinking that a toYaml call could look something like this:
YAML.toYaml(theObj, YAMLOptions.build().useVersion(true).useDouble(true).indent(4));
There are a few benefits to this syntax. The first is obvious: it takes a whole lot of options to span this over more than one line, and if you want that many, maybe you should save the options object as a constant somewhere. Another great benefit is type safety. This isn't really an issue in Ruby, since there are no casts, but in Java I like the guarantee that the indent option is an integer. I can keep the defaults inside YAMLOptions, and if I define an interface for YAMLOptions, someone can implement it while providing more options if needed. For example, right now I'm sure that JRubyYAMLOptions will be implemented, probably as a subclass of YAMLOptions, with some new options that only the JRuby-specific representers will use. Oh, how I would love to be able to specify that static method build() with an interface...
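To make the pattern concrete, here is a hedged sketch of what such a class might look like, with names and defaults guessed from the call above rather than taken from the actual JvYAML source:

public class YAMLOptions {
    private boolean useVersion = false; // assumed default
    private boolean useDouble = false;  // assumed default
    private int indent = 2;             // assumed default

    private YAMLOptions() {}

    // Factory method from Bloch's builder idiom.
    public static YAMLOptions build() { return new YAMLOptions(); }

    // Each setter returns this, so calls can be chained on one line.
    public YAMLOptions useVersion(boolean b) { this.useVersion = b; return this; }
    public YAMLOptions useDouble(boolean b)  { this.useDouble = b; return this; }
    public YAMLOptions indent(int n)         { this.indent = n; return this; }
}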
JvYAML, RbYAML and JRuby
There has been much activity on several fronts these last weeks, but the guiding light has been getting JRuby as good as possible. JRuby 0.9 will arrive sometime this week or next, and to that end I managed to rework the emitter in RbYAML and release version 0.2 a few days ago. This release will be part of JRuby 0.9. About 10 days ago I finished the parser implementation for JvYAML and released version 0.1 of it.
Today I've spent some time porting the emitter from RbYAML to Java too, but I haven't really gotten that far yet. I hope I can get it finished in time for JRuby 0.9, but it's no big catastrophe if not. The essential thing with YAML emitting is that it's correct, not that it's fast. But it would be really nice to get it included in this major release too. We'll have to see.
I realize I haven't really blogged about either RbYAML 0.2 or JvYAML, just announced them. But don't worry, that will arrive shortly.
As for now, the more interesting thing is JRuby. The 0.9 release will contain some truly spectacular things, and I'm prepared to say that JRuby is now usable in many commercial environments. We have RubyGems working correctly (albeit slowly, right now), we have the initial parts of Rails, we have the ActiveRecord JDBC connector, and we have massive speed improvements. We have also received official sanction from Matz to include C Ruby's core libraries in the JRuby distribution (which means we no longer have to hack fileutils.rb every time we check out a new copy of JRuby). The IO work from Evan Buswell means WEBrick will very soon work correctly in JRuby. All in all, it's a really major improvement, and I'm proud to be part of it. Stay tuned for more info.
A few hours with Emacs
Today I've spent some hours upgrading Emacs to version 22 and getting various stuff to work. It was an interesting experience, and I've found lots of neat tricks I didn't know about before. But first, Emacs 22! This is in no way a released version; it is still in beta and reportedly has some bugs left in the code base. That said, I've been programming and working in it during the day and haven't experienced any trouble at all.
The new version brings some major updates, especially in the department of internationalization; the support looks great, actually. Another cool feature, which Steve Yegge has blogged about (see link to the right), is the new support for embedding elisp inside the replacement in query-replace-regexp. That's really neat.
The first installation wasn't totally smooth; my JDEE environment didn't really want to play, but after a while of fiddling with its code, I noticed that a new minor version was available. I added this to my local installation and everything just worked. Even my own JDEE extensions continued working.
I've never really liked the completion in JDEE, but now I've finally written a small script to make it usable when I want it, at least:
(defun indent-or-complete-jde ()
  "Complete if point is at end of a line, otherwise indent line."
  (interactive)
  (if (looking-at "$")
      (jde-complete)
    (indent-for-tab-command)))

(define-key jde-mode-map [tab] 'indent-or-complete-jde)
This code does code completion if point is at the end of a line, which is the most common case when you need completion. It's important to have jde-global-classpath point to all available classes, though. (I've created a global prj.el with common options preset for JDEE, and I load this from my various projects' prj files.) Completion is still slow, especially the first time, but when you really need it, it's there. I toyed with getting dabbrev or pabbrev in there too, but that just doesn't seem right for Java.
Another nice feature I got working was integrating imenu with JDEE, so I can Shift-right-click in any Java buffer and get a nice menu of all class definitions with method and field hyperlinks. That's probably the only thing I'll ever use the mouse for in Emacs, and it's more like using it as a key anyway. The only annoying thing with imenu is that I want it to sort the classes alphabetically; I've set this with both imenu-sort-function and jde-imenu-sort, but neither seems to take effect.
What I haven't had time to test yet - and the thing that'll probably need some work - is my tramp paths for interactively working with files on other computers through Emacs and plink (which also enables me to use Slime for Lisp editing in-process on other servers). Right now this doesn't seem to work, but I read somewhere that plink support has been upgraded in Emacs 22.
But all in all I'm very satisfied. I recommend that any Emacs maven try out the new version, especially if you need to work in Cygwin or on the Mac. The support for these environments is now there.
Sunday, June 11, 2006
Announcing RbYAML version 0.2
Another major release, with most changes in the dumper:
http://rbyaml.rubyforge.org
or to download directly:
http://rubyforge.org/frs/?group_id=1658
Changes:
- Performance has been greatly improved
- Rewrote the representer to use a distributed representation model
- Much improved test cases
- Many bug fixes
Tuesday, June 06, 2006
Announcing JvYAML.
I am pleased to announce JvYAML, version 0.1. JvYAML is a Java YAML 1.1 loader that is both easy to extend and easy to use. JvYAML originated in the JRuby project (http://jruby.sourceforge.net), from the base of RbYAML (http://rbyaml.rubyforge.org). For a long time Java has lacked a good YAML loader and dumper with all the features that the SYCK-using scripting communities have gotten used to. JvYAML aims to rectify this.
Of major importance is that JvYAML works the same way as SYCK, so that JRuby can rely on YAML parsing and emitting that mirrors C Ruby.
JvYAML is a clean port of RbYAML, which was a port of Python code written by Kirill Simonov for PyYAML3000.
Simple usage:
import org.jvyaml.YAML;

Map configuration = (Map)YAML.load(new FileReader("c:/projects/ourSpecificConfig.yml"));
List values = (List)YAML.load("--- \n- A\n- b\n- c\n");

There is also support for more advanced loading of JavaBeans, with automatic setting of properties through the use of domain tags in the YAML document.
More information:
At java.net: http://jvyaml.dev.java.net
Download: https://jvyaml.dev.java.net/servlets/ProjectDocumentList
License:
JvYAML is distributed with the MIT license.
Saturday, June 03, 2006
Transforming RbYAML
RbYAML went through some big changes from release 0.0.2 to 0.1. My intention here is to detail some of these changes, which implementation choices I made, and why.
First, the conversion from mixins to classes. The original Python implementation used multiple inheritance: it created several base classes (Reader, Scanner, Parser, etc.) and then created several versions of a Loader class that inherited from the different base classes. My first implementation mirrored this approach, but used Modules instead of base classes and mixed different versions of these into the different Loader classes. This approach was quite limiting, since mixing code into other Modules doesn't really work as you expect, and is no substitute for subclassing. For example, I had a BaseResolver module and a SafeResolver module which mixed in BaseResolver and added code of its own, but this was quite cumbersome.
The solution was simply to convert all Modules to classes and make all calls to the other tiers explicit. For example, instead of having the Parser module just assume that you've mixed in a Scanner and call check_token on itself, the Parser class now takes a Scanner instance at initialization and calls check_token on that instance instead.
This works very well, and probably makes the code easier to understand. Another benefit is that the interfaces between the layers are more apparent. For inclusion in JRuby, this will make it easier to replace certain parts with Java implementations.
The next item on the agenda was a rewrite of the Parser. The original Python implementation used Python generators (which are almost like coroutines, but not quite). My first port of this code just parsed the whole stream, saved all the events, and passed them on after parsing. This was good enough for smaller YAML documents, but when trying to parse the RubyGems gemspec, the memory and time requirements became prohibitive. In the course of making the generator algorithm explicit, I totally rewrote the Parser from scratch, making it hybrid table-driven instead of recursive-descent, as the original was. I actually believe the new Parser is both easier to understand and faster. Just as an example, this is the code for block_sequence:
def block_sequence
  @parse_stack += [:block_sequence_end, :block_sequence_entry, :block_sequence_start]
  nil
end
where @parse_stack contains the next productions to call after block_sequence has finished. The main generator method just keeps calling the next production until it arrives at a terminal, and then returns the value of this:
def parse_stream_next
  if !@parse_stack.empty?
    while true
      meth = @parse_stack.pop
      val = send(meth)
      if !val.nil?
        return val
      end
    end
  else
    return nil
  end
end
Another benefit is that this code is dead simple to port to other languages - once again, probably easier than the Python version.
The third improvement was performance. I have no trustworthy numbers for the improvement, but it's on the order of 5-8 times faster than at the beginning. I achieved this with some easy fixes, and some harder ones. I removed the Reader class and inlined its methods into the Scanner. For each place where I tested whether a character was part of a String, I checked whether a Regexp was faster. And I added some hard-coded, unrolled loops in the most intense parts of the code: peek(), forward(), prefix() and update(). Every microsecond of improvement in these methods counted, since they are called so many times. I didn't do all this work blind, though. The Ruby profiler is really good: just take a script, run it with ruby -rprofile script.rb, and you get output that's incredibly useful. I tested most of my changes this way, and the end result is about as fast as the JRuby RACC-based YAML parser, which was my goal.
Since version 0.1 I've spent some time getting JRuby to work flawlessly with RubyGems, and this work has uncovered some small bugs in RbYAML (and in SYCK, for that matter), so a new minor release will probably come soon. Until then, the CVS is up to date.
Getting RubyGems to work with JRuby
I'm sorry if the title gives it away, but here is some recent output from my terminal window:
#bin/jruby bin/gem install rails --include-dependencies
Attempting local installation of 'rails'
Local gem file not found: rails*.gem
Attempting remote installation of 'rails'
Updating Gem source index for: http://gems.rubyforge.org
Successfully installed rails-1.1.2
Successfully installed activesupport-1.3.1
Successfully installed activerecord-1.14.2
Successfully installed actionpack-1.12.1
Successfully installed actionmailer-1.2.1
Successfully installed actionwebservice-1.1.2
Installing RDoc documentation for activesupport-1.3.1...
Installing RDoc documentation for activerecord-1.14.2...
Installing RDoc documentation for actionpack-1.12.1...
Installing RDoc documentation for actionmailer-1.2.1...
Installing RDoc documentation for actionwebservice-1.1.2...
So, we have RubyGems mostly working. Right now there are two caveats. First, during the YAML parsing we get some InterruptedExceptions for some reason. This doesn't seem to impair functionality, though. The second problem is that it takes serious time: between 30 minutes and an hour for this. The two time hogs are the YAML parsing of the gemspec and the RDoc stuff, for some reason.

So, what do you need to do to get this working?
- Start from a newly checked out JRuby.
- Add patch for RubyTime and TimeMetaClass. (Adds gmt_offset and utc_offset. This patch can be found in the jruby-devel archives.)
- Checkout the latest version of RbYAML from RubyForge, and put this in $JRUBY_HOME/lib/ruby/site_ruby/1.8.
- Add the contents from the C Ruby libraries.
- Change fileutils.rb, so that RUBY_PLATFORM works.
- Replace the file $JRUBY_HOME/src/builtin/yaml.rb with the yaml.rb for RbYAML, that can be found here.
- Change the jruby and jirb scripts by adding -Xmx512M. (I'm not sure 512 is really needed, actually. Maybe 256 or 128 would suffice.)