onsdag, oktober 31, 2007

An interesting memory leak in JRuby

The last two days I had lots of fun with the interesting task of finding a major memory leak in JRuby. The only way I could reliably reproduce it was by running Mingle's test suite and see memory being eaten. I tried several approaches, the first being using jhat to analyze the heap dumps. That didn't really help me much, since all the interesting queries I tried to run with OQL had a tendency to just cause out of memory errors. Not nice.

Next step was to install SAP Memory Analyzer which actually worked really well, even though it's built on top of Eclipse. After several false starts, including one where I thought we had found the memory leak I finally got somewhere. Actually, I did find a memory leak in our method cache implementation. But alas, after fixing that it was obvious there was another leak in there.

I finally got SAP to tell me that RubyClasses was being retained. But when I tried to find the root chain to see how that happened I couldn't see anything strange. In fact, what I saw what the normal chaining of frames, blocks, classes and other interesting parts. And this is really the problem when debugging this kind of problem in JRuby. Since a leak almost always be leaking several different objects, it can be hard to pinpoint the exact problem. In this case I guess that the problem was in a large branch that Bill merged a few weeks back, so I tried going back to it and checking. Alas, the branch was good. In fact, since I went back 200 revisions I finally knew within which range the problem had to be. Since I couldn't find anything more from the heap dumps I resorted to the venerable tradition of binary search. Namely going through the revisions and finding the faulty one. According to log2, I would find the bad revision in less than 8 tries, so I started out.

After I while I actually found the problem. Let me show it to you here:
def __jtrap(*args, &block)
sig = args.first
sig = SIGNALS[sig] if sig.kind_of?(Fixnum)
sig = sig.to_s.sub(/^SIG(.+)/,'\1')
signal_class = Java::sun.misc.Signal
signal_class.send :attr_accessor, :prev_handler
signal_object = signal_class.new(sig) rescue nil
return unless signal_object
signal_handler = Java::sun.misc.SignalHandler.impl do
begin
block.call
rescue Exception => e
Thread.main.raise(e) rescue nil
ensure
# re-register the handler
signal_class.handle(signal_object, signal_handler)
end
end
signal_object.prev_handler = signal_class.handle(signal_object, signal_handler)
end
This is part of our signal handling code. Interestingly enough, I was nonplussed. How could trap leak? I mean, noone actually calls trap enough times to make it leak, right?

Well, wrong. Actually, it seems that ActiveRecord traps abort in transactions and then restore the original handler. So each transaction created new trap handlers. That would have been fine, except for the last line. In effect, in the current signal handler we save a reference to the previous signal handler. After a few iterations we will have a long chain of signal handlers, all pointing back, all holding a hard reference from one of the single static root sets in the JVM (namely, the list of all signal handlers). That isn't so bad though. Except a saved block has references to dynamic scopes (which reference variables). It has a reference to the Frame, and the Frame has references to RubyClass. RubyClass has references to method objects, and method objects have in some cases references to RubyProcs, which in turn have more references to Blocks. At the end, we have a massive leak.

The solution? To simple remove saving of the previous handler and simplify the signal handler.

2 kommentarer:

Anonym sa...

It might be interesting to investigate Tom Copeland's memory leak problem, which may not occur on JRuby.

http://tomcopeland.blogs.com/juniordeveloper/2007/09/tracking-down-a.html

I couldn't get the to bottom of that. Ruby-Core:12284
There's a prev link in the garbage collector in the C version.

Hugh

Anonym sa...

"Caching implies policy"

http://blogs.msdn.com/ricom/archive/2007/06/25/caching-redux.aspx