Ola Bini: Programming Language Synchronicity: ruby


Thursday, August 14, 2008

Where is the Net::SSH bug

Yesterday I spent several hours trying to find the problem with our implementation of OpenSSL Cipher that caused the Net::SSH gem to fail miserably during negotiation and password verification. After various false leads I finally found the reason for the strange behavior. But I really can't decide if it's a bug, and if it is a bug, where the bug is. Is it in Ruby's interface to OpenSSL, or is it in Net::SSH?

No matter what cipher suite you use for SSH, you generally end up using a block cipher, mostly in a chaining mode such as CBC. That means an IV (initialization vector) is needed, together with a key. The relevant parts of OpenSSL used are the EVP_CipherInit, EVP_CipherUpdate and EVP_CipherFinal family of methods. Nothing really strange there. The Ruby interface matches these methods quite closely; every time you set a key, or an IV, or some other parameter, the CipherInit method is called with the relevant data. When CipherUpdate is called, the actual enciphering or deciphering starts happening, and CipherFinal takes care of the final block.

At the point EVP_CipherFinal is called, nothing more should be done using the specific Cipher context. Specifically, no more Update operations should be used. The man page has this to say about the Final-methods:

After this function is called the encryption operation is finished and no further calls to EVP_EncryptUpdate() should be made.

Now, what I found was that this same warning is not part of the documentation for the Ruby interface. And Net::SSH is actually reusing the same Cipher object after final has been called on it. Specifically, it continues the conversation, calling update a few times and then final. The general flow for a specific Cipher object in Net::SSH is basically init->update->update->final->update->update->final.

So what is so bad about this then? Well, the question is really this: what IV will the operations after the first final call be using? The assumption I made is that obviously it will use the original IV set on the object. Something else would seem absurd. But indeed, the IV used is actually the last IV-length bytes of encrypted data returned. Is this an obvious or intended effect at some level? Probably not, since the OpenSSL documentation says you shouldn't do it. The reason it works that way is because the temporary buffer used in the Cipher context isn't cleared out at the end of the call to final.

In contrast, the Java Cipher object will call reset() as part of the call to doFinal(), and reset() will actually reset the internal buffers to use the original IV. So the solution is simple for encryption. Just save away the last 8 or 16 bytes (one cipher block) of the generated ciphertext and set that manually as the IV after the call to doFinal. And what about decryption? Well, here the IV needs to be the last ciphertext block sent in for deciphering, not the result of the last operation.
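The carry-over of the IV can be reproduced without reusing a finished cipher object at all. The following is a minimal sketch, assuming AES-128-CBC with padding switched off and made-up key and IV values; the cbc helper is hypothetical and is not taken from Net::SSH or JRuby-OpenSSL. It shows that encrypting a second message with the last ciphertext block as its IV gives the same bytes as encrypting both messages in one continuous CBC stream.

```ruby
require 'openssl'

# Hypothetical helper: run whole blocks through a fresh AES-128-CBC cipher.
# mode is :encrypt or :decrypt; padding is disabled so block boundaries stay visible.
def cbc(mode, key, iv, data)
  cipher = OpenSSL::Cipher.new('aes-128-cbc')
  cipher.public_send(mode)
  cipher.key = key
  cipher.iv = iv
  cipher.padding = 0
  cipher.update(data) + cipher.final
end

key    = 'k' * 16 # illustrative values only, never use fixed keys or IVs
iv     = 'i' * 16
first  = 'a' * 16
second = 'b' * 16

# One continuous CBC stream over both messages.
whole = cbc(:encrypt, key, iv, first + second)

# Two separate operations: the second one starts from the last
# ciphertext block produced by the first one.
part1   = cbc(:encrypt, key, iv, first)
next_iv = part1[-16, 16]
part2   = cbc(:encrypt, key, next_iv, second)

puts whole == part1 + part2

# For decryption the IV is the last ciphertext block that was fed in.
puts cbc(:decrypt, key, next_iv, part2) == second
```

This is the bookkeeping described above: keep the last block of ciphertext around and hand it to the next operation as its IV, on both the encrypting and the decrypting side.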

So Net::SSH seems to work fine with JRuby now. I'm about to release a new version of JRuby-OpenSSL including these and many other things.

But the question remains. Is it a bug? If it is, is it in the Ruby OpenSSL integration, or in the Net::SSH usages of Ciphers? If it's in the Net::SSH code, why does it actually work correctly when communicating with an SSH server? Or is this behavior of using the last crypto text as IV something documented in the SSH spec?

Enlightenment would be welcome.

Friday, July 04, 2008

Java and mocking

I've just spent my first three days on a project in Leeds. It's a pretty common Java project, RESTful services and some MVC screens. We have been using Mockito for testing which is a first for me. My immediate impression is quite good. It's a nice tool and it allows some very clean testing of stuff that generally becomes quite messy. One of the things I like is how it uses generics and the static typing of Java to make it really easy to make mocks that are actually type checked; like this for example:

Iterator iter = mock(Iterator.class);
stub(iter.hasNext()).toReturn(false);

// Call stuff that starts interaction

verify(iter).hasNext();

These are generally the only things you need to stub stuff out and verify that it was called. The things you don't care about you don't verify. This is pretty good for being Java, but there are some problems with it too. One of the first things I noticed I don't like is that interactions that aren't verified can't be disallowed in an easy way. Optimally this would happen at the creation of the mock, instead of having to call verifyNoMoreInteractions() afterwards. It's way too easy to forget. Another problem that quite often comes up is that you want to mock out or stub some methods but retain the original behavior of others. This doesn't seem possible, and the alternative is to manually create a new subclass for this. Annoying.

Contrast this to testing the same interaction with Mocha, using JtestR. The difference isn't that large, but some of the cruft is gone:

iter = mock(Iterator)
iter.expects(:hasNext).returns(false)

# Call stuff that starts interaction

Ruby makes the checking of interactions happen automatically afterwards, and since you don't have any types, you don't need to care about most of the stuff you do in Java. This also shows a few of the inconsistencies in Mockito that are necessary because of the type system. For example, with the verify method you send the mock as argument and the return value of the verify method is what you call the actual method on, to verify that it's actually called. Verify is a generic method that returns the same type as the argument you give to it. But this doesn't work for the stub method. Since it needs to return a value that you can call toReturn on, it can't actually return the type of the mock, which in turn means that you need to call the method to stub before the actual stub call happens. This dichotomy gets me every time, since it's a core inconsistency in the way the library works.

Contrast that to how a Mockito-like library might look in Ruby for the same interaction:

iter = mock(Iterator)
stub(iter).hasNext.toReturn(false)

# Do stuff

verify(iter).hasNext

The lack of typing makes it possible to create a cleaner, more readable API. Of course, these interactions are all based on how the Java code looked. You could quite easily imagine a more free form DSL for mocking that is easier to read and write.
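To make the idea a bit more concrete, here is a rough sketch of how such an API could be backed in plain Ruby with method_missing. Everything in it (TinyMock, Stubber, Verifier and the three top-level helpers) is hypothetical and invented for illustration; it is not how Mocha, Mockito or JtestR work, and it ignores arguments, call counts and methods that Object already defines.

```ruby
# Hypothetical mock object: records every call and returns canned values.
class TinyMock
  attr_reader :calls, :stubs

  def initialize
    @calls = []
    @stubs = {}
  end

  def method_missing(name, *_args)
    @calls << name
    @stubs[name] # nil when nothing was stubbed
  end
end

# Returned by stub(mock): remembers the method name, then the value.
class Stubber
  def initialize(mock)
    @mock = mock
  end

  def method_missing(name, *_args)
    @name = name
    self
  end

  def toReturn(value)
    @mock.stubs[@name] = value
  end
end

# Returned by verify(mock): fails unless the named method was called.
class Verifier
  def initialize(mock)
    @mock = mock
  end

  def method_missing(name, *_args)
    raise "expected #{name} to have been called" unless @mock.calls.include?(name)
    true
  end
end

def mock(_name = nil)
  TinyMock.new
end

def stub(target)
  Stubber.new(target)
end

def verify(target)
  Verifier.new(target)
end

iter = mock(:iterator)
stub(iter).hasNext.toReturn(false)

result = iter.hasNext # the interaction under test

verify(iter).hasNext
```

Because nothing is typed, stub and verify can both take the mock itself and hand back a small recorder object, which is exactly the symmetry the Java version can't offer.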

Conclusion? Mockito is nice, but Ruby mocking is definitely nicer. I'm wondering why the current mocking approaches don't use the method call way of defining expectations and stubs though, since these are much easier to work with in Ruby.

Also, it was kinda annoying to upgrade from Mockito 1.3 to 1.4 and see half our tests starting to fail for unknown reasons. Upgrade cancelled.

Tuesday, June 17, 2008

Testing Regular Expressions

Something has been worrying me a bit lately. Being test infected and all, and working for ThoughtWorks, where testing is part of the life blood, I think more and more about these issues. And one thing I've started noticing is that regular expressions seem to be a total blind spot in many cases. I first started thinking about it when I changed a quite complicated regular expression in RSpec. Now RSpec has coverage tests as part of their build, and if the test coverage is less than 100%, the build will fail. Now, since I had changed something to add new functionality, but hadn't added any tests for it, I instinctively assumed that it would be caught by the coverage tool.

Guess what? It wasn't. Of course, if I had changed the regexp to do something that the surrounding code couldn't support, one of the tests for surrounding lines of code would have caught it, but I got no mention from the coverage tool that I needed more tests to fully handle the regular expressions. This is logical if you think about it. There is no way that a coverage tool could find all the regular expressions in your source code, and then make sure that all branches and alternatives of each particular regular expression were exercised. So that means that the coverage tool doesn't do anything with them at all.

OK, I can live with that, but it's still one of those points that would be very good to keep in mind. Every time you write a regular expression in your code, you need to take special care to actually exercise that part of the code with many inputs. What is many in this case? That's another part of the problem - it depends on the regular expression. It depends on how complicated it is, how long it is, how many special operators are used, and so on. There is no real way around it. To test a regular expression, you really need to understand how they work. The corollary is obvious - to use a regular expression in your code, you need to know how to test it. Conclusion - you need to understand regular expressions.

In many code bases I haven't seen any tests for regular expressions at all. In most cases these have been crafted by writing them outside the code, testing them by hand, and then putting them in the code. This is brittle to say the least. In the cases where there are tests, it's much more common that they only test positives, and not negatives. And I've seldom heard of code bases with enough tests for regular expressions. One of the problems is that in a language like Ruby, they are so easy to use, so you stick them in all over the place. A standard refactoring could help here, by extracting all literal regular expressions to constants. But then the problem becomes another - as soon as you use regular expressions to extract values from a string, it's a pain to not have the regular expression at the same place as the extracted groups are used. Example:

PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/
# 200 lines of code
if phone_number =~ PhoneRegexp
  puts "phone number is: #$1-#$2-#$3"
end

If the regular expression had been at the same place as the usage of the $1, $2 and $3 it would have been easy to tie them to the parts of the string. In this case it would be easy anyway, but with more groups and more distance it quickly gets hard to follow. The solution to this is easy - the dollar numbers are evil: don't use them. Instead use an idiom like this:

area, number, extension = PhoneRegexp.match(phone_number).captures

In Ruby 1.9 you will be able to use named captures, and that will make it even easier to make readable usage of the extracted parts of a string. But fact is, the distance between the usage point and the definition point can still cause trouble. A way of getting around this would be to take any complicated regular expression and put it inside of a specific class for only that purpose. The class would then encapsulate the usage, and would also allow you to test the regular expression more or less in isolation. In the example above, maybe creating a PhoneNumberParser would be a good idea.

At the end of the day, regular expressions are an extremely complicated feature, and in general we don't test the usage of them enough. So you should start. Begin by first creating both positive and negative tests for them. Figure out the boundaries, and see where they can go wrong. Know regular expressions well enough to know what happens in these strange circumstances. Think about Unicode characters. Think about whitespace. Think about greedy and lazy matching. As an example of something that took a long time to cause trouble: what's wrong with this regexp that tries to discern if a string is a select statement or not?
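As a sketch of what such a class could look like, assuming named captures are available and that the whole input should be a phone number (the anchors are my addition, not part of the PhoneRegexp above):

```ruby
# Hypothetical parser that keeps the pattern and the use of its groups together.
class PhoneNumberParser
  # \A and \z are an assumption: the whole string must be the phone number.
  PATTERN = /\A(?<area>\d{3})-?(?<number>\d{4})-?(?<extension>\d{4})\z/

  # Returns a hash of the named parts, or nil when the text doesn't match.
  def self.parse(text)
    match = PATTERN.match(text)
    return nil unless match

    { area: match[:area], number: match[:number], extension: match[:extension] }
  end
end

# Positive cases, with and without the optional dashes.
p PhoneNumberParser.parse("200-1234-5678")
p PhoneNumberParser.parse("20012345678")

# Negative cases deserve tests just as much as the positive ones.
p PhoneNumberParser.parse("200-12-5678")
p PhoneNumberParser.parse("x200-1234-5678")
```

With the pattern hidden behind parse, the positive and negative cases can be written as ordinary unit tests against one small class instead of being spread through the code that happens to use it.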

/^\s*\(*\s*SELECT\W+/i

And this example actually covers most of the ground already. It matches case-insensitively. It allows white space before any optional parentheses, and white space after them. It makes sure that the word SELECT isn't continued by checking for at least one non-word character. So what's wrong with it? Well... It's the caret. Imagine if we had a string like this:

"INSERT INTO foo(a,b,c)\nSELECT * FROM bar"

The regular expression will in fact match this, even though it's not a select statement. Why? Well, it just so happens that in Ruby the caret matches the beginning of lines, not the beginning of strings. The dollar sign works the same way, matching the end of lines. How do you solve it? Change the caret to \A and the dollar sign to \Z and it will work as expected (in Ruby, \Z still allows one trailing newline; use \z if the match must end exactly at the end of the string). A similar problem can show up with the "." to match any character. Depending on which language you are using, the dot might or might not match a newline. Always make sure you know which one you want, and what you don't want.

Finally, these are just some thoughts I had while writing it. There is much more advice to give, but it can be condensed to this: understand regular expressions, and test them. The dot isn't as simple as it seems. Regular expressions are a full blown language, even though it's not Turing complete (in most implementations). That means that you can't test it completely, in the general case. This doesn't mean you shouldn't try to cover all eventualities.
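The difference is easy to pin down with a couple of checks. This is a small sketch using the regexp and the string from above; the variable names are made up, and Regexp#match? assumes a reasonably recent Ruby.

```ruby
sql = "INSERT INTO foo(a,b,c)\nSELECT * FROM bar"

line_anchored   = /^\s*\(*\s*SELECT\W+/i  # caret: start of any line
string_anchored = /\A\s*\(*\s*SELECT\W+/i # \A: start of the string only

# The caret version is fooled by the second line of the INSERT statement.
puts line_anchored.match?(sql)   # prints true
puts string_anchored.match?(sql) # prints false

# A real select statement still matches the stricter version.
puts string_anchored.match?("  (SELECT * FROM bar)") # prints true

# \Z tolerates a single trailing newline, \z does not.
puts "abc\n".match?(/abc\Z/) # prints true
puts "abc\n".match?(/abc\z/) # prints false
```

A multi-line negative case like the INSERT string is exactly the kind of test that tends to be missing when only positives are checked.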

How are you testing your regular expressions? How much?

Tuesday, June 10, 2008

Ruby can't be good since I won't bother learning it...

Best quote this whole day, found in http://www.codinghorror.com/blog/archives/001131.html#comments.

If Ruby offered something new I would have learned it fine tbh... its just difficult enough to not be able to "pick up and run with" like almost everything else out there... but honestly, it wouldn't let me do anything I can't already do.

My brain almost exploded reading that.

Tuesday, June 03, 2008

RailsConf 2008

I've landed and gotten mostly back in the right timezone without too many incidents (except running through SFO to board a very badly scheduled connection).

After allowing the impressions from the last 6-7 days to sink in a little, it's time to summarize RailsConf. I'll go through the sessions I saw and then do some concluding remarks.

The first day was tutorials. I had a good time in Neal Ford's and Pat Farley's tutorial on Metaprogramming. I can't say I learned much from the sessions, but it was very good content, extremely well presented, and I got the impression that many in the room learned lots of crucial things. The kind of knowledge about internals you get from a talk like this allows you to understand how metaprogramming in Ruby actually works, which makes it easier to achieve the effects you want.

After that I sat around hacking in the Community Code Drive for the rest of the day, with lots of other people. I wasn't involved in gitjour (which by the way is incredibly cool), but I did manage to find a memory leak in iTerm's Bonjour handling due to gitjour. Neat. David Chelimsky and I paired on getting support for multiline plain text story arguments into RSpec, and by the end of the afternoon it was in.

Finally, we headed out to the JRuby hackfest, which ended up being overfull with people. That's a good problem to have. We had a great time, hacking on different things, helping people to get started and debugging various problems. All in all it was a very productive day.

I began the Friday with Joel Spolsky's keynote. In contrast to many other people I didn't like it. There wasn't really any content at all, just some humor and lots of jokes about naked women. I expect something a bit more profound for the first keynote of the conference, since they have a tendency to actually set the standard for the rest of the days.

After the keynote, John Lam showed off IronRuby running a few simple Rails requests. This is a great achievement, and I'm very impressed with their results. I have argued that IronRuby would probably never reach this point, and I'm very happy to admit I was wrong and offer my apologies to John Lam and the IronRuby team. That said, the fact that IronRuby runs a few different Rails requests is not the same thing as saying that IronRuby runs Rails. My personal definition of running Rails is more about having the Rails test suite run at a high percentage of success (something like 96-98% would be good enough for almost all Rails apps to work, provided they are the right 98%). (ED: Evan Phoenix just told me that MRI doesn't run the Rails test suite totally clean either, because of the way the Rails development process works. So 100% is probably not a good measure of Rails compatibility.) I assume that this is going to be the next goal for the IronRuby team, and I wish them good luck.

I saw the Hosting talk after that, but I have to admit I was wrapped up in a seriously annoying JRuby bug at the moment so I didn't really pay attention.

The DataMapper talk was very full and gave a good overview of why DataMapper might be a better choice than AR in many cases. The presentation style could possibly have been a bit less dry, but the content was definitely delicious.

If the next two days were the JRuby days, the Friday was the day for all other alternative implementations. I sat in on the Rubinius talk by Evan Phoenix and friends, and then the much talked about MagLev presentation.

I first want to congratulate Rubinius on running several different Rails requests. It's very cool and a great milestone. The same caveats as for IronRuby apply, of course. But wow, the debugging features are awesome. First class meta objects are extremely powerful, and will provide many capabilities to the platform. The presentation was also extremely entertaining. One of the best presentations for the sheer fun everyone seemed to have. Props to Evan, Brian and Wilson for this.

So. The MagLev talk. First, there seem to be some misunderstandings about what MagLev actually is. It is not a hosting service. Gemstone might offer a hosting service around MagLev in the future, but that's not what is going on here. MagLev is a new virtual machine for Ruby, based on Gemstone/S. Basing it on a Smalltalk machine makes it very easy for Gemstone to implement a large subset of Ruby and have it run cleanly and with good performance. Exactly how much has been implemented at this point is not really clear, since no major applications run, and the RubySpecs have not been used on it yet. I assume that the implementation doesn't handle enough Ruby features yet to be able to run the mspec runner and other important machinery.

Was this presentation important? Yeah, sure. To a degree. It was a cool presentation, whetting people's appetites by showing something that might some day become a real Ruby platform with built in support for an incredible OODB. But it's still early days.

The Saturday began with Jeremy's keynote. He talked about the new things in Rails 2.1 and also showed the same app running in Ruby 1.8, 1.9, Rubinius and JRuby. Very cool.

I ended up in Nathaniel Talbott's 23 Hacks session, which was fun. Good stuff.

After that the JRuby day began in earnest with Nick's talk about deploying JRuby on Rails. This was mostly the same talk as given at JavaOne, but more geared towards Ruby programmers. Useful information.

Dan Manges and Zak Tamsen gave an extremely useful talk about how to test Rails applications correctly. Very good material. Exactly the strong kind of deep technical knowledge, gained by experience, that people go to conferences to get.

My talk about JRuby on Rails was generally well received. I had a fun time, and of course I managed to run out of time as usual. I wonder why I'm always afraid of running out of material. That has never happened when I'm talking about JRuby.

The final technical session of the day ended up being a walk-around to all the different presentations going on and taking a peek, and then ending up hacking in the speakers room.

The evening keynote was by Kent Beck, and as usual he is fantastic to listen to.

The Sunday started with the CS nerds anonymous session, held by Evan Phoenix. It ended up being a kind of lightning talk session, and had some nice points.

After that Ezra gave his talk - which had nothing to do with the session title. He presented Vertebra, which is a cloud computing control system, based on XMPP, Erlang and the actor model. Very cool stuff, although it might not be that useful for people who aren't in charge of a quite large number of computers. But if you have your own botnet, this might be the best way to control them all. =)

The final session of the day was the JRuby Q&A session, which basically flew by. The first ten minutes went in normal time, and then suddenly the session was over. I think we had good attendance, and the right level of questions. You can see all the points covered on Nick's blog.

And then it was over.

So, what was good? The technical level was definitely deeper and more rooted in experience. I have to say that this was probably the best Ruby conference I've been to, based on the depth and level of the presentations. Kudos to the scheduling people.

And what was bad? A little bit too much hype about MagLev, and everyone's tendency to use dark colors on black backgrounds in their presentations. Hey, they look good on your computer screen, but it's really not readable!

Thursday, May 29, 2008

Ruby doesn't have meta classes

OK. It's time to get rid of this terminology problem. Ruby does NOT have meta classes. You can define them yourself, but it's not the same thing as what is commonly called the meta class. That is more correctly called the eigen class. The singleton class is also better than meta class, but eigen class is definitely the most correct term.

So what is a meta class then? Well, it's a class that defines the behavior of other classes. You can define meta classes in Ruby if you want to by defining a subclass of Class. Those classes would be metaclasses.

Edit: Of course, if you actually try to define a subclass of Class you will find that Ruby doesn't allow you to do that, which means that you don't have any meta classes in Ruby. Period.
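Both halves of this are easy to check from a script. The sketch below uses a made-up greeter object and assumes a Ruby new enough to have Object#singleton_class (1.9.2 or later).

```ruby
# Trying to subclass Class is rejected by the runtime.
error = begin
  Class.new(Class)
  nil
rescue TypeError => e
  e
end
puts error.class

# What is often called the "metaclass" of an object is its singleton
# (eigen) class: the per-object class that holds its singleton methods.
greeter = Object.new
def greeter.hello
  "hello"
end

p greeter.singleton_class.instance_methods(false)
```

The singleton class is an ordinary instance of Class that belongs to one object, which is a different thing from a class whose instances are themselves classes.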

Ruby closures addendum - yield

This should probably have been part of one of my posts on closures or defining methods, but I'm just going to write this separately, because it's a very common mistake.

So, say that I want to have a class, and I want to send a block when creating the instance of this class, and then be able to invoke that block later, by calling a method. The idiomatic way of doing this would be to use the ampersand and save away the block in an instance variable. But maybe you don't want to do this for some reason. An alternative would be to create a new singleton method that yields to the block. A first implementation might look like this:

class DoSomething
  def initialize
    def self.call
      yield
    end
  end
end

d = DoSomething.new do
  puts "hello world"
end

d.call
d.call

But this code will not work. Why not? Because as I mentioned in my post about defining methods, "def" will never create a closure. Why is this important? Well, because the current block is actually part of the closure. The yield keyword will use the current frame's block, and if the method is not defined as a closure, the block invoked by the yield keyword will actually be the block sent to the "call" method. Since we don't provide a block to "call", things will fail.

To fix this is quite simple. Use define_method instead:

class DoSomething
 def initialize
   (class << self; self; end).send :define_method, :call do
     yield
   end
 end
end

d = DoSomething.new do
 puts "hello world"
end

d.call
d.call

As usual with define_method we need to open the singleton class, and use send. This will work, since the block sent to define_method is a real closure, that closes over the block sent to initialize.
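For comparison, the idiomatic version mentioned at the start (capture the block with the ampersand and keep it in an instance variable) could look like the sketch below. The count variable is only there to make the effect of the two calls observable.

```ruby
class DoSomething
  # &block turns the block given to new into a Proc we can store.
  def initialize(&block)
    @block = block
  end

  def call
    @block.call
  end
end

count = 0
d = DoSomething.new do
  count += 1
end

d.call
d.call
puts count # prints 2
```

Here the stored Proc is the closure, so no singleton method and no yield is needed at all.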

Thursday, May 15, 2008

Dynamically created methods in Ruby

There seems to be some confusion with regards to dynamically defining methods in Ruby. I thought I'd take a look at the three available methods for doing this and just quickly note why you'd use one method in favor of another.

Let's begin with a quick enumeration of the available ways of defining a method after the fact:

Using a def
Using define_method
Using def inside of an eval

There are several things to consider when you dynamically define a method in Ruby. Most importantly you need to consider performance, memory leaks and lexical closure. So, the first, and simplest way of defining a method after the fact is def. You can do a def basically anywhere, but it needs to be qualified if you're not immediately in the context of a module-like object. So say that you want to create a method that returns a lazily initialized value, you can do it like this:

class Obj
  def something
    puts "calling simple"
    @abc = 3*42
    def something
      puts "calling memoized"
      @abc
    end
    something
  end
end

o = Obj.new
o.something
o.something
o.something

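As you can see, we can use the def keyword inside of any context. Something that bites most Ruby programmers at least once - and more than once if they used to be Scheme programmers - is that the second def of "something" will not do a lexically scoped definition inside the scope of the first "something" method. Instead it will define a new "something" method on the class that the enclosing method belongs to, in this case Obj, replacing the original definition. This means that in the example of the local variable "o", the first call to "something" will first calculate the value and then redefine "something", so that later calls go straight to the memoized version. Keep in mind that the replacement applies to every instance of Obj, including ones that never ran the first version and so never set @abc; qualify the inner definition as def self.something if you want it on the metaclass of the current object only. Used with care, this pattern can be highly useful.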
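Where a nested def ends up is easy to get wrong, so it is worth checking directly. The following sketch uses a hypothetical Counter class; as far as I know, current Ruby versions put the inner method on the enclosing class rather than on the singleton class of the receiver.

```ruby
class Counter
  def value
    @value = 3 * 42
    # Nested def: not a closure, and only executed when the outer method runs.
    def value
      @value
    end
    value
  end
end

a = Counter.new
b = Counter.new

first     = a.value                 # runs the outer version, then the new one
owner     = a.method(:value).owner  # where does "value" live now?
singleton = a.singleton_methods     # anything defined on a alone?
other     = b.value                 # b never ran the outer version

p first, owner, singleton, other
```

Since b gets the replaced method without ever having set @value, a memoizing redefinition like this is only safe when the cached state lives somewhere every instance can reach, or when the inner def is written as def self.value.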

Another variation is quite common. In this case you define a new method on a specific object, without that object being the self. The syntax is simple:

def o.something
  puts "singleton method"
end

This is deceptively simple, but also powerful. It will define a new method on the metaclass of the "o" local variable, constant, or result of method call. You can use the same syntax for defining class methods:

def String.something
  puts "also singleton method"
end

And in fact, this does exactly the same thing, since String is an instance of the Class class, this will define a method "something" on the metaclass of the String object. There are two other idioms you will see. The first one:

class << o
  def something
    puts "another singleton method"
  end
end

does exactly the same thing as

def o.something
  puts "another singleton method"
end

This idiom is generally preferred in two cases - first, when defining on the metaclass of self. In this case, using this syntax makes what is happening much more explicit. The other common usage of this idiom is when you're defining more than one singleton method. In that case this syntax provides a nice grouping.

The final way of defining methods with def is using module_eval. The main difference here is that module_eval allows you to define new instance methods for a module-like object:

String.module_eval do
  def something
    puts "instance method something"
  end
end

"foo".something

This syntax is more or less equivalent to using the module or class keyword, but the difference is that you can send in a block which gives you some more flexibility. For example, say that you want to define the same method on three different classes. The idiomatic way of doing it would be to define a new module and include that in all the classes. But another alternative would be doing it like this:

block = proc do
  def something
    puts "Shared something definition"
  end
end

String.module_eval(&block)
Hash.module_eval(&block)
Binding.module_eval(&block)

The method class_eval is an alias for module_eval - it does exactly the same thing.
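For comparison, here are both alternatives side by side on throwaway classes. All names (SharedSomething, Invoice, Receipt, Ticket) are made up for the sketch, so nothing in the core classes is touched.

```ruby
# The idiomatic way: put the method in a module and include it.
module SharedSomething
  def something
    "shared something"
  end
end

class Invoice
  include SharedSomething
end

class Receipt
  include SharedSomething
end

# The module_eval way: evaluate the same block in each class.
definition = proc do
  def something_else
    "shared via module_eval"
  end
end

class Ticket; end
Ticket.module_eval(&definition)

puts Invoice.new.something
puts Receipt.new.something
puts Ticket.new.something_else
p Invoice.ancestors.include?(SharedSomething)
```

One practical difference is that the module shows up in ancestors, so the shared behavior stays visible and can be changed in one place, while module_eval copies a separate definition into each class.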

OK, so now you know when the def method can be used. Some important notes to remember about it are these: def does _not_ use any enclosing scope. The method defined by def will not be a lexical closure, which means that you can only use instance variables from the enclosing running environment, and even those will be the instance variables of the object executing the method, not the object defining the method. My main rule is this: use def whenever you can. If you don't need lexical closures or a dynamically defined name, def should be your default option. The reason: performance. All the other versions are much harder - and in some cases impossible - for the runtimes to improve. In JRuby, using def instead of define_method will give you a large performance boost. The difference isn't that large with MRI, but that is because MRI doesn't really optimize the performance of general def either, so you get bad performance for both.

Use def unless you can't.

The next version is define_method. It's just a regular method that takes a block that defines the implementation of the method. There are some drawbacks to using define_method - the largest is probably that the defined method can't use blocks, although this is fixed in 1.9. Define_method gives you two important benefits, though. You can use a name that you only know at runtime, and since the method definition is a block this means that it's a closure. That means you can do something like this:

class Obj
  def something
    puts "calling simple"
    abc = 3*42
    (class << self; self; end).send :define_method, :something do
      puts "calling memoized"
      abc
    end
    something
  end
end

o = Obj.new
o.something
o.something
o.something

OK, let this code sample sink in for a while. It's actually several things rolled into one. They are all necessary though. First, note that abc is no longer an instance variable. It's instead a local variable to the first "something" method. Secondly, the funky looking thing (class <<self; self; end) is the easiest way to get the metaclass of the current object. Since define_method is a method on Module and has no self-qualified form the way def does, it defines the method on whatever module you call it on. To put the new method on this one object you need to call it on the metaclass, so the syntax to get the metaclass is necessary. Third, define_method happens to be a private method on Module (it has been public since Ruby 2.5), so we need to use send to get around this. But wait, why don't we just open up the metaclass and call define_method inside of that? Like this:

class Obj
  def something
    puts "calling simple"
    abc = 3*42
    class << self
      define_method :something do
        puts "calling memoized"
        abc
      end
    end
    something
  end
end

o = Obj.new
o.something
o.something
o.something

Well, it's a good thought. The problem is that it won't work. See, there are a few keywords in Ruby that kill lexical closure. The class, module and def keywords are the most obvious ones. So, the reference to abc inside of the define_method block will actually not be a lexical closure to the abc defined outside, but will instead cause a runtime error (a NameError) since there is no such local variable in scope. This means that using define_method in this way is a bit cumbersome in places, but there are situations where you really need it.
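On Ruby 1.9.2 and later there is a shorter route that avoids both the metaclass dance and the class keyword: Object#define_singleton_method takes a block, so the local variable stays in scope. A minimal sketch, with the puts calls from the example left out:

```ruby
class Obj
  def something
    abc = 3 * 42
    # The block is a real closure, so abc is still visible later on.
    define_singleton_method(:something) { abc }
    something
  end
end

o = Obj.new
p o.something
p o.something
p o.singleton_methods       # only this object got the memoized version
p Obj.new.singleton_methods # other instances are untouched
```

The performance and memory caveats of define_method still apply here, since the method body is a block that closes over its surroundings.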

The second feature of define_method is less interesting - it allows you to have any name for the method you define, including something random you come up with at runtime. This can be useful too, of course.

Let's summarize. The method define_method is a private method so it's a bit problematic to call, but it allows you to define methods that are real closures, thus providing some needed functionality. You can use whatever name you want for the method, but this shouldn't be the deciding reason to use it.

There are two problems with define_method. The first one is performance. It's extremely hard to generally optimize the performance of invocation of a define_method method. Specifically, define_method invocations will usually be a bit slower than activating a block, since define_method also needs to change the self for the block in question. Since it's a closure it is harder to optimize for other reasons too, namely we can never be exactly sure about what local variables are referred to inside of the block. We can of course guess and hope and do optimistic improvements based on that, but you can never get define_method invocations as fast as invoking a regular Ruby method.

Since the block sent to define_method is a closure, it means it might be a potential memory leak, as I documented in an older blog post. It's important to note that most Ruby implementations keep around the original self of the block definition, as well as the lexical context, even though the original self is never accessible inside the block, and thus shouldn't be part of the closed environment. Basically, this means that methods defined with define_method could potentially leak much more than you'd expect.

The final way of defining a method dynamically in Ruby is using def or define_method inside of an eval. There are actually interesting reasons for doing both. In the first case, doing a def inside of an eval allows you to dynamically determine the name of the method, it allows you to insert any code before or after the actual functioning code, and most importantly, defining a method with def inside of eval will usually have all the same performance characteristics as a regular def method. This applies to invocation of the method, not definition of it. Obviously eval is slower than just using def directly. The reason that def inside of an eval can be made fast is that at runtime it will be represented in exactly the same way as a regular def-method. There is no real difference as far as the Ruby runtime sees it. In fact, if you want to, you can model the whole Ruby file as running inside of an eval. Not much difference there. In particular, JRuby will JIT compile the method if it's defined like that. And actually, this is exactly how Rails handles potentially slow code that needs to be dynamically defined. Take a look at the rendering of compiled views in ActionPack, or the route recognition. Both of these places use this trick, for good reasons.
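As a rough illustration of def inside of an eval, here is a sketch that generates method names at runtime. The Greeter class and the say_* methods are invented for the example, and I'm using class_eval with a string, which is one of several eval variants that would do the job.

```ruby
class Greeter
  %w[hello goodbye].each do |word|
    # __LINE__ + 1 makes backtraces point at the first line of the heredoc.
    class_eval(<<-RUBY, __FILE__, __LINE__ + 1)
      def say_#{word}(name)
        "#{word}, " + name
      end
    RUBY
  end
end

puts Greeter.new.say_hello("Ola")
puts Greeter.new.say_goodbye("Ola")
```

The interpolation happens once, when the string is built, so each generated method is an ordinary def with the word baked in as a literal.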

The other one I haven't actually seen, and to be fair I just made it up. =) That's using define_method inside of an eval. The one thing you would gain from doing such a thing is that you have perfect control over the closure inside of the method defined. That means you could do something like this:

class BinderCreator
  def get
    abc = 123
    binding
  end
end

eval(<<EV, BinderCreator.new.get)
  Object.send :define_method, :something do
    abc
  end
EV

In this code we create a new method "something" on Object. This method is actually a closure, but it's an extremely controlled closure since we create a specific binding where we want it, and then use that binding as the context in which the define_method runs. That means we can return the value of abc from inside of the block. This solution will have the same performance problems as regular define_method methods, but it will let you control how much you close over at least.

So what's the lesson? Defining methods can be complicated in Ruby, and you absolutely need to know when to use which one of these variations. Try to avoid define_method unless you absolutely have to, and remember that def is available in more places than you might think.

Saturday, May 03, 2008

The Twitter Conspiracy

First, let me warn you. This is a conspiracy theory. It's got all the usual logical fallacies and problems of a conspiracy theory. Add to that the fact that it's almost guaranteed not to be true, and you might ask why I'm writing about it. Well, the thing is, if I were in charge of Twitter, I would do something like this. Of course, I'm not in charge, and the people in charge are probably saner and better adjusted than me. But still, who knows?

What is the idea then? Well, it's actually really simple. Think about the Twitter network, the kind of people who connect there, and the way things spread. What is the difference between Twitter and mobile texting for example? First, everything is by default multicast. It's not reciprocal - you don't need to know how many hundreds or thousands read what you write. And you don't care how many others read the persons you read. You are restricted in length. And, the whole thing is open enough that you can follow all the tweets going on in the system.

The characteristics I've described mean that Twitter is more or less the ideal memetic engine. What I mean by this is that it's a wonderful way to spread your ideas, if you can express them in a concise and readable way. This means that certain memes don't work well in this setting, but most do. And you can convince more people to join, because if your tweets are interesting enough, someone will notice them in the all-tweet. Also, you can see who the people you are interested in follow, which means that you can spread your network selectively, but really quickly.

These are not really part of the theory. They are just the axioms. So what's the theory then? Well, what is Twitter doing with all this data? If I were them, I would use it to do research on memetic spread and viral marketing. I would use it to try out ideas and measure how good their uptake is. Finding this information is not really hard when you have control over all the messages passing through. In fact, you could actually do it even outside of Twitter, by using the published tools correctly.

What got me thinking along these lines? Well, the whole TechCrunch debacle was the thing that triggered the idea. How would it work in practice? Well, first, the Twitter gang couldn't necessarily know what kind of people would take up Twitter the most, so the cultural fit of Twitter is actually mostly self organizing. The people and groups taking part in Twitter select themselves for this experiment. Now, of course there are lots of overlapping groups, and that gives even more interesting possibilities for the sociodynamics of meme transfer between non-overlapping social circles.

Take the Ruby people, who have a quite significant presence on Twitter. They are one of the test groups in my theory, and the TechCrunch article was a very directed way of inserting a meme and seeing how fast and to how many it propagated. It was very easy to insert this into the blogosphere, since Twitter could have had any number of people "leak" the information in the TechCrunch article. Once it was out, they just needed to set up some suitable filters and follow the spread. They also inserted a counter meme, through one of their employees, to see if it worked as an "antidote" to the first meme, or which one of them was stronger. All in all, I think they got enough material for several research articles out of this stunt.

OK, so really, you don't need to grasp for conspiracy theories to explain the TC debacle. It's not necessary, so Occam's razor demands that we choose the simplest available conclusion that explains all the facts. This theory does not fall into this category. But it's still an entertaining notion.

And I predict that sooner or later, someone will use Twitter, or another network like it, to do this kind of research. It's a question of time. This kind of information is far too valuable, both for marketing purposes and for the understanding of the human mind, for it not to happen. The question is: will you know about it, when you're participating in their research?

Let's not forget the lovely recursive interpretation that this blog post is a way of doing the same kind of research I've just described.

Wednesday, April 23, 2008

RbYAML in Google Summer of Code

Great news for all Ruby implementations around. A project to bring RbYAML up to date and make it perform better has been accepted for Google Summer of Code. Long Sun is the name of the student, and Xue Yong Zhi and I will jointly mentor this effort.

In fact, I'm very excited about this news. RbYAML was an incredibly important piece of the puzzle to get JRuby to finally work with RubyGems, and that kickstarted our possibilities to start testing numerous other applications. I soon ported RbYAML to Java, and created the JvYAML and JvYAMLb projects, to get better efficiency. Sadly, this left RbYAML without any TLC. That changed a while back when Rubinius picked up the project to get their YAML support going, and now that Long Sun will work on it, hopefully we will finally get an extremely compliant and bug free YAML implementation for Ruby.

This will obviously benefit Rubinius, but it will also be very good for both JRuby and IronRuby. The work will be test-driven which means a more complete test suite will be built around YAML in Ruby.

If you're interested in following the project, it's now hosted at Google Code (due to problems with RubyForge from China) at http://code.google.com/p/rbyaml/. Long Sun will also blog about his progress here: http://rbyaml.blogspot.com/.

Exciting news indeed.

Tuesday, April 22, 2008

Ruby Design meeting

Yesterday marked the first of the Ruby design meetings, where most of the Ruby implementers got together on IRC and started hashing out the solutions to several current concerns, and worked on getting more cooperation in the design of Ruby features.

This is slated to become a weekly meeting, and it's a huge deal. This will make the lives of all Ruby implementations much easier, and the meeting yesterday actually accomplished some very nice things.

There were representatives from JRuby, Rubinius and MacRuby present, and of course also Matz, Koichi, Nobu and Tanaka from the Ruby core team.

Some of the highlights were a decision to start working on a common API for MultiVM and initial acceptance of adding the RubySpecs (coming from Rubinius originally) into the 1.8 and 1.9 build process, meaning that regression testing will be much better from now on, and that most implementations will use the same specs for compliance testing. In reality, this takes us one step closer to a real executable specification that everyone agrees on and that has official blessing from Matz.

In conjunction with this, a decision to set up continuous integration for 1.8 and 1.9 was made. The exact practicalities are still to be decided, but the decision to get it done is also very important.

All in all, this is excellent news, and I'm feeling extremely hopeful about more cooperation between the Ruby implementers.

If you're interested in exactly what happened, you can find the agenda, action items and log here: http://ruby-design.pbwiki.com/Design20080421.

Tuesday, April 15, 2008

Connecting languages (or polyglot programming example 1)

Today I spent some time connecting two languages that are finding themselves popular for solving wildly different kinds of problems. I wanted to see how easy it was, and whether it would be a workable solution if you wanted to take advantage of the strengths of both languages. The verdict is really up to you. My 15-minute experiment is what I'll discuss here.

If you'd like, you can see this as a practical example of the sticky part where two languages meet, in language-oriented programming.

The languages under consideration are Ruby and Erlang. The prerequisite reading is this excellent article by my colleague Dennis Byrne: Integrating Java and Erlang.

The only important part is in fact the mathserver.erl code, which you can see here:

-module(mathserver).
-export([start/0, add/2]).

start() ->
    Pid = spawn(fun() -> loop() end),
    register(mathserver, Pid).

loop() ->
    receive
        {From, {add, First, Second}} ->
            From ! {mathserver, First + Second},
            loop()
    end.

add(First, Second) ->
    mathserver ! {self(), {add, First, Second}},
    receive
        {mathserver, Reply} -> Reply
    end.

Follow Dennis' instructions to compile this code and start the server in an Erlang console, and then leave it there.

Now, using this service is really easy from Erlang. You can just use the mathserver:add/2 operation directly or remotely. But doing it from another language, in this case Ruby, is a little bit more complicated. I will make use of JRuby to solve the problem.

So, the client file for using this code will look like this:

require 'erlang'

Erlang::client("clientnode", "cookie") do |client_node|
  server_node = Erlang::OtpPeer.new("servernode@127.0.0.1")
  connection = client_node.connect(server_node)

  connection.sendRPC("mathserver", "add", Erlang::list(Erlang::num(42), Erlang::num(1)))

  sum = connection.receiveRPC

  p sum.int_value
end

OK, I confess. There is no erlang.rb yet, so I made one. It includes some very small things that make interfacing with Erlang a bit easier. But it's still quite straightforward what's going on. We're creating a named node with a specific cookie, connecting to the server node, and then using sendRPC and receiveRPC to do the actual operation. The missing code for the erlang.rb file should look something like this (I did the minimal amount here):

require 'java'
require '/opt/local/lib/erlang/lib/jinterface/priv/OtpErlang.jar'

module Erlang
  import com.ericsson.otp.erlang.OtpSelf
  import com.ericsson.otp.erlang.OtpPeer
  import com.ericsson.otp.erlang.OtpErlangLong
  import com.ericsson.otp.erlang.OtpErlangObject
  import com.ericsson.otp.erlang.OtpErlangList
  import com.ericsson.otp.erlang.OtpErlangTuple

  class << self
    def tuple(*args)
      OtpErlangTuple.new(args.to_java(OtpErlangObject))
    end

    def list(*args)
      OtpErlangList.new(args.to_java(OtpErlangObject))
    end

    def client(name, cookie)
      yield OtpSelf.new(name, cookie)
    end

    def num(value)
      OtpErlangLong.new(value)
    end

    def server(name, cookie)
      server = OtpSelf.new(name, cookie)
      server.publish_port

      while true
        yield server, server.accept
      end
    end
  end
end

As you can see, this is regular simple code to interface with a Java library. Note that you need to find where JInterface is located in your Erlang installation and point to that (and if you're on MacOS X, the JInterface that comes with ports doesn't work. Download and build a new one instead).

There are many things I could have done to make the API MUCH easier to use. For example, I might add some methods to OtpErlangPid, so you could do something like:

pid << [:call, :mathserver, :add, [1, 2]]

where the << operator sends a message after transforming the arguments.

In fact, it would be exceedingly simple to make the JInterface API downright nice to use, getting the goodies of Erlang while retaining the Ruby language. And oh yeah, this could work on MRI too. There is an equivalent C library for interacting with Erlang, and there could either be a native extension for doing this, or you could just wire it up with DL.

If you read the erlang.rb code carefully, you might have noticed that there are several methods not in use currently. Say, why are they there?

Well, it just so happens that we don't actually have to use any Erlang code in this example at all. We could just use the Erlang runtime system as a large messaging bus (with fault tolerance and error handling and all that jazz of course). Which means we can create a server too:

require 'erlang'

Erlang::server("servernode", "cookie") do |server, connection|
 terms = connection.receive
 arguments = terms.element_at(1).element_at(3)
 first = arguments.element_at(0)
 second = arguments.element_at(1)

 sum = first.long_value + second.long_value
 connection.send(connection.peer.node, Erlang::tuple(server.pid, Erlang::num(sum)))
end

The way I created the server method, it will accept connections and invoke the block every time it accepts a connection. This connection is yielded to the block together with the actual node object representing the server. The reason the terms are a little bit convoluted is because the sendRPC call actually adds some things that we can just ignore in this case. But if we wanted, we could check the first atoms and do different operations based on these.

You can run the above code as the server, and use the exact same math client code if you want. For ease of testing, switch the name to servernode2 in both server and client, and then run them. You have just sent Erlang messages from Ruby to Ruby, passing along Java on the way.

Getting different languages working together doesn't need to be hard at all. In fact, it can be downright easy to switch to another language for a few operations that don't suit the current language that well. Try it out. You might be surprised.

Saturday, April 12, 2008

Pragmatic Static Typing

I have been involved in several discussions about programming languages lately and people have assumed that since I spend lots of time in the Ruby world and with dynamic languages in general I don't like static typing. Many people in the dynamic language communities definitely express opinions that sound like they dislike static typing quite a lot.

I'm trying to not sound defensive here, but I feel the need to clarify my position on the whole discussion. Partly because I think that people are being extremely dogmatic and short-sighted by having an attitude like that.

Most of my time I spend coding in Java and in Ruby. My personal preference is for languages such as Common Lisp and Io, but there is no real chance to use them in my day-to-day work. Ruby neatly fits the purpose of a dynamic language that is close to Lisp for my taste. And I'm involved in JRuby because I believe that there is great worth in the Java platform, but also that many Java programmers would benefit from staying less in the Java language.

I have done my time with Haskell, ML, Scala and several other quite statically typed languages. In general, the reason I don't speak that much about those languages is that I have less exposure to them in my day-to-day life.

But this is the thing. I don't dislike static typing. Absolutely not. It's extremely useful. In many circumstances it gives me things that I really can't have in a dynamic language.

Interesting thought: Smalltalk is generally called a dynamic language with very late binding. There are no static type tags and no type inference happening. The only type checking that happens will happen at runtime. In this regard, Smalltalk is exactly like Ruby. The main difference is that when you're working with Smalltalk, it is _always_ runtime. Because of the image based system, the type checking actually happens when you do the programming. There is no real difference between coding time and runtime. Interestingly, this means that Smalltalk tools and environments have most of the same features as a static programming language, while still being dynamic. So a runtime based image system in a dynamic, late bound programming language will actually give you many of the benefits of static typing at compile time.

So the main takeaway is that I really believe that static typing is extremely important. It's very, very useful, but not in all circumstances. The fact that we reach for a statically typed programming language by default is something we really need to work on though, because it's not always the right choice. I'm going to say this in even stronger words. In most cases a statically typed language is a premature optimization that gets very much in the way of productivity. That doesn't mean you shouldn't choose a statically typed language when it's the best solution for the problem. But this should be a conscious choice and not by fiat, just because Java is one of the dominant languages right now. And if you need a statically typed language, make sure that you choose one that doesn't revel in unnecessary type tags. (Java, C#, C++, I'm looking at you.) My current choice for a static language is Scala - it strikes a good balance in most cases.

A statically typed language with type inference will give you some of the same benefits as a good dynamic language, but definitely not all of them. In particular, you get different benefits and a larger degree of flexibility from a dynamic language that can't be achieved in a static language. Neal Ford and others have been talking about the distinction between dynamic and static typing as being incorrect. The real question is between essence and ceremony. Java is a ceremonious language because it needs you to do several dances to the rain gods to declare even the simplest form of method. In an essential language you will say what you need to say, but nothing else. This is one of the reasons dynamic languages and type-inferred static languages sometimes look quite alike - it's the absence of ceremony that people react to. That doesn't mean any essential language can be replaced by another. And with regards to ceremony - don't use a ceremonious language at all. Please. There is no reason and there are many alternatives that are better.

My three level architecture of layers should probably be updated to say that the stable layer should be an essential, statically typed language. The soft/dynamic layer should almost always be a strongly typed dynamic essential language, and the DSL layer stays the same.

Basically, I believe that you should be extremely pragmatic with regards to both static and dynamic typing. They are tools that solve different problems. But our industry today has a tendency to be very dogmatic about these issues, and that's the real danger I think. I'm happy to see language-oriented programming and polyglot programming get more traction, because they improve a programmer's pragmatic sensibilities.

Thursday, March 20, 2008

The contract of IO#read

It's interesting. After Charlie made an immense effort and rewrote our IO system, basically from scratch, I have started to find bugs. But these are generally not bugs in the IO code, but bugs in Ruby libraries that depend on the way MRI usually works. One of the more annoying ones is IO#read(n), where n is the length you want to read.

This method is not guaranteed to return a string of length n, even if we haven't hit EOF yet. You can NEVER be sure that what you get back is the length you requested. Ever. If you have code that doesn't check the length of the returned string from read, you are almost guaranteed to have a bug just waiting to happen.

Of course, it might work perfectly on your machine and every other machine you test it on. The reason for this is that read(n) will usually return n bytes, but that depends on the socket implementation or file reading implementation of the operating system, it depends on the size of the cache in the network interface, it depends on network latency, and many other things. Please, just make sure to check the return value's length before going ahead and using it.
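One defensive way to do that check is to loop until you have what you asked for. This is only a sketch: read_fully is a hypothetical helper name, not part of Ruby's IO API, and it assumes the object follows the read(n) convention of returning nil at EOF.

```ruby
require 'stringio'

# Hypothetical helper: keep calling read until n bytes have arrived
# or the stream reports EOF, whichever comes first.
def read_fully(io, n)
  buffer = String.new
  while buffer.bytesize < n
    chunk = io.read(n - buffer.bytesize)
    break if chunk.nil? || chunk.empty? # EOF, nothing more to read
    buffer << chunk
  end
  buffer
end

data = read_fully(StringIO.new("hello world"), 5)
raise "short read" unless data.bytesize == 5
puts data
```

The caller still has to compare the returned length with n, since hitting EOF early produces a shorter string.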

Case in point: net/ldap has this exact problem. Look in net/ber.rb and you will see. There are also two - possibly three - bugs (couple of years old too) that report different failures because of this.

One thing that makes this problem hard to find is the fact that if you insert debug statements, you will affect the timing in such a way that the code might actually work with the debug statements but not without them.

Oh, did I mention that StringIO#read works the same way? It has exactly the same guarantees as IO#read.

Monday, February 25, 2008

Language design philosophy: more than one way?

When talking about how dynamic scripting languages are designed, people have a tendency to divide them into "There is more than one way to do it" and "There is one way to do it". Perl is the quintessential example of "more than one way", and Python is the opposite, going so far as to make it impossible to have your own indentation.

These different ways of doing things really divide programmers. Some hate the way Python gives you a bondage strait jacket and no scissors, while some programmers love that they don't need to make any choices and that everyone's code will be equally readable.

From a pure perspective, the Python philosophy seems to be the right one, but it just doesn't work for me. In the same way, I agree with Perl's way of doing things, but the problems that causes in most Perl code are just amazing. Sometimes it feels like the Python way is a direct result of someone reacting extremely badly to a Perl code base and deciding to never allow that to happen in Python.

So what's the point? Well, the point is Ruby. In fact, Ruby has almost all of the flexibility of Perl to do things in different ways. But at the end of the day, none of the Perl problems tend to show up. Why is this? And why do I feel so comfortable in Ruby's "There is more than one way to do it" philosophy, while the Perl one scares me?

I think it comes down to language design. The Python approach is impossible for the simple reason that what the language designer chooses is going to be the "one way", by fiat. Some people will agree, and some will not. But what I'm seeing in Ruby is that the many ways have been transformed into idioms and guidelines. There are no hard rules, but the community has gradually evolved idioms that work and found out many of the ways that don't work. This seems to be the right way - if you make the choice as a language designer, you have actually chosen the people that will use your language: they are going to be the people who don't dislike the language designer's choices. But if you leave it open enough for evolutionary community design to happen you can actually get the best of both worlds: both a best way to do things, and something that works for a much larger percentage of the programmer world.

I have come to believe that this is one of the major reasons that Ruby feels so good to me, and is such a good language. And it's a lesson for language designers. When you choose a philosophy, make sure to take the possibility of communities and idiom evolution into account.

Friday, January 18, 2008

Ruby antipattern: Using eval without positioning information

I have noticed that the default way of calling eval, instance_eval(String) and module_eval(String) almost never does what you want unless you supply positioning information. Oh, it will execute all right, and provided you have no bugs in your code, everything is fine. But sooner or later you or someone else will need to debug the code in that eval. In those cases it's highly annoying when the full error message is "(eval):1: redefining constant Foo". Sure, in most cases you can still get all the information. But why not just make sure that everything needed is there to trace the call?

I would recommend changing all places where you have eval, or where you are using instance_eval or module_eval with a string argument, into the version that takes a file and line number:

eval("puts 'hello world'")
# becomes
eval("puts 'hello world'", binding, __FILE__, __LINE__)

"str".instance_eval("puts self")
# becomes
"str".instance_eval("puts self", __FILE__, __LINE__)

String.module_eval("A=1")
# becomes
String.module_eval("A=1", __FILE__, __LINE__)
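To see what the extra arguments buy you, here is a small sketch that compares the first backtrace line with and without positioning information. The helper failure_location and the file name generated_view.rb are invented for the example.

```ruby
# Hypothetical helper: evaluates code and returns the first backtrace
# line of the error it raises, or nil if nothing went wrong.
def failure_location(code, *position)
  eval(code, binding, *position)
  nil
rescue => e
  e.backtrace.first
end

without_position = failure_location("raise 'boom'")
with_position    = failure_location("raise 'boom'", "generated_view.rb", 42)

puts without_position # some form of "(eval...)", exact text depends on the Ruby version
puts with_position    # begins with generated_view.rb:42
```

In real code you would pass __FILE__ and __LINE__ as in the examples above, so the reported location is the place where the string was built.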

Thursday, January 03, 2008

Scala testing with specs

So. The story about unit testing is over for now. I will use specs. Eric did an incredible job and very quickly got the JUnit support working with JUnit 4 too. Today I reintegrated it, and everything works fine.

So, if you want working Ant integration for your Scala testing, I recommend specs.

Of course, this still doesn't explain why all the other alternatives failed so miserably. Hopefully there will be a bit more testing in the community soon. Testing frameworks need competition to evolve well.

It's funny. One of the points to my original post about Scala unit testing was this quote: "Now some lovely Ruby people are looking at Scala, and the very first thing they must do (of course) is write the sacred unit tests:".

I'm not sure about you, but I would say that that's a good statement about Ruby people in general, if that's the way people view us. =)

Friday, December 28, 2007

JtestR 0.1 released

If people have wondered, this is what I have been working on in my spare time the last few weeks. But now it's finally released! The first version of JtestR.

So what is it? A library that allows you to easily test your Java code with Ruby libraries.

Homepage: http://jtestr.codehaus.org
Download: http://dist.codehaus.org/jtestr

JtestR 0.1 is the first public release of the JtestR testing tool. JtestR integrates JRuby with several Ruby frameworks to allow painless testing of Java code, using RSpec, Test/Unit, dust and Mocha.

Features:

Integrates with Ant and Maven
Includes JRuby 1.1, Test/Unit, RSpec, dust, Mocha and ActiveSupport
Customizes Mocha so that mocking of any Java class is possible
Background testing server for quick startup of tests
Automatically runs your JUnit codebase as part of the build

Getting started: http://jtestr.codehaus.org/Getting+Started

Team:
Ola Bini - ola.bini@gmail.com
Anda Abramovici - anda.abramovici@gmail.com

Monday, December 24, 2007

Code size and dynamic languages

I've had a fun time the last week noting the reactions to Steve Yegge's latest post (Code's Worst Enemy). Now, Yegge always manages to write stuff that generates interesting - and in some cases insane - comments. This time, the results are actually quite a bit more aligned. I'm seeing several trends, the largest being that having generated a 500K LOC code base in the first place is a sin against mankind. The second one is that you should never have one code base that's so large; it should be modularized into several hundred smaller projects/modules. The third reaction is that Yegge should be using Scala for the rewrite.

Now, from my perspective I don't really care that he managed to generate that large of a code base. I think any programmer could fall down the same tar pit, especially if it's over a long period of time. Secondly, you don't need to be a single programmer to get this problem. I would wager that there are millions of heinous code bases like this, all over the place. So my reaction is rather the pragmatic one: how do you actually handle the situation if you find yourself in it? Provided you understand the whole project and have the time to rewrite it, how should it be done? The first step, in my opinion, would probably be to not do it alone. The second step would be to do it in small steps, replacing small parts of the system and writing unit tests as you go.

But at the end of the day, maybe a totally new approach is needed. So that's where Yegge chooses to go with Rhino as the implementation language. Now, if I had tackled the same problem, I would never reimplement the whole application in Rhino - rather, it would be more interesting to try to find the obvious place where the system needs to be dynamic and split it there, keep those parts in Java and then implement the new functionality on top of the stable Java layer. Emacs comes to mind as a typical example, where the base parts are implemented in C, but most of the actual functionality is implemented in Emacs Lisp.

The choice of language is something that Stevey gets a lot of comments about. People just can't seem to understand why it has to be a dynamic language. (This is another rant, but people who comment on Stevey's blog seem to have a real hard time distinguishing between static typing and strong typing. Interesting that.) So, one reason is obviously that Stevey prefers dynamic typing. Another is that hotswapping code is one of those intrinsic features of dynamic languages that are really useful, especially in a game. The compilation stage just gets in the way at that level, especially if we're talking something that's going to live for a long time, and hopefully not have any down time. I understand why Scala doesn't cut it in this case. As good as Scala is, it's good exactly because it has a fair amount of static features. These are things that are extremely nice for certain applications, but they don't fit the top level of a system that needs to be malleable. In fact, I'm getting more and more certain that Scala needs to replace Java, as the semi stable layer beneath a dynamic language, but that's yet another rant. At the end of it, something like Java needs to be there - so why not make that thing be a better Java?

I didn't see too many comments about Stevey's ideas about refactoring and design patterns. Now, refactoring is a highly useful technique in dynamic languages too. And I believe Stevey is wrong in saying that refactorings almost always increase the code size. The standard refactorings tend to cause that in a language like Java, but that's more because of the language. Refactoring in itself is really just a systematic way of making small, safe changes to a code base. The end result of refactoring is usually a cleaner code base, better understanding of that code base, and easier code to read. As such, it is as applicable to dynamic languages as to static ones.

Design patterns are another matter. I believe they serve two purposes - the first and more important being communication. Patterns make it easier to to understand and communicate high level features of a code base. But the second purpose is to make up for deficiencies in the language, and that's mostly what people see when talking about design patterns. When you're moving in a language like Lisp, where most design patterns are already in the language, you tend to not need them for communication as much either. Since the language itself provides ways of creating new abstractions, you can use those directly, instead of using design patterns to create "artificial abstractions".

As a typical example of a case where a design pattern is totally invisible due to language design, take a look at the Factory. Now, Ruby has factories. In fact, they are all over the place. Let's take a very typical example: the Class.new method that you use to create new instances of a class. New is just a factory method. In fact, you can reimplement new yourself:

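class Class
  # new is an ordinary method: allocate, initialize, return
  def new(*args, &block)
    object = self.allocate
    object.send :initialize, *args, &block  # forward a block to initialize too
    object
  end
end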

You could drop this code into any Ruby project, and everything would continue to work like before. That's because the new-method is just a regular method. The behavior of it can be changed. You can create a custom new method that returns different objects based on something:

class Werewolf;end
class Wolf;end
class Man;end

class << Werewolf
 def new(*args)
   object = if $phase_of_the_moon == :full
              Wolf.allocate
            else
              Man.allocate
            end
   object.send :initialize, *args
   object
 end
end

$phase_of_the_moon = :half
p Werewolf.new

$phase_of_the_moon = :full
p Werewolf.new

Here, creating a new Werewolf will give you an instance of either Man or Wolf depending on the phase of the moon. So in this case new is actually creating and returning objects that are not even instances of subclasses of Werewolf. So new is just a factory method. Of course, the one lesson we should all take from Factory is that, if you can, you should name your things better than "new". And since there is no difference between new and other methods in Ruby, you should definitely make sure that the methods that create objects have the right names.
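To make the naming point concrete, here is a small sketch of the same factory under a more descriptive name. The method name for_moon_phase and its argument are invented for illustration; the argument replaces the global variable used in the example above.

```ruby
class Wolf; end
class Man; end

class Werewolf
  # Hypothetical named factory method: the name says what decides the result.
  def self.for_moon_phase(phase)
    phase == :full ? Wolf.new : Man.new
  end
end

p Werewolf.for_moon_phase(:half).class  # Man
p Werewolf.for_moon_phase(:full).class  # Wolf
```

The caller can now see from the call site that the result depends on the moon phase, which a plain Werewolf.new hides.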

Friday, December 21, 2007

Ruby closures and memory usage

You might have seen the trend - I've been spending time looking at memory usage in larger applications. Specifically, what I've been looking at is mostly deployments where a large number of JRuby runtimes are needed - but don't let that scare you. This information is exactly as applicable to regular Ruby as to JRuby.

One of the things that can really cause unintended high memory usage in Ruby programs is long-lived blocks that close over things you might not intend. Remember, a closure actually has to close over all local variables, the surrounding blocks and also the living self at that moment.

Say that you have an object of some kind that has a method that returns a Proc. This proc will get saved somewhere and live for a long time - maybe even becoming a method with define_method:

class Factory
  def create_something
    proc { puts "Hello World" }
  end
end

block = Factory.new.create_something

Notice that this block doesn't even care about the actual environment it's created in. But as long as the variable block is still live, or something else points to the same Proc instance, the Factory instance will also stay alive. Think about a situation where you have an ActiveRecord instance of some kind that returns a Proc. Not an uncommon situation in medium to large applications. But the side effect will be that all the instance variables (and ActiveRecord objects usually have a few) and local variables will never disappear. No matter what you do in the block. Now, as I see it, there are really three different kinds of blocks in Ruby code:
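One way to see the captured self is to ask the Proc for its binding. This is only a minimal sketch: it assumes a Ruby recent enough to have Binding#receiver (2.2 or later), and the @payload instance variable is made up to stand in for the state a real object would carry.

```ruby
class Factory
  def initialize
    @payload = "x" * 1_000  # hypothetical state standing in for real instance data
  end

  def create_something
    proc { puts "Hello World" }
  end
end

block = Factory.new.create_something

# The block never mentions self, but its binding still references the Factory.
captured = block.binding.receiver
p captured.class                                  # Factory
p captured.instance_variable_get(:@payload).size  # 1000
```

As long as block is reachable, so is the Factory instance and everything it points to.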

1. Blocks that process something without needing access to variables outside. (Stuff like [1,2,3,4,5].select {|n| n%2 == 0} doesn't need a closure at all.)
2. Blocks that process or do something based on living variables.
3. Blocks that need to change variables on the outside.

What's interesting is that 1 and 2 are much more common than 3. I would imagine that this is because number 3 is really bad design in many cases. There are situations where it's really useful, but you can get really far with the first two alternatives.
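For reference, the three kinds of blocks can be sketched side by side. The variable names here are purely illustrative.

```ruby
# 1. Needs nothing from the outside.
evens = [1, 2, 3, 4, 5].select { |n| n % 2 == 0 }

# 2. Reads a surrounding variable.
limit = 3
small = [1, 2, 3, 4, 5].select { |n| n < limit }

# 3. Changes a surrounding variable.
total = 0
[1, 2, 3, 4, 5].each { |n| total += n }

p evens  # [2, 4]
p small  # [1, 2]
p total  # 15
```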

So, if you find yourself using long-lived blocks that might leak memory, consider isolating their creation in as small a scope as possible. The best way to do that is something like this:

o = Object.new
class << o
  def create_something
    proc { puts "Hello World" }
  end
end
block = o.create_something

Obviously, this is overkill unless you know that the block needs to be long lived and that it would otherwise capture things it shouldn't. The way it works is simple - just create a new clean Object instance, define a singleton method on that instance, and use that singleton method to create the block. The only object that will be captured is the "o" instance. Since "o" doesn't have any instance variables that's fine, and the only local variables captured will be the ones in the scope of the create_something method - which in this case doesn't have any.

Of course, if you actually need values from the outside, you can be selective and only scope the values you actually need - unless you have to change them, of course:

o = Object.new
class << o
 def create_something(v, v2)
   proc { puts "#{v} #{v2}" }
 end
end
v = "hello"
v2 = "world"
v3 = "foobar" #will not be captured by the block
block = o.create_something(v, v2)

In this case, only "v" and "v2" will be available to the block, through the usage of regular method arguments.
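This can be checked by inspecting the block's binding. The sketch below assumes Ruby 2.2 or later for Binding#local_variables and Binding#local_variable_defined?, and the block returns the string instead of printing it so the result is easy to inspect.

```ruby
o = Object.new
class << o
  def create_something(v, v2)
    proc { "#{v} #{v2}" }
  end
end

v = "hello"
v2 = "world"
v3 = "foobar" # will not be captured by the block
block = o.create_something(v, v2)

b = block.binding
p b.local_variables               # [:v, :v2]
p b.local_variable_defined?(:v3)  # false
p block.call                      # "hello world"
```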

This way of defining blocks is a bit heavyweight, but absolutely necessary in some cases. It's also the best way to get a blank slate binding, if you need that. Actually, to get a true blank slate, you also need to remove all the Object methods from the "o" instance, and ActiveSupport has a library for blank slates. But this is the idea behind it.
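On Ruby 1.9 and later, one possible shortcut is to start from the core BasicObject class instead of stripping methods by hand. This is a swapped-in technique, not the ActiveSupport library mentioned above, and CleanRoom is a hypothetical name used only for this sketch.

```ruby
# Assumption: BasicObject (Ruby 1.9+) stands in for a hand-made blank slate.
class CleanRoom < BasicObject  # CleanRoom is a hypothetical name
  def create_something(v)
    # Kernel#proc is not available on a BasicObject, so Proc is referenced explicitly.
    ::Proc.new { "got #{v}" }
  end
end

block = CleanRoom.new.create_something("hello")
p block.call                                                        # "got hello"
p BasicObject.instance_methods.size < Object.instance_methods.size  # true
```

The self captured by the block is now an object with almost no methods of its own, so code inside the block has to reach for things like ::Kernel explicitly.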

It might seem stupid to care about memory at all these days, but higher memory usage is one of the prices we pay for higher language abstractions. It's wasteful to take it too far though.