Sunday, December 18, 2005

The power of YAML and XML.

My colleagues and I have recently spent some time debating the readability and usability of YAML versus XML. As expected, opinions vary wildly. Because of this, I've spent quite some time thinking about these issues. I really believe that the question is one of power, and that readability, usability and succinctness are what define the power of a data format. This discussion is restricted to formats which are text based, readable by both humans and computers, and mostly used for configuration and data interchange. XML is of course also used for RPC and build scripts, among other things, but these uses are generally agreed to be suboptimal, for many reasons.

So, I will use a simple example in three different encodings. The example is the common configuration of a data source for the use of a library. This is intentionally a simple problem, but it tends to show the main differences between encodings.
For this setup - without using binary or compressed formats - the simplest encoding is probably a line-based file of character-separated values like this (where a hash starts a comment line):
(This example wraps in a stupid way. Just imagine it's only two lines.)

#name, type, database, server, uid, pwd, param1, param2, param3, param4, param5
development, postgres, my_dev, db.dontexist.com, mydev, secretpwd, encoding=LATIN1,,,,

The problem with this approach is apparent: there is no easy way of knowing which field means what. You need a comment just to give the human reader this information. If you for some reason write the password and username in the wrong order, the data format will not catch it. Another very real problem is that as soon as you have to specify information specific to one instance of a database, you have to use a new language inside the field. (In this case the small language is only defined as name=value, but it's still another concept to get used to, and it is not part of the regular data format.) For the same reason, it's not possible to define lists or maps without defining extra syntax for them.
The positive point with this encoding is that it's probably as short as it can be, without being binary or compressed. It's very easy to parse, but it's not generic at all.
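To make both weaknesses concrete, here is a minimal Ruby sketch of a reader for the two-line file above. The names FIELDS and parse_line are made up for this illustration and the splitting rules are my assumption about how such a file would be read. It is not taken from any real library.

```ruby
# Hypothetical reader for the comma-separated format shown above.
# The field order exists only here and in the comment line of the file.
FIELDS = %w[name type database server uid pwd param1 param2 param3 param4 param5]

def parse_line(line)
  # A limit of -1 keeps the trailing empty fields.
  values = line.split(',', -1).map(&:strip)
  record = FIELDS.zip(values).to_h

  # The paramN fields carry a second mini-language (name=value)
  # that needs its own parsing step.
  params = record
           .select { |key, value| key.start_with?('param') && value && !value.empty? }
           .values
           .map { |pair| pair.split('=', 2) }
           .to_h
  [record, params]
end

line = 'development, postgres, my_dev, db.dontexist.com, mydev, secretpwd, encoding=LATIN1,,,,'
record, params = parse_line(line)

# Same line with username and password swapped: it parses without complaint.
swapped_line = 'development, postgres, my_dev, db.dontexist.com, secretpwd, mydev, encoding=LATIN1,,,,'
swapped, = parse_line(swapped_line)

puts record['uid']   # mydev
puts swapped['uid']  # secretpwd, and nothing warned us
```

The format itself only knows about positions, so the swapped line is just as valid as the correct one, and the name=value handling had to be bolted on by hand.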

Example number 2 is a data source definition from JBoss:


It's longer, of course, but it's still a very simple XML document. No namespaces, no entities, no DTDs, no attributes. It's just very basic XML. And sure, it's more readable than the custom format. And it is generic, in the sense that you can define different formats and have a validating parser read them for you. But is it easy to read? I don't think so; there's too much noise. Sure, the element end tags make it easy to see what ends where, but that's not a big deal anyway, since most XML documents use indentation to make the structure easy enough for a human being to read. Compare the previous document with this:

which is the exact same document, but well. I wouldn't edit it without a good editor to check for validity. Especially if it's on a production machine.
As I've said, it's good for reading if it's your first time with the data format, but after a while it gets quite annoying trying to sort out the relevant information from all the end tags. Another problem is how to add simple non-standard parameters. For example, if I wanted to add an Oracle-specific parameter, I would in this example have to change the JDBC connection string. This is not the fault of XML, of course. The other route would have been to add some element like this:
<param name="encoding">LATIN1</param>
to the XML schema. But this defeats the purpose of using end tags to see where each mapping ends, since every parameter now closes with the same end tag. If we have long blobs of information stored in these params, and there is more than one of them, we can't tell from the end tag which one we're in. So that point is lost here.
Another thing which sometimes is a problem is the lack of simple mappings and lists as datatypes in XML. Sure, you can define them in your own schema, or remap CDATA-sections to your own format, but this is not part of the spec, and will never be readable by a standard XML-parser without help from you.
You will never be able to do something like this in DOM: ((List)node.getNodeValue()).iterator() since nodeValue is defined to be a String and there is no intrinsic sequence in XML values.

Lastly, YAML. This example is taken from Rails:

development:
  adapter: postgresql
  database: xyz_dev
  host: db.dontexist.com
  username: railsdev
  password: secretpwd
  encoding: LATIN1

Notice that indentation and newlines matter in YAML. Actually, it's this context sensitivity that makes it extremely readable for human beings, while still being parseable by machines. One thing that is immediately noticeable is that the "encoding" parameter - which is PostgreSQL specific - looks exactly the same as the other parameters. No non-YAML construct is used to represent it. The next interesting point is that even if you've never seen YAML before, you should be able to work out that something called "development" has a few properties attached to it, with the values seen.
In reality, what happens when a standard YAML parser reads this is that a map is created with one entry, whose key is "development" and whose value is another map with the keys and values specified. YAML has support for three datatypes: mapping, sequence and scalar. These types can be specialized to a specific implementation, which means that any object can be serialized as YAML without very much work. In the Ruby implementation every object gets a new method, called to_yaml, which returns the YAML representation of that instance. There is a static method called YAML::load which returns the correct object for the serialized YAML stream sent in.
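As a small sketch of that behaviour, this is roughly what it looks like with the yaml library that ships with Ruby. The variable names and the host names in the sequence are just for this illustration. The configuration text is the Rails-style entry under the "development" key described above.

```ruby
require 'yaml'

# The Rails-style configuration, nested under the "development" key.
yaml_text = <<~CONFIG
  development:
    adapter: postgresql
    database: xyz_dev
    host: db.dontexist.com
    username: railsdev
    password: secretpwd
    encoding: LATIN1
CONFIG

# Mapping: the outer map has one key, whose value is another map.
config = YAML.load(yaml_text)
dev = config['development']
puts dev['encoding']   # LATIN1

# Sequence: a YAML list becomes an ordinary Ruby Array.
hosts = YAML.load("- db1.dontexist.com\n- db2.dontexist.com\n")

# to_yaml serializes the structure again; loading it gives back an equal Hash.
round_trip = YAML.load(config.to_yaml)
puts round_trip == config   # true
```

Note that the PostgreSQL-specific "encoding" entry needs no special handling. It is just one more key in the inner map.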

All three formats have much more advanced features, of course. You can do whatever you want with the specific format. XML has schemas, validation, XPath and much more. YAML has tags, aliases and some more things. But when looking at all this from the perspective of what you can easily represent inside the language, without making new meta-language constructs, while still retaining readability and ease of understanding, well. My vote is on YAML. I have seen wrong end tags in XML more times than I can count, and I still find it impractical to extract relevant information from convoluted documents. Have you ever tried reading a slightly complex WSDL definition? Changing it by hand? It isn't fun.

Of course, S-expressions are a good alternative, with many of the nice points of YAML but without the dependency on whitespace for structure. But the whitespace is fun. I look at the lists and text files I've scattered all over my computer, filled with notes for myself, and I realize that most of them could be read as YAML without a change. That is more or less how I write lists and mappings for myself, without ever intending them to be read by computers. I sure don't write notes for myself in XML.
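For comparison, here is how the same data source could look as an S-expression, together with a deliberately tiny Ruby reader. Both the layout of the expression and the helper names tokenize and read_sexp are my own assumptions for this sketch. It handles only bare atoms and parentheses, so no strings with spaces, quoting or comments.

```ruby
# Split the text into "(", ")" and atoms by padding parentheses with spaces.
def tokenize(text)
  text.gsub('(', ' ( ').gsub(')', ' ) ').split
end

# Recursively turn the token stream into nested Ruby arrays.
def read_sexp(tokens)
  token = tokens.shift
  raise ArgumentError, 'unexpected end of input' if token.nil?
  raise ArgumentError, 'unexpected )' if token == ')'
  return token unless token == '('

  list = []
  list << read_sexp(tokens) until tokens.empty? || tokens.first == ')'
  raise ArgumentError, 'missing )' if tokens.empty?
  tokens.shift # drop the closing parenthesis
  list
end

one_line = '(development (adapter postgresql) (database xyz_dev) (host db.dontexist.com))'

indented = <<~SEXP
  (development
    (adapter postgresql)
    (database xyz_dev)
    (host db.dontexist.com))
SEXP

tree = read_sexp(tokenize(one_line))

# Whitespace carries no structure: both layouts read as the same tree.
same = (tree == read_sexp(tokenize(indented)))

# Turn the tree into the same map-of-maps shape a YAML parser would produce.
config = { tree.first => tree.drop(1).to_h }
puts config['development']['adapter']   # postgresql
```

The parentheses do the job that indentation does in YAML, which is why the one-line and the indented version are interchangeable.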

I would never program in YAML, or make a Turing-complete language out of it, but for pure data representation that should be easily readable and writable by humans, while still being as succinct as possible, it's probably the best thing right now.
But should there ever be a need to use the data to drive programs, S-expressions are the way to go.

A last note: if you believe YAML to be too complex, try out JSON. It is, for practical purposes, a subset of YAML, and should be readable and writable by YAML-compliant processors. Much the same goes for simple Java properties files, as long as they stick to the "key: value" form.
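A quick way to check the JSON claim for yourself is to feed the same JSON text to both a YAML parser and a JSON parser. The sketch below does that with the yaml and json libraries bundled with Ruby. The document itself is a made-up example, and one sample of course proves nothing about edge cases.

```ruby
require 'yaml'
require 'json'

# A made-up JSON document with a mapping, a sequence and a few scalars.
json_text = '{"development": {"adapter": "postgresql", "port": 5432, "replicas": ["db1", "db2"], "pooled": true}}'

from_yaml = YAML.load(json_text)   # read as YAML flow syntax
from_json = JSON.parse(json_text)  # read as JSON

puts from_yaml == from_json   # true
```

The curly braces and square brackets are YAML's own flow style for mappings and sequences, which is why the YAML parser accepts the text as it is.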

2 comments:

Anonymous said...

Interesting website with a lot of resources and detailed explanations.

Anonymous said...

Your site is on top of my favourites - Great work I like it.