Tuesday, September 19, 2006

YAML needs schema

It has been said before and it needs to be said again. YAML really needs schema. Now, before all your enterprisey warning bells start ringing, I want to add that I'm only proposing this for specific applications. Most uses of YAML can happily continue without any need for schema. But for some cases the security and validation capabilities of a good YAML schema would be invaluable. One example is RubyGems: it shouldn't be possible to crash RubyGems with bad YAML. Also, in all cases where Ruby emits objects as YAML, it should be possible to automatically generate a schema specification from the object structure. This means that in many cases you may not need to create your schema by hand. You could just serialize your domain objects to YAML, take the generated schema and modify it as needed.
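
As a rough illustration of generating a schema from an existing structure, here is a minimal sketch. It works on plain loaded data rather than full domain objects, and the schema layout (the `type`, `mapping` and `sequence` keys) and the `infer_schema` helper are invented for this example, not an existing format or API.

```ruby
require 'yaml'

# Hypothetical schema inference: walk a loaded YAML structure and
# describe each node by its type. The layout used here ("type",
# "mapping", "sequence") is an assumption made for this sketch.
def infer_schema(node)
  case node
  when Hash
    { 'type' => 'map',
      'mapping' => node.to_h { |key, value| [key.to_s, infer_schema(value)] } }
  when Array
    # Assume a homogeneous sequence and describe it by its first element.
    { 'type' => 'seq',
      'sequence' => node.empty? ? nil : infer_schema(node.first) }
  when Integer then { 'type' => 'int' }
  when Float then { 'type' => 'float' }
  when true, false then { 'type' => 'bool' }
  when nil then { 'type' => 'null' }
  else { 'type' => 'str' }
  end
end

document = YAML.safe_load(<<~DOC)
  name: rake
  version: 1
  authors:
    - Jim
DOC

schema = infer_schema(document)
puts schema.to_yaml
```

The generated structure is only a starting point; the idea is that you would then edit it by hand, for instance to mark keys as optional.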

What would the advantages of YAML schema be? Numerous:
  • Validation: Validate that a YAML file conforms to your expectations before loading it
  • Default values: The possibility to provide default values for missing parts of the YAML, making convention over configuration even more powerful. With reasonable defaults most YAML documents could shrink dramatically in size.
  • Tool help: GUI builders and other tools would be able to help you construct your YAML file from scratch. I like being able to auto-complete XML with nXML in Emacs. Very neat. I just wish I had that capability with yaml-mode too.
  • Loading hints and instructions: A schema could specify that the key named 'foo' always has a value with the tag !ruby/object:Gem::Specification or that all integer values should be decimal, regardless of leading zeroes. Many instructions that you at this point need to customize your YAML system to achieve would be automatic.
  • Remove clutter from the YAML file: If the schema defines the tags for values, this information doesn't need to appear in the YAML file itself, reducing clutter and noise. This would make it even easier to edit YAML files by hand.
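
To make the validation and default-value points concrete, here is a minimal sketch of what a schema-aware loader might do. The schema vocabulary (`type`, `required`, `default`) and the `apply_schema` helper are assumptions made up for this example; no such format is defined anywhere yet.

```ruby
require 'yaml'

# Hypothetical schema, itself written in YAML. The keys "type",
# "required" and "default" are invented for this sketch.
schema = YAML.safe_load(<<~SCHEMA_DOC)
  name:
    type: str
    required: true
  version:
    type: int
    default: 1
  platform:
    type: str
    default: ruby
SCHEMA_DOC

# Checks a flat mapping against the schema and fills in defaults.
# Returns [document_with_defaults, errors].
def apply_schema(document, schema)
  type_classes = { 'str' => String, 'int' => Integer }
  errors = []
  result = document.dup
  schema.each do |key, rule|
    if !result.key?(key)
      if rule.key?('default')
        result[key] = rule['default']
      elsif rule['required']
        errors << "missing required key '#{key}'"
      end
    elsif !result[key].is_a?(type_classes.fetch(rule['type']))
      errors << "key '#{key}' should be #{rule['type']}"
    end
  end
  # Keys the schema does not mention are reported rather than ignored.
  (result.keys - schema.keys).each { |key| errors << "unexpected key '#{key}'" }
  [result, errors]
end

good, good_errors = apply_schema(YAML.safe_load("name: rake"), schema)
bad, bad_errors = apply_schema(YAML.safe_load("version: two"), schema)

puts good.inspect
puts bad_errors.inspect
```

With defaults supplied by the schema, the first document can shrink to a single line, while the second is rejected before any application code touches it.
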
A YAML schema format should be specified in YAML, and it should be self-hosting (meaning its format language should be definable in itself). For the most part it seems we can use ideas from XML Schema. The only part I'm not really sure about for YAML schema is how to bind a document to a schema. Maybe the best way would just be to add a new directive that specifies the schema for that document. I don't believe that YAML needs different schema for different parts of documents right now, though. I don't think we need the proliferation of schema metadata inside the YAML document that XML experiences. (Anyone tried to manually work with a WSDL file which includes all required namespaces and such? Nightmare!)
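
One way the directive idea could look is sketched below. The `%SCHEMA` directive name and the `split_schema_directive` helper are purely hypothetical. Since an existing parser may reject a directive it does not know, this sketch pulls the line out of the stream header as plain text before handing the rest to the YAML library.

```ruby
require 'yaml'

# Hypothetical directive: "%SCHEMA <location>" in the stream header,
# before the "---" document marker. Returns [schema_location, body].
def split_schema_directive(text)
  directive_pattern = /\A%SCHEMA\s+(\S+)\s*\z/
  schema_location = nil
  in_header = true
  body = text.lines.reject do |line|
    # Directives are only looked for before the first document marker.
    in_header = false if line.start_with?('---')
    match = in_header && directive_pattern.match(line.chomp)
    schema_location ||= match[1] if match
    match
  end
  [schema_location, body.join]
end

source = <<~DOC
  %SCHEMA gem-specification.yml
  ---
  name: rake
  version: 1
DOC

location, body = split_schema_directive(source)
data = YAML.safe_load(body)

puts location
puts data.inspect
```

A document without the directive passes through unchanged and yields a `nil` location, which keeps the schema unobtrusive for everyone who doesn't need it.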

There are a few different parts needed for this to work. I believe it could be done with the current YAML spec (and retrofitted on YAML 1.0 too), since the only real change to the document would be a new directive in the stream header. The next step is for someone to start defining a format for schema. Then, a tool would be needed that could validate against schema. This wouldn't give us all the benefits of schema, but it's a start. The final step would be to integrate schema support into existing YAML libraries, to allow validation and the use of schema for metadata information.

Actually, this solves exactly half the problem, the part I call external validation. The other part is not YAML specific, and it's something I've been thinking about for Ruby. It concerns validation of object hierarchies in the current language. Expect some more info on this in a day or a few. I want to have something usable to release. But I believe the Ducktator will be really useful for certain use cases.

3 comments:

Anonymous said...

If you want XML, use XML. It is inevitable that YAML should head down the road of adding more XML features, but we should be attempting to halt that advance. YAML's simplicity is the key feature of the language; every time you add something new, you lose a bit of that simplicity.

If you want to add schemas to YAML, you're losing simplicity to the point that you should really reconsider using XML for that particular application. YAML isn't a Victorinox, and neither is XML; they're individual tools within the set.

Ola Bini said...

You're right, to a degree. As you say, the simplicity is key. But for me, it's even more important to have ways to keep that simplicity without sacrificing data quality. I don't like that tradeoff.

I believe there is a way to do YAML schemas so that they are unobtrusive unless you need them. And when they are there, they should actually increase simplicity, since you won't need explicit type tags for most of your values.

Chris Rebert said...

It's been done, albeit in a limited fashion:
Kwalify