onsdag, januari 02, 2008

Antlr lexing problem

I should probably post this on a mailing list instead, but for now I want to document my problem here. If anyone has any good suggestions I'd appreciate it.

I'm using Antlr to lex a language. The language is fixed and has some cumbersome features. One in particular is being really annoying and giving me some trouble to handle neatly with Antlr 3.

This problem is about sorting out Identifiers. Now, to make things really, really simple, an identifier can consist of the letter "s" and the character ":" in any order, in any quantity. An identifier can also be the three operators "=", ":=" and "::=". That is the whole language. It's really easy to handle with whitespace separation and so on. But these are the requirements that give me trouble. The first three are simple baseline examples:
  • "s" should lex into "s"
  • "s:" should lex into "s:"
  • "s::::" should lex into "s::::"
  • "s:=" should lex into "s" and ":="
  • "s::=" should lex into "s:" and ":="
  • etc.
Now, the problem is obviously that any sane way of lexing this will end up eating the last colon too. I can of course use a semantic predicate to make sure this isn't allowed when the last character is a colon and the next is "=". This helps for the 4th case, but not for the 5th.

Anyone care to help? =)

9 kommentarer:

Anonym sa...

That seems to be a problem that straddles the lexical level and the syntax level. One needs a non-deterministic branch with backtracking to handle that, although I wonder at the clarity of a language that allows trailing colons in identifiers. Is a plain '=' also a valid token? If so, then the language is even more deeply flawed.

Ola Bini sa...

Sean: Yes, "=" is also a valid token. And yes, I realize that this syntax is deeply flawed. But I'm need this to be compatible with the original language source code. As I said in the blog post, I really can't change this. So I was wondering about advice to handle the situation with Antlr3.

Anonym sa...

I wish I had a good answer, but I'm unable to understand how you can tell the last two examples you gave apart from the other valid things they represent.

Does context matter? If so, can you make rules in ANTLR that include that context?

My assumption is that it doesn't, but still, in case you missed it I thought I'd throw it out there.

Anonym sa...

Like others, I'm not sure that what you describe is really lexable.
But I think I have a stupid solution that may work (depending on your needs): reverse the string. Then your string 's:=' will be reversed to '=:s' and easily lexed as '=:', 's'.

Anonym sa...

I might be misunderstanding, but shouldn't "s::=" lex into "s" and "::="?

Bill Six sa...

If I remember correctly, increasing the lookahead amount on a scanner solves these kinds of issues at the expense of a slower lexer.

Something like:

options {

The larger the k, the slower the lexer runs. But given the language you're describing, I would assume that speed isn't a critical issue :-)

But it's been a few years since I have used ANTLR, so I could be missing something.

Joshua Graham sa...

Why would your predicate not work for the 5th case?

@ Bill: ANTLR 3 has a nifty LL(*) parser lookahead to roam arbitrarily far ahead looking for a token or sequence that disambiguates a decision. Works fine as long as the rule isn't recursive and I think this one could be done.

Ola Bini sa...

@anonym1: well, I thought about that, but it doesn't work when you're lexing in a full program file.

@anonym2: no, that's one of the complications. "s::=" should become "s:" and ":=", since both of these have a kind of higher precedence than "::=".

Bill Six: well, Antlr 3 happens to b LL(x), meaning it uses as much lookahead as it needs. So in this case lookahead will not help. It would never be enough with a static lookahead in this case either, since the length of the identifier can change.

Josh: Uhm. The problem is what the next option should be. How do I specify a lexing string that says lex one character less than above?

Adrian Thurston sa...

Hi Ola, At first glance I would make two identifier patterns. One for the normal case and another for the last two with ":=" on the end then lop off two characters when it matches.