I'm using Antlr to lex a language. The language is fixed and has some cumbersome features. One in particular is being really annoying and giving me some trouble to handle neatly with Antlr 3.
This problem is about sorting out Identifiers. Now, to make things really, really simple, an identifier can consist of the letter "s" and the character ":" in any order, in any quantity. An identifier can also be the three operators "=", ":=" and "::=". That is the whole language. It's really easy to handle with whitespace separation and so on. But these are the requirements that give me trouble. The first three are simple baseline examples:
- "s" should lex into "s"
- "s:" should lex into "s:"
- "s::::" should lex into "s::::"
- "s:=" should lex into "s" and ":="
- "s::=" should lex into "s:" and ":="
- etc.
Anyone care to help? =)
9 kommentarer:
That seems to be a problem that straddles the lexical level and the syntax level. One needs a non-deterministic branch with backtracking to handle that, although I wonder at the clarity of a language that allows trailing colons in identifiers. Is a plain '=' also a valid token? If so, then the language is even more deeply flawed.
Sean: Yes, "=" is also a valid token. And yes, I realize that this syntax is deeply flawed. But I'm need this to be compatible with the original language source code. As I said in the blog post, I really can't change this. So I was wondering about advice to handle the situation with Antlr3.
I wish I had a good answer, but I'm unable to understand how you can tell the last two examples you gave apart from the other valid things they represent.
Does context matter? If so, can you make rules in ANTLR that include that context?
My assumption is that it doesn't, but still, in case you missed it I thought I'd throw it out there.
Like others, I'm not sure that what you describe is really lexable.
But I think I have a stupid solution that may work (depending on your needs): reverse the string. Then your string 's:=' will be reversed to '=:s' and easily lexed as '=:', 's'.
I might be misunderstanding, but shouldn't "s::=" lex into "s" and "::="?
If I remember correctly, increasing the lookahead amount on a scanner solves these kinds of issues at the expense of a slower lexer.
Something like:
options {
k=2;
}
The larger the k, the slower the lexer runs. But given the language you're describing, I would assume that speed isn't a critical issue :-)
But it's been a few years since I have used ANTLR, so I could be missing something.
Why would your predicate not work for the 5th case?
@ Bill: ANTLR 3 has a nifty LL(*) parser lookahead to roam arbitrarily far ahead looking for a token or sequence that disambiguates a decision. Works fine as long as the rule isn't recursive and I think this one could be done.
@anonym1: well, I thought about that, but it doesn't work when you're lexing in a full program file.
@anonym2: no, that's one of the complications. "s::=" should become "s:" and ":=", since both of these have a kind of higher precedence than "::=".
Bill Six: well, Antlr 3 happens to b LL(x), meaning it uses as much lookahead as it needs. So in this case lookahead will not help. It would never be enough with a static lookahead in this case either, since the length of the identifier can change.
Josh: Uhm. The problem is what the next option should be. How do I specify a lexing string that says lex one character less than above?
Hi Ola, At first glance I would make two identifier patterns. One for the normal case and another for the last two with ":=" on the end then lop off two characters when it matches.
Skicka en kommentar