How do I distinguish a very keyword-like token from a keyword using ANTLR?

Developer https://www.devze.com 2023-03-24 22:25 Source: Internet
I am having trouble distinguishing a keyword from a non-keyword when a grammar allows the non-keyword to have a similar "look" to the keyword.

Here's the grammar:

grammar Query;

options {
  output = AST;
  backtrack = true;
}
tokens {
  DefaultBooleanNode;
}

// Parser

startExpression : expression EOF ;

expression : withinExpression ;

withinExpression
  : defaultBooleanExpression
    (WSLASH^ NUMBER defaultBooleanExpression)*
  ;

defaultBooleanExpression
  : (queryFragment   -> queryFragment)
    (e=queryFragment -> ^(DefaultBooleanNode $defaultBooleanExpression $e))*
  ;

queryFragment : unquotedQuery ;

unquotedQuery : UNQUOTED | NUMBER ;

// Lexer

WSLASH    : ('W'|'w') '/';

NUMBER    : Digit+ ('.' Digit+)? ;

UNQUOTED : UnquotedStartChar UnquotedChar* ;

fragment UnquotedStartChar
  : EscapeSequence
  | ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
     | ':' | '"' | '/' | '(' | ')' | '[' | ']'
     | '{' | '}' | '-' | '+' | '~' | '&' | '|'
     | '!' | '^' | '?' | '*' )
  ;

fragment UnquotedChar
  : EscapeSequence
  | ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
     | ':' | '"' | '(' | ')' | '[' | ']' | '{'
     | '}' | '~' | '&' | '|' | '!' | '^' | '?'
     | '*' )
  ;

fragment EscapeSequence
  : '\\'
    ( 'u' HexDigit HexDigit HexDigit HexDigit
    | ~( 'u' )
    )
  ;

fragment Digit : ('0'..'9') ;
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F') ;

WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) { skip(); };

I have simplified it enough to get rid of the distractions but I think removing any more would remove the problem.

  • A slash is permitted in the middle of an unquoted query fragment.
  • Boolean queries in particular have no required keyword.
  • A new syntax (e.g. W/3) is being introduced but I'm trying not to affect existing queries which happen to look similar (e.g. X/Y)
  • Due to '/' being valid as part of a word, ANTLR appears to be giving me "W/3" as a single token of type UNQUOTED instead of it being a WSLASH followed by a NUMBER.
  • Due to the above, I end up with a tree like: DefaultBooleanNode(DefaultBooleanNode(~first clause~, "W/3"), ~second clause~), whereas what I really wanted was WSLASH(~first clause~, "3", ~second clause~).
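
The behaviour described in the last two bullets can be reproduced outside ANTLR. The following is only a rough sketch in plain Java, not ANTLR-generated code: the class name LongestMatchDemo and the three regular expressions are hypothetical simplifications of the lexer rules above (escape sequences and most of the excluded punctuation are left out). It assumes the lexer prefers the longest match, with the earlier rule winning a tie.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the lexer rules in the question (simplified).
class LongestMatchDemo {
    private static final String[] NAMES = {"WSLASH", "NUMBER", "UNQUOTED"};
    private static final Pattern[] RULES = {
        Pattern.compile("[Ww]/"),               // WSLASH
        Pattern.compile("[0-9]+(\\.[0-9]+)?"),  // NUMBER
        Pattern.compile("[^\\s/][^\\s]*")       // UNQUOTED: may contain '/', but not start with it
    };

    // Tries every rule at the current position and keeps the longest match.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is skipped, as in the WHITESPACE rule
                continue;
            }
            int bestLen = 0;
            String bestName = null;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Strict '>' means the earlier rule wins when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = NAMES[i];
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [UNQUOTED:X/Y, UNQUOTED:W/3]
        System.out.println(tokenize("X/Y W/3"));
    }
}
```

Under these assumptions "W/3" comes out as one UNQUOTED token, because the UNQUOTED pattern matches three characters while WSLASH matches only two.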

What I would like to do is somehow write the UNQUOTED rule as "what I have now, but not matching ~~~~", but I'm at a loss for how to do that.

I realise that I could spell it out in full, e.g.:

  • Any character from UnquotedStartChar except 'w', followed by the rest of the rule
  • 'w' followed by any character from UnquotedChar except '/', followed by the rest of the rule
  • 'w/' followed by any character from UnquotedChar except digits
  • ...

However, that would look awful. :)


When a lexer generated by ANTLR "sees" that certain input can be matched by more than one rule, it chooses the longest match. If you want a shorter match to take precedence, you'll need to merge all the similar rules into one and then check with a gated semantic predicate whether the shorter match is ahead or not. If the shorter match is ahead, you change the type of the token.
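
Before the ANTLR version, the idea can be illustrated in plain Java. This is a minimal sketch only: KeywordLookaheadDemo and its methods are hypothetical names, and the fallback rule is simplified to "everything up to the next whitespace", which is much looser than the UNQUOTED rule in the grammar. The point is the ordering: look ahead for the keyword first, and fall through to the general rule only when it is absent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written tokenizer illustrating "check the short match first".
class KeywordLookaheadDemo {
    // Plays the role of the predicate: is `text` the next thing in the input?
    static boolean ahead(String input, int pos, String text) {
        return input.startsWith(text, pos);
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            // The keyword check runs before the general rule gets a chance.
            if (ahead(input, pos, "W/") || ahead(input, pos, "w/")) {
                tokens.add("WSLASH:" + input.substring(pos, pos + 2));
                pos += 2;
                continue;
            }
            // Simplified fallback (assumption): consume up to the next whitespace.
            int end = pos;
            while (end < input.length() && !Character.isWhitespace(input.charAt(end))) {
                end++;
            }
            String text = input.substring(pos, end);
            String type = text.matches("[0-9]+(\\.[0-9]+)?") ? "NUMBER" : "UNQUOTED";
            tokens.add(type + ":" + text);
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [UNQUOTED:P/3, WSLASH:W/, NUMBER:3]
        System.out.println(tokenize("P/3 W/3"));
    }
}
```

Note that in this sketch the check fires for any token that begins with "W/" or "w/", so an input such as "w/o" is also split into a WSLASH and a following token. If that matters, the lookahead could additionally require a digit after the slash.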

A demo:

Query.g

grammar Query;

tokens {
  WSlash;
}

@lexer::members {
  private boolean ahead(String text) {
    for(int i = 0; i < text.length(); i++) {
      if(input.LA(i + 1) != text.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}

parse
  :  (t=. {System.out.printf("\%-10s \%s\n", tokenNames[$t.type], $t.text);} )* EOF
  ;

NUMBER
  :  Digit+ ('.' Digit+)? 
  ;

UNQUOTED 
  :  {ahead("W/")}?=> 'W/' { $type=WSlash; /* change the type of the token */ }
  |  {ahead("w/")}?=> 'w/' { $type=WSlash; /* change the type of the token */ }
  |  UnquotedStartChar UnquotedChar* 
  ;

fragment UnquotedStartChar
  :  EscapeSequence
  |  ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
      | ':' | '"' | '/' | '(' | ')' | '[' | ']'
      | '{' | '}' | '-' | '+' | '~' | '&' | '|'
      | '!' | '^' | '?' | '*' )
  ;

fragment UnquotedChar
  : EscapeSequence
  | ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
     | ':' | '"' | '(' | ')' | '[' | ']' | '{'
     | '}' | '~' | '&' | '|' | '!' | '^' | '?'
     | '*' )
  ;

fragment EscapeSequence
  :  '\\'
     ( 'u' HexDigit HexDigit HexDigit HexDigit
     | ~'u'
     )
  ;

fragment Digit    : '0'..'9';
fragment HexDigit : '0'..'9' | 'a'..'f' | 'A'..'F';

WHITESPACE : (' ' | '\r' | '\t' | '\u000C' | '\n') { skip(); };

Main.java

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    QueryLexer lexer = new QueryLexer(new ANTLRStringStream("P/3 W/3"));
    QueryParser parser = new QueryParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

To run the demo on *nix/MacOS:

java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

or on Windows:

java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main

which will print the following:

UNQUOTED   P/3
WSlash     W/
NUMBER     3

EDIT

To eliminate the warning when using the WSlash token in a parser rule, simply add an empty fragment rule to your grammar:

 fragment WSlash : /* empty */ ;

It's a bit of a hack, but that's how it's done. No more warnings.
