Scanner (Lexing keywords with ANTLR)

I have been working on writing a scanner for my program, and most of the tutorials online include a parser along with the scanner. It doesn't seem possible to write a lexer without writing a parser at the same time. I am only trying to generate tokens, not interpret them. I want to recognize INT tokens, FLOAT tokens, and some keywords like "begin" and "end".

I am confused about how to match keywords. I unsuccessfully tried the following:

KEYWORD : KEY1 | KEY2;

KEY1 : {input.LT(1).getText().equals("BEGIN")}? LETTER+ ;
KEY2 : {input.LT(1).getText().equals("END")}? LETTER+ ;

FLOATLITERAL_INTLITERAL
  : DIGIT+ 
  ( 
    { input.LA(2) != '.' }? => '.' DIGIT* { $type = FLOATLITERAL; }
    | { $type = INTLITERAL; }
  )
  | '.'  DIGIT+ {$type = FLOATLITERAL}
;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

IDENTIFIER 
 : LETTER 
   | LETTER DIGIT (LETTER|DIGIT)+ 
   | LETTER LETTER (LETTER|DIGIT)*
 ;

WS  //Whitespace
  : (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
;  


If you only want a lexer, start your grammar with:

lexer grammar FooLexer; // creates: FooLexer.java

LT(int), which returns a Token, can only be used inside parser rules (on a TokenStream). Inside lexer rules, you can only use LA(int), which returns the next int (character) from the IntStream. But there is no need for all that manual lookahead. Just do something like this:

lexer grammar FooLexer;

BEGIN
  :  'BEGIN'
  ;

END
  :  'END'
  ;

FLOAT
  :  DIGIT+ '.' DIGIT+
  ;

INT
  :  DIGIT+
  ;

IDENTIFIER 
  :  LETTER (LETTER | DIGIT)*
  ;

WS
  :  (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
  ; 

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

I don't see the need to create a token called KEYWORD that matches all keywords: you'll want to make a distinction between a BEGIN and END token, right? But if you really want this, simply do:

KEYWORD
  :  'BEGIN'
  |  'END'
  ;

and remove the BEGIN and END rules. Just make sure KEYWORD is defined before IDENTIFIER.
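
For example, a minimal sketch of that ordering, reusing the fragments from above. Because ANTLR resolves equal-length matches in favor of the rule defined first, listing KEYWORD before IDENTIFIER ensures "BEGIN" and "END" are not swallowed by IDENTIFIER:

lexer grammar FooLexer;

// KEYWORD is defined first, so input like "BEGIN" becomes a KEYWORD token.
// If IDENTIFIER were listed first, it would match "BEGIN" instead.
KEYWORD
  :  'BEGIN'
  |  'END'
  ;

IDENTIFIER
  :  LETTER (LETTER | DIGIT)*
  ;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');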

EDIT

Test the lexer with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "BEGIN END 3.14159 42 FOO";
    FooLexer lexer = new FooLexer(new ANTLRStringStream(src));
    while(true) {
      Token token = lexer.nextToken();
      if(token.getType() == FooLexer.EOF) {
        break;
      }
      System.out.println(token.getType() + " :: " + token.getText());
    }
  }
}

If you generate the lexer, compile the .java source files, and run the Main class like this:

java -cp antlr-3.3.jar org.antlr.Tool FooLexer.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

the following output will be printed to the console:

4 :: BEGIN
11 ::  
5 :: END
11 ::  
7 :: 3.14159
11 ::  
8 :: 42
11 ::  
10 :: FOO


[From a guy who made a custom lexer tool and is still trying to learn ANTLR]

Boring extensive answer:

You are right. Many books and courses mix both tools, and sometimes "generating/detecting tokens" and "interpreting tokens" get mixed up as well.

Sometimes a developer sets out to write a scanner and still mixes scanning and parsing in their mind ;-)

Usually, when detecting tokens, you also have to perform an action ("interpretation"), even one as simple as printing a message or the matched token's text. Example: { cout << "Hey, I found an integer constant" << "\n" }
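
In ANTLR terms, such an action is just embedded Java attached to a lexer rule. A minimal sketch, where the rule name and message are only illustrative:

// Hypothetical lexer rule: print a message each time an integer constant is scanned.
INTLITERAL
  :  ('0'..'9')+  { System.out.println("Hey, I found an integer constant: " + getText()); }
  ;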

There are also several cases that can make scanning difficult for a beginner.

One case is that the same text may be used for different tokens.

Example:

"-" as the substraction binary operator, and "-" as the negative prefix operator. Or, treating 5 both as an integer and a float. In scanners, "-" can be seen as the same token, while in parsers, you may treat it as different tokens.

To handle this, my favorite approach is to use "generic tokens" in the scanning/lexing phase and later convert them into "custom tokens" in the parsing/syntax phase.
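
A rough sketch of that idea in an ANTLR combined grammar (the rule names here are made up for illustration): the lexer emits a single generic MINUS token, and the parser interprets it as binary subtraction or unary negation depending on context:

// Lexer: one generic token for "-"; the lexer does not decide how it is used.
MINUS : '-' ;
INT   : ('0'..'9')+ ;

// Parser: the same MINUS token is interpreted differently by context.
expr
  :  term (MINUS term)*   // binary subtraction
  ;

term
  :  MINUS term           // unary negation (prefix "-")
  |  INT
  ;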

Quick answer:

As mentioned in previous answers, start by making a grammar. In fact, I suggest trying it on a whiteboard or in a notebook first, and later in your favorite scanning tool (ANTLR or another).

Consider those special cases where tokens could overlap.

Good Luck.
