BNF: input going to wrong nonterminal_问答_开发者

I'm developing a BNF for chess algebraic notation and ran into an interesting case, input going to the wrong non-terminal.

My start BNF rule is as follows (note that this intentionally doesn't include castling or notes):

algebraic_notation : piece start_position capture end_position promotion

piece, start_position, capture, and promotion can be empty, thus allowing for a move like 'd4'. The problem is that when such a move is entered, the input ('d4') is taken by start_position, thus resulting in an error b/c there is no more input for end_position, which cannot be empty.

The obvious hack/workaround is to allow end_position to be empty and then check to see if we got any input for it and act accordingly.

This does work, but I would like to know if there is a way to deal with this. Is it possible for input not to go to the first matching symbol if it causes the entire expression not to match?

Another question is whether this is standard behaviour for BNF, or a problem with the yaccer I'm using: PLY v 3.3.

Tried using flex/bison and got same thing. So it appears its not specific to PLY.

Here are all the relevant rules for completeness:

algebraic_notation : piece start_pos开发者_高级运维ition capture end_position promotion

piece : KING
        | QUEEN
        | BISHOP
        | KNIGHT
        | ROOK
        | pawn

pawn : empty

start_position : FILE
                | NUMBER
                | FILE NUMBER
                | empty

end_position : FILE NUMBER
                | empty                 // this line is the hack/workaround

capture : CAPTURE
        | empty

promotion : EQUAL QUEEN
            | EQUAL ROOK
            | EQUAL KNIGHT
            | EQUAL BISHOP
            | empty

empty :

The problem is that you're ignoring the shift/reduce conflict you get from your parser generator. While yacc/bison (and presumably PLY) will resolve errors for you, that resolution might not be doing what you want, and might result in a parser that parses a langauge other than the one you are trying to parse.

Whenever you get a shift/reduce (or reduce/reduce) conflict from an LR parser generator, you really need to understand what the conflict is (and why it occurs) to know whether you can ignore it or whether you need to fix it. So lets fix your grammar by getting rid of the 'hack' (which is clearly wrong and not something you want to parse), as well as the useless 'empty' rule (which just confuses things):

%token FILE NUMBER
%%
algebraic_notation : piece start_position capture end_position promotion
piece : 'K' | 'Q' | 'B' | 'N' | 'R' | /*pawn*/
start_position : FILE | NUMBER | FILE NUMBER | /*empty*/
end_position : FILE NUMBER
capture : 'x' | /*empty*/
promotion : '=' 'Q' | '=' 'R' | '=' 'N' | '=' 'B' | /*empty*/

Now when you run this through 'bison -v' (ALWAYS use -v to get the verbose output file -- I'm not sure what PLY's equivalent is), you get the message about a shift/reduce conflict, and if you look in the .output file you can see what it is:

state 7

    1 algebraic_notation: piece . start_position capture end_position promotion

    FILE    shift, and go to state 9
    NUMBER  shift, and go to state 10

    FILE      [reduce using rule 11 (start_position)]
    $default  reduce using rule 11 (start_position)

    start_position  go to state 11

This is telling you that after seeing a piece, when the next token is FILE, it doesn't know whether it should shift (treating the FILE as (part of) the start_position) or reduce (giving an empty start_position). That's because it needs more lookahead to see if there's a second position to use as an end_position to know what to do, so simply ignoring the conflict will result in a parser that fails to parse lots of valid things (basically, anything with an empty start_position and capture).

The best way to solve a lookahead-related shift-reduce conflict involving an empty production like this (or pretty much any conflict involving an empty production, really) is to unfactor the grammar -- get rid of the empty rule and duplicate any rule that uses the non-terminal both with and without it. In your case, this gives you the rules:

algebraic_notation : piece capture end_position promotion
algebraic_notation : piece start_position capture end_position promotion
start_position : FILE | NUMBER | FILE NUMBER

(the other rules are unchanged) With that you still have a shift-reduce conflict:

state 7

    1 algebraic_notation: piece . capture end_position promotion
    2                   | piece . start_position capture end_position promotion

    FILE    shift, and go to state 9
    NUMBER  shift, and go to state 10
    'x'     shift, and go to state 11

    FILE  [reduce using rule 14 (capture)]

    start_position  go to state 12
    capture         go to state 13

Basically, we've just moved the conflict one step and now have the problem with the empty capture rule. So we unfactor that as well:

algebraic_notation : piece end_position promotion
algebraic_notation : piece capture end_position promotion
algebraic_notation : piece start_position end_position promotion
algebraic_notation : piece start_position capture end_position promotion
capture : 'x'

and now bison reports no more conflicts, so we can be reasonably confident it will parse the way we want. You can simplify it a bit more by getting rid of the capture rule and using a literal 'x' in the algebraic_notation rule. I personally prefer this, as I think it is clearer to avoid the unnecessary indirection:

%token FILE NUMBER
%%
algebraic_notation : piece end_position promotion
algebraic_notation : piece 'x' end_position promotion
algebraic_notation : piece start_position end_position promotion
algebraic_notation : piece start_position 'x' end_position promotion
piece : 'K' | 'Q' | 'B' | 'N' | 'R' | /*pawn*/
start_position : FILE | NUMBER | FILE NUMBER
end_position : FILE NUMBER
promotion : '=' 'Q' | '=' 'R' | '=' 'N' | '=' 'B' | /*empty*/

I haven't used PLY, and without seeing the full flex/bison files you tried I might be picking on a non-issue, but it seems to me you aren't giving the parser an idea that no more is coming for the current algebraic_notation rule. You don't say how you know the input 'd4' was matched to start_position, but if the parser knew it had all tokens for the rule, and the only non-empty token is end_position, it would have to match 'd4' to that.

What about introducing a token that marks the end of a line, like EOL. So your first rule becomes:

algebraic_notation : piece start_position capture end_position promotion EOL

and the parser now sees the token 'd4' followed by EOL -- does that change the behavior?

What happens if you wrap start_position capture end_position into a middle block, and remove FILE NUMBER from start_pos, to something like this:

middle: start_pos capture end_pos
      | end_pos capture end_pos
      | capture end_pos

start_pos : FILE
      | NUMBER
      | empty

end_pos : FILE NUMBER

capture : CAPTURE
      | empty

This question is a good illustration of a problem is computer science theory, the removal of epsilon (or empty) productions from a grammar. The problems with the ambiguity of the chess notation can be resolved (for yacc or PLY) by simplifying the grammar to remove the empty productions. There is much material on this on both SO/SE and on other sites. I append a bibliography for the interested reader.

By performing a mindless transformation of the rules to remove blind/empty/epsilon productions we get the following CFG:

algebraic_notation : piece start_position capture end_position promotion
                   | piece start_position capture end_position 
                   | piece start_position capture promotion
                   | piece start_position end_position promotion
                   | piece capture end_position promotion
                   | piece start_position capture
                   | piece start_position end_position
                   | piece capture end_position
                   | piece start_position promotion
                   | piece capture promotion
                   | piece end_position promotion
                   | piece promotion
                   | piece end_position 
                   | piece capture 
                   | piece start_position 
                   | piece 
                   | start_position capture end_position promotion
                   | start_position capture end_position
                   | start_position capture promotion
                   | start_position end_position promotion
                   | capture end_position promotion
                   | start_position capture
                   | start_position end_position
                   | capture end_position
                   | end_position promotion
                   | start_position 
                   | capture 
                   | end_position
                   | promotion
piece : KING
        | QUEEN
        | BISHOP
        | KNIGHT
        | ROOK

start_position : FILE
                | NUMBER
                | FILE NUMBER

end_position : FILE NUMBER

capture : CAPTURE

promotion : EQUAL QUEEN
            | EQUAL ROOK
            | EQUAL KNIGHT
            | EQUAL BISHOP

(This could probably be simplified by removing those combinations that could not occur in chess notation, but that is an exercise for the reader).

Bibliography

The best books for this are probably:

Hopcroft & Ullman Introduction to Automata Theory, Languages, and Computation
Aho & Ullman The Theory of Parsing, Translation, and Compiling

Or Just go to the slides from Jeff Ullman's class:

Normal Forms for CFG's

Or a bunch of related questions on SO/SE:

https://math.stackexchange.com/questions/563363/removing-epsilon-productions-from-a-context-free-grammar
Removing epsilon production from context-free grammar
Converting grammar to Chomsky Normal Form?