Python: Trouble with YACC

I'm using PLY to parse sentences like:

"CS 2310 or equivalent experience"

The desired output:

[[("CS", 2310)], ["equivalent experience"]]

YACC tokenizer symbols:

tokens = [
    'DEPT_CODE',
    'COURSE_NUMBER',
    'OR_CONJ',
    'MISC_TEXT',
]

t_DEPT_CODE = r'[A-Z]{2,}'
t_COURSE_NUMBER  = r'[0-9]{4}'

t_OR_CONJ = r'or'

t_ignore = ' \t'

terms = {'DEPT_CODE': t_DEPT_CODE,
         'COURSE_NUMBER': t_COURSE_NUMBER,
         'OR_CONJ': t_OR_CONJ}

for name, regex in terms.items():
    terms[name] = "^%s$" % regex

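def t_MISC_TEXT(t):
    r'\S+'
    # Reclassify the word if it fully matches one of the specific patterns
    # (this needs "import re" at the top of the module).
    for name, regex in terms.items():
        if re.match(regex, t.value):
            t.type = name
            return t

    # Nothing more specific matched, so the token stays MISC_TEXT.
    return t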

(MISC_TEXT is meant to match anything not caught by the other terms.)

Some relevant rules from the parser:

precedence = (
    ('left', 'MISC_TEXT'),
)


def p_statement_course_data(p):
    'statement : course_data'
    p[0] = p[1]

def p_course_data(p):
    'course_data : course'
    p[0] = p[1]


def p_course(p):
    'course : DEPT_CODE COURSE_NUMBER'
    p[0] = make_course(p[1], int(p[2]))


def p_or_phrase(p):
    'or_phrase : statement OR_CONJ statement'
    p[0] = [[p[1]], [p[3]]] 


def p_misc_text(p):
    '''text_aggregate : MISC_TEXT MISC_TEXT
                      | MISC_TEXT text_aggregate
                      | text_aggregate MISC_TEXT '''
    p[0] = "%s %s" % (p[1], p[2])

def p_text_aggregate_statement(p):
    'statement : text_aggregate'
    p[0] = p[1]

Unfortunately, this fails:

# works as it should
>>> token_list("CS 2110 or equivalent experience")
[LexToken(DEPT_CODE,'CS',1,0), LexToken(COURSE_NUMBER,'2110',1,3), LexToken(OR_CONJ,'or',1,8), LexToken(MISC_TEXT,'equivalent',1,11), LexToken(MISC_TEXT,'experience',1,22)]

# fails. bummer.
>>> parser.parse("CS 2110 or equivalent experience")
Syntax error in input: LexToken(MISC_TEXT,'equivalent',1,11)

What am I doing wrong? I don't fully understand how to set precedence rules.

Also, this is my error function:

def p_error(p):
    print("Syntax error in input: %s" % p)

Is there a way to see which rule the parser was trying when it failed? Or some other way to make the parser print which rules it's trying?

UPDATE: token_list() is just a helper function:

def token_list(string):
    lexer.input(string)
    result = []
    for tok in lexer:
        result.append(tok)
    return result

UPDATE 2: Here is the parsing that I want to happen:

Symbol Stack                                Input Tokens                                                Action
                                            DEPT_CODE COURSE_NUMBER OR_CONJ MISC_TEXT MISC_TEXT
DEPT_CODE                                   COURSE_NUMBER OR_CONJ MISC_TEXT MISC_TEXT                   Shift DEPT_CODE
DEPT_CODE COURSE_NUMBER                     OR_CONJ MISC_TEXT MISC_TEXT                                 Shift COURSE_NUMBER
course                                      OR_CONJ MISC_TEXT MISC_TEXT                                 Reduce course : DEPT_CODE COURSE_NUMBER
course_data                                 OR_CONJ MISC_TEXT MISC_TEXT                                 Reduce course_data : course
statement                                   OR_CONJ MISC_TEXT MISC_TEXT                                 Reduce statement : course_data
statement OR_CONJ                           MISC_TEXT MISC_TEXT                                         Shift OR_CONJ

statement OR_CONJ MISC_TEXT                 MISC_TEXT                                                   Shift MISC_TEXT
statement OR_CONJ text_aggregate            MISC_TEXT                                                   Reduce text_aggregate : MISC_TEXT
statement OR_CONJ text_aggregate MISC_TEXT                                                              Shift MISC_TEXT
statement OR_CONJ text_aggregate                                                                        Reduce text_aggregate : text_aggregate MISC_TEXT

statement OR_CONJ statement                                                                             Reduce statement : text_aggregate
or_phrase                                                                                               Reduce or_phrase : statement OR_CONJ statement
statement                                                                                               Reduce statement : or_phrase

I added this parsing action:

def p_misc_text_singleton(p):
    'text_aggregate : MISC_TEXT'
    p[0] = p[1]

When I try to build the parser, I get this output:

Generating LALR tables
WARNING: 2 shift/reduce conflicts
WARNING: 3 reduce/reduce conflicts
WARNING: reduce/reduce conflict in state 8 resolved using rule (text_aggregate -> MISC_TEXT MISC_TEXT)
WARNING: rejected rule (text_aggregate -> MISC_TEXT) in state 8

Parsing still fails on a syntax error, as above.

I can't reproduce your error; instead, I get a syntax error on "or". You did not include a rule that uses or_phrase (for example, statement : or_phrase). When I add one, I get no errors.

I don't think it's a precedence issue. It would help to set up logging so you can see the steps PLY takes and compare them with what you want to happen. To do this, pass debug=1 to the parse function (you might also have to pass it to yacc). Look at PLY's yacc.py if you can't get the debugging working.
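
Here is a rough sketch of what that can look like, assuming PLY 3.x, where parse() accepts either a truthy flag or a logging-style object for debug. The tiny grammar is only a stand-in with two text_aggregate rules, and the logger name ply_trace is made up for illustration.

```python
import io
import logging

import ply.lex as lex
import ply.yacc as yacc

tokens = ['MISC_TEXT']

t_MISC_TEXT = r'\S+'
t_ignore = ' \t'


def t_error(t):
    # Skip any character the lexer cannot handle.
    t.lexer.skip(1)


def p_text_aggregate_single(p):
    'text_aggregate : MISC_TEXT'
    p[0] = p[1]


def p_text_aggregate_append(p):
    'text_aggregate : text_aggregate MISC_TEXT'
    p[0] = "%s %s" % (p[1], p[2])


def p_error(p):
    print("Syntax error in input: %s" % p)


# Collect the trace in memory; a FileHandler or plain stderr works as well.
# "ply_trace" is a hypothetical logger name chosen for this sketch.
trace = io.StringIO()
log = logging.getLogger("ply_trace")
log.setLevel(logging.DEBUG)
log.addHandler(logging.StreamHandler(trace))

lexer = lex.lex()
# Passing debug=True here instead would also write parser.out, which lists
# the grammar, the parser states, and any conflicts.
parser = yacc.yacc(debug=False, write_tables=False)

# Passing debug=True here would send the same trace to stderr.
result = parser.parse("equivalent experience", lexer=lexer, debug=log)
print(trace.getvalue())
```

The trace shows the state, the symbol stack, and each shift or reduce as it happens, so you can line it up against the table of actions you expected.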

The reduce/reduce conflict arises because the grammar is ambiguous: the parser cannot tell whether MISC_TEXT MISC_TEXT should be reduced to text_aggregate in one step, or whether a single MISC_TEXT should be reduced to text_aggregate first and then combined with the other MISC_TEXT.
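
To make the ambiguity concrete, here is a small sketch that uses only plain Python, not PLY. It counts how many distinct parse trees exist for a run of n MISC_TEXT tokens. I am assuming the four text_aggregate productions from the question (including the singleton added in UPDATE 2) and comparing them with a purely left-recursive pair; the names AMBIGUOUS, LEFT_RECURSIVE and count_parses are mine.

```python
from functools import lru_cache

# "M" stands for the MISC_TEXT token, "T" for the text_aggregate nonterminal.
AMBIGUOUS = [("M",), ("M", "M"), ("M", "T"), ("T", "M")]
LEFT_RECURSIVE = [("M",), ("T", "M")]


def count_parses(productions, n):
    """Count parse trees deriving exactly n MISC_TEXT tokens from text_aggregate.

    Assumes each production has at most one "T" on its right-hand side,
    which holds for both grammars above.
    """
    @lru_cache(maxsize=None)
    def trees(length):
        total = 0
        for rhs in productions:
            remaining = length - rhs.count("M")
            if "T" in rhs:
                # The nested text_aggregate must cover the remaining tokens.
                if remaining >= 1:
                    total += trees(remaining)
            elif remaining == 0:
                # Token-only production that fits the input exactly.
                total += 1
        return total

    return trees(n)


for n in (1, 2, 3, 4):
    print(n, count_parses(AMBIGUOUS, n), count_parses(LEFT_RECURSIVE, n))
```

For two words the four-production grammar already admits three different trees, while the left-recursive version admits exactly one for any length. An LALR parser needs a single choice at every step, which is why PLY reports conflicts for the first grammar.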

Without being able to reproduce the problem, my best guess at what would fix your error is to change the p_misc_text rule to:

'''text_aggregate : MISC_TEXT | text_aggregate MISC_TEXT'''

I think you can also delete the precedence tuple.
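
Putting the pieces together, here is a minimal sketch of a grammar that parses the example sentence, assuming PLY 3.x. Several things in it are my assumptions rather than your code: make_course is not shown in the question, so a plain tuple stands in for it; the statement : or_phrase alternative is the rule I had to add; WORD_TYPES is a made-up name for your terms table; and the single 'left' entry for OR_CONJ is my own addition, there only to settle how chained "or" phrases group. The MISC_TEXT precedence entry is gone.

```python
import re

import ply.lex as lex
import ply.yacc as yacc

tokens = ['DEPT_CODE', 'COURSE_NUMBER', 'OR_CONJ', 'MISC_TEXT']

t_ignore = ' \t'

# Whole-word patterns used to reclassify a generic word (hypothetical name).
WORD_TYPES = [
    ('DEPT_CODE', r'[A-Z]{2,}'),
    ('COURSE_NUMBER', r'[0-9]{4}'),
    ('OR_CONJ', r'or'),
]


def t_MISC_TEXT(t):
    r'\S+'
    for name, regex in WORD_TYPES:
        if re.fullmatch(regex, t.value):
            t.type = name
            break
    return t


def t_error(t):
    t.lexer.skip(1)


# Assumption: makes "a or b or c" group to the left without a conflict warning.
precedence = (
    ('left', 'OR_CONJ'),
)


def p_statement(p):
    '''statement : course
                 | text_aggregate
                 | or_phrase'''
    p[0] = p[1]


def p_course(p):
    'course : DEPT_CODE COURSE_NUMBER'
    p[0] = (p[1], int(p[2]))  # stand-in for make_course()


def p_or_phrase(p):
    'or_phrase : statement OR_CONJ statement'
    p[0] = [[p[1]], [p[3]]]


def p_text_aggregate_single(p):
    'text_aggregate : MISC_TEXT'
    p[0] = p[1]


def p_text_aggregate_append(p):
    'text_aggregate : text_aggregate MISC_TEXT'
    p[0] = "%s %s" % (p[1], p[2])


def p_error(p):
    print("Syntax error in input: %s" % p)


lexer = lex.lex()
parser = yacc.yacc(debug=False, write_tables=False)
result = parser.parse("CS 2110 or equivalent experience", lexer=lexer)
print(result)
```

I split the two text_aggregate alternatives into separate functions because they have different lengths: with a single shared action, p[2] does not exist when only one MISC_TEXT was matched.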