Regular expressions -- how to allow non-adjacent alternatives? [closed]

开发者 https://www.devze.com 2023-04-12 02:46 Source: the web
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 11 years ago.

In a translation-testing app (in Python) I want a regular expression that will accept either of these two strings:

a = "I want the red book"
b = "the book which I want is red"

So far I'm using something like this:

^(the book which )*I want (is |the )red (book)*$

This will accept both string a and string b. But it will also accept a string without either of the two optional sub-strings:

sub1 = (the book which )
sub2 = (book)

How can I indicate that one of these two substrings must be present, even though they're not adjacent?

I realize that in this example it would be trivially easy to avoid the problem by just testing for longer alternatives separated by "or" |. This is a simplified example of a problem that is harder to avoid with the actual user input I'm working with.


How can I indicate that one of these two substrings must be present, even though they're not adjacent?

I am assuming that is the core question you have.

The solution is two regexes. Why people feel that, once they say import re, the regex has to be a single line is just beyond me.

First test for the first substring in one regex, then test for the other substring with another regex. Logically combine those two results.
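A minimal sketch of that idea, assuming one permissive pattern checks the overall sentence and two small patterns check for each optional substring. The names FRAME, HAS_PREFIX, HAS_SUFFIX and accepts are made up for illustration, and combining the two results with "or" (rather than exclusive-or) is an assumption about what should be accepted.

```python
import re

# Permissive frame: both substrings are optional here.
FRAME = re.compile(r"^(?:the book which )?I want (?:is |the )red(?: book)?$")
# One small regex per optional substring.
HAS_PREFIX = re.compile(r"^the book which ")
HAS_SUFFIX = re.compile(r" book$")


def accepts(sentence):
    """Return True if the frame matches and at least one substring is present."""
    if not FRAME.search(sentence):
        return False
    has_prefix = HAS_PREFIX.search(sentence) is not None
    has_suffix = HAS_SUFFIX.search(sentence) is not None
    # Logically combine the two results; use `has_prefix != has_suffix`
    # instead if exactly one of the substrings should be allowed.
    return has_prefix or has_suffix


for s in ("I want the red book",
          "the book which I want is red",
          "I want the red"):
    print(s, "->", accepts(s))
```

Keeping the checks separate means each pattern stays readable, and the rule for combining them is ordinary Python rather than regex syntax.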


This looks like a problem that might be better solved with a difflib.SequenceMatcher than with regular expressions.
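As a rough sketch of what the difflib route could look like, the snippet below scores a candidate translation against a list of accepted sentences with SequenceMatcher.ratio(). The ACCEPTED list and the best_match helper are hypothetical names used only for illustration; how high a score has to be before an answer counts as correct is left open.

```python
import difflib

# Accepted translations, taken from the question.
ACCEPTED = [
    "I want the red book",
    "the book which I want is red",
]


def best_match(candidate, references=ACCEPTED):
    """Return (closest reference, similarity ratio in [0, 1])."""
    scored = [
        (difflib.SequenceMatcher(None, candidate, ref).ratio(), ref)
        for ref in references
    ]
    score, ref = max(scored)
    return ref, score


for s in ("I want the red book", "I want the red books"):
    print(s, "->", best_match(s))
```

An exact answer scores 1.0, and near misses score slightly lower, so the app could accept anything above a threshold chosen for the data at hand.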

However, a regular expression that works for the specific example in the original question is as follows:

^(the book which )*I want (is |the )red((?(1)(?: book)*| book))$

This will fail for the string "I want the red" (which lacks both of the required substrings "the book which " and " book"). It uses the (?(id/name)yes-pattern|no-pattern) syntax, which selects between two alternatives depending on whether a previously defined group matched.
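The same conditional can also refer to the group by name rather than by number. The sketch below is a variant of the pattern above, not the exact regex from this answer: the group name prefix and the helper is_accepted are chosen here for illustration, and the optional groups use ? instead of *.

```python
import re

PATTERN = re.compile(
    r"^(?P<prefix>the book which )?"   # optional leading substring
    r"I want (?:is |the )red"
    # If `prefix` matched, " book" is optional; otherwise it is required.
    r"(?(prefix)(?: book)?| book)$"
)


def is_accepted(sentence):
    """Return True if the sentence matches the conditional pattern."""
    return PATTERN.match(sentence) is not None


for s in ("I want the red book",
          "the book which I want is red",
          "I want the red",
          "the book which I want is red book"):
    print(s, "->", is_accepted(s))
```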


import re

# Yes-branch is empty: once the prefix has matched, a trailing " book" is rejected.
regx1 = re.compile('^(the book which )*I want (is |the )red'
                   '((?(1)|(?: book)))$')

# Yes-branch is "(?: book)*": once the prefix has matched, " book" is optional.
regx2 = re.compile('^(the book which )*I want (is |the )red'
                   '((?(1)(?: book)*|(?: book)))$')

for x in ("I want the red book",
          "the book which I want is red",
          "I want the red",
          "the book which I want is red book"):
    print(x)
    for regx in (regx1, regx2):
        m = regx.search(x)  # search once and reuse the match object
        print(m.groups() if m else 'No match')
    print()

result

I want the red book
(None, 'the ', ' book')
(None, 'the ', ' book')

the book which I want is red
('the book which ', 'is ', '')
('the book which ', 'is ', '')

I want the red
No match
No match

the book which I want is red book
No match
('the book which ', 'is ', ' book')

edit

Your regex pattern

^(the book which )*I want (is |the )red (book)*$

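doesn't match all of the sentences, because the space after "red" sits outside the optional group and is therefore mandatory: "the book which I want is red" ends right after "red", so it fails.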

It must be

'^(the book which )*I want (is |the )red( book)*$'
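A quick check of the two patterns side by side, as a sketch; the variable names original and corrected and the helper matches are only illustrative. It also shows that the corrected pattern still accepts "I want the red", which is the case the question wants to rule out.

```python
import re

# Pattern from the question: the space after "red" is mandatory.
original = re.compile(r"^(the book which )*I want (is |the )red (book)*$")
# Pattern with the space moved inside the optional group.
corrected = re.compile(r"^(the book which )*I want (is |the )red( book)*$")


def matches(pattern, sentence):
    """Return True if the compiled pattern matches the sentence."""
    return pattern.search(sentence) is not None


for s in ("I want the red book",
          "the book which I want is red",
          "I want the red"):
    print(repr(s), matches(original, s), matches(corrected, s))
```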