Need python Regex for handling sub-string

Developer https://www.devze.com 2023-04-09 22:20 Source: Internet
I want to check whether a string (the product name) contains the word beta. I am not very good at writing regexes, so here are some examples:

"Crome beta"
"Crome_beta"
"Crome beta2"
"Crome_betaversion"
"Crome 3beta"
"CromeBeta2.3"
"Beta Crome 4"

This is so that I can raise an error saying that it is not a valid product name but a product version. I wrote a regex that catches the above strings:

import logging
import re

parse_beta = re.compile("(beta)", re.I)
if parse_beta.search(product_name):
    logging.error('Invalid product name')

But if the product name contains a word that merely has beta as a substring, like "tibetans product", the above regex still matches beta and raises the error. I want to handle this case. Can anyone suggest a regex?

Thanks a lot.


Try ((?<![a-z])beta|cromebeta) with the re.I flag from your code. It matches the word beta when it is not preceded by a letter, or the full word cromebeta.

I'll add a quote from http://docs.python.org/library/re.html to explain the first part.

(?<!...) Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
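
As a quick check, here is a minimal sketch that runs this pattern against the names from the question. The helper name is_beta_name is made up for illustration, and re.I is assumed to stay on as in the question's code (so [a-z] in the lookbehind also covers uppercase letters).

```python
import re

# Pattern from the answer above, compiled case-insensitively as in the question.
parse_beta = re.compile(r"((?<![a-z])beta|cromebeta)", re.I)


def is_beta_name(product_name):
    """Return True if the name looks like a beta version (hypothetical helper)."""
    return parse_beta.search(product_name) is not None


names = [
    "Crome beta", "Crome_beta", "Crome beta2", "Crome_betaversion",
    "Crome 3beta", "CromeBeta2.3", "Beta Crome 4", "tibetans product",
]
for name in names:
    print(name, "->", is_beta_name(name))
```

Only "tibetans product" comes back False here, because its beta is preceded by the letter i and the name does not contain cromebeta.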


We should cover all the cases of beta version names, where the regexp should give a match.

So we start writing the pattern with the first example of beta "Crome beta":

' [Bb]eta'

We use [Bb] to match either B or b right after the leading space.

The second example "Crome_beta" adds _ as a separator:

'[ _][Bb]eta'

The third example, "Crome beta2", and the fourth, "Crome_betaversion", are already covered by the last regexp.

The fifth example "Crome 3beta" forces us to change the pattern this way:

'[ _]\d*[Bb]eta'

where \d is shorthand for [0-9] and * allows zero or more occurrences of \d.

The sixth example, "CromeBeta2.3", shows that Beta may have no preceding _ or space and simply start with a capital letter. We cover it with the | construct, which works like the or operator in Python:

'[ _]\d*[Bb]eta|Beta'

The seventh example, "Beta Crome 4", is matched by the last regexp (since it starts with Beta). But the name could also be "beta Chrome 4", so we change the pattern this way:

'[ _]\d*[Bb]eta|Beta|^beta '

We don't use ^[Bb]eta since Beta is already covered.
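
To make the progression easier to follow, here is a small sketch that lists which example names each intermediate pattern still misses. The names examples, steps and unmatched are just for illustration; the patterns are the ones built up above.

```python
import re

examples = [
    "Crome beta", "Crome_beta", "Crome beta2", "Crome_betaversion",
    "Crome 3beta", "CromeBeta2.3", "Beta Crome 4", "beta Chrome 4",
]

# The pattern after each step of the walk-through.
steps = [
    r" [Bb]eta",
    r"[ _][Bb]eta",
    r"[ _]\d*[Bb]eta",
    r"[ _]\d*[Bb]eta|Beta",
    r"[ _]\d*[Bb]eta|Beta|^beta ",
]


def unmatched(pattern):
    """Return the example names that the pattern does not match."""
    return [name for name in examples if re.search(pattern, name) is None]


for pattern in steps:
    print(repr(pattern), "misses:", unmatched(pattern))
```

Each step should shrink the list of misses until the final pattern matches every example.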

Also, I should mention, we can't use re.I since we have to differentiate between beta and Beta in the regex.
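
A short sketch of why re.I does not fit here: with the flag on, the bare Beta alternative also matches the lowercase beta inside "tibetans", which is exactly the false positive from the question. The variable names are illustrative only.

```python
import re

PATTERN = r"[ _]\d*[Bb]eta|Beta|^beta "

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.I)

name = "tibetans product"

# Without re.I, none of the three alternatives matches.
print(case_sensitive.search(name))

# With re.I, the "Beta" alternative matches the "beta" inside "tibetans".
print(case_insensitive.search(name))
```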

So, the test code is (for Python 2.7):

from __future__ import print_function
import re
import sys

# Each comma must come before the comment; otherwise Python joins
# adjacent string literals into a single list element.
match_tests = [
    "Crome beta",
    "Chrome Beta",
    "Crome_beta",
    "Crome beta2",
    "Crome_betaversion",
    "Crome 3beta",
    "Crome 3Beta",
    "CromeBeta2.3",
    "Beta Crome 4",
    "beta Chrome ",
    "Cromebeta2.3",  # no match
    "betamax",  # no match
    "Betamax",
]

compiled = re.compile(r'[ _]\d*[Bb]eta|Beta|^beta ')
for test in match_tests:
    search_result = compiled.search(test)
    if search_result is not None:
        print("{}: OK".format(test))
    else:
        print("{}: No match".format(test), file=sys.stderr)

I don't see any need for a negative lookbehind. Also, you wrapped beta in a capturing group (the parentheses). There is no need for that either; it just adds overhead to the regexp.


It seems you actually have two concepts in the product name string: product and version, separated by whitespace or an underscore, judging from the examples you gave. Use a regex that splits the two concepts, and search for the word beta only in the version part.
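
One way to read this suggestion, as a rough sketch only: split_name and has_beta_version are made-up names, and the code assumes the product comes first and that the first run of spaces or underscores is the separator. Under that assumption it will not flag names such as "CromeBeta2.3" (no separator) or "Beta Crome 4" (beta in the product part).

```python
import re

SEPARATOR = re.compile(r"[ _]+")   # assumed separator: spaces or underscores
BETA = re.compile(r"beta", re.I)


def split_name(product_name):
    """Split on the first separator run into (product, version)."""
    parts = SEPARATOR.split(product_name.strip(), maxsplit=1)
    product = parts[0]
    version = parts[1] if len(parts) > 1 else ""
    return product, version


def has_beta_version(product_name):
    """Search for beta only in the version part of the name."""
    _, version = split_name(product_name)
    return BETA.search(version) is not None


for name in ["Crome beta", "Crome_betaversion", "Crome 3beta", "tibetans product"]:
    print(name, "->", split_name(name), has_beta_version(name))
```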


"[Bb]eta(\d+|$|version)|^[Bb]eta "

Test with grep:

kent$  cat a                                            
Crome beta
Crome_beta
Crome beta2
Crome_betaversion
Crome 3beta
CromeBeta2.3
tibetans product
Beta Crome 4


kent$  grep -P "[Bb]eta(\d+|$|version)|^[Bb]eta " a     
Crome beta
Crome_beta
Crome beta2
Crome_betaversion
Crome 3beta
CromeBeta2.3
Beta Crome 4
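
Since the question asks for Python, here is a sketch of the same check with the re module, using the same pattern and the same input lines as the grep test above. The names pattern, names and matched are only for illustration.

```python
import re

# Same pattern as in the grep -P test.
pattern = re.compile(r"[Bb]eta(\d+|$|version)|^[Bb]eta ")

names = [
    "Crome beta",
    "Crome_beta",
    "Crome beta2",
    "Crome_betaversion",
    "Crome 3beta",
    "CromeBeta2.3",
    "tibetans product",
    "Beta Crome 4",
]

# Keep only the names the pattern finds a match in.
matched = [name for name in names if pattern.search(name)]
for name in matched:
    print(name)
```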