
Find last match with python regular expression

Developer https://www.devze.com 2022-12-29 23:19 Source: Internet
I want to match the last occurrence of a simple pattern in a string, e.g.

import re

list = re.findall(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2")
print("last match:", list[len(list)-1])

However, if the string is very long, a huge list of matches is generated. Is there a more direct way to match the last occurrence (here, the second " AAAA "), or should I use this workaround?


You could use $, which anchors the match at the end of the string (or just before a trailing newline):

>>> s = """foo bar AAAA
foo2 AAAA bar2"""
>>> re.findall(r"\w+ AAAA \w+$", s)
['foo2 AAAA bar2']

Also, note that list is a bad name for your variable, as it shadows the built-in type. To access the last element of a list, you can just use the index [-1]:

>>> lst = [2, 3, 4]
>>> lst[-1]
4
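One caveat about the $ approach above, shown as a minimal sketch: the anchor only helps when the last occurrence is also the very end of the string. The second input string below is made up for illustration and is not part of the question.

```python
import re

pattern = re.compile(r"\w+ AAAA \w+$")

ends_with_match = "foo bar AAAA foo2 AAAA bar2"
has_trailing_text = "foo bar AAAA foo2 AAAA bar2 and more"  # hypothetical input

# The last occurrence sits at the very end, so the anchored pattern finds it.
print(pattern.findall(ends_with_match))    # ['foo2 AAAA bar2']

# Text follows the last occurrence, so the anchored pattern finds nothing.
print(pattern.findall(has_trailing_text))  # []
```

If text may follow the last occurrence, one of the other approaches on this page is probably a better fit.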


You can avoid the building of a list just by iterating over all matches and keeping the last match:

import re

def match_last(orig_string, re_prefix, re_suffix):

    # first use positive-lookahead for the regex suffix
    re_lookahead= re.compile(f"{re_prefix}(?={re_suffix})")

    match= None
    # then keep the last match
    for match in re_lookahead.finditer(orig_string):
        pass

    if match:
        # now we return the proper match

        # first compile the proper regex…
        re_complete= re.compile(re_prefix + re_suffix)

        # …because the known start offset of the last match
        # can be supplied to re_complete.match
        return re_complete.match(orig_string, match.start())

    return match

The function returns either the last match or None.
This works for all combinations of pattern and searched string, as long as any possibly-overlapping regex parts are provided as re_suffix; in this case, \w+.

>>> match_last(
    "foo bar AAAA foo2 AAAA bar2",
    r"\w+ AAAA ", r"\w+")
<re.Match object; span=(13, 27), match='foo2 AAAA bar2'>
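To see why the function above splits the pattern into a prefix and a lookahead suffix, here is a rough sketch that keeps only the last finditer match without building a list. The helper name last_finditer_match is hypothetical, and using collections.deque with maxlen=1 is my own substitute for the for/pass loop, not part of the answer.

```python
import re
from collections import deque

def last_finditer_match(pattern, text):
    """Return the last non-overlapping match of pattern in text, or None."""
    # A deque with maxlen=1 discards every match except the most recent one.
    tail = deque(re.finditer(pattern, text), maxlen=1)
    return tail[0] if tail else None

s = "foo bar AAAA foo2 AAAA bar2"

# Plain pattern: the first match consumes "foo2", so the second " AAAA " is missed.
print(last_finditer_match(r"\w+ AAAA \w+", s).group())      # bar AAAA foo2

# Prefix plus lookahead: "foo2" is not consumed, so the last match starts at it.
print(last_finditer_match(r"\w+ AAAA (?=\w+)", s).start())  # 13
```

The start offset 13 is the same offset that match_last passes to re_complete.match to obtain the full 'foo2 AAAA bar2' match.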


The built-in re library has no feature that supports right-to-left string parsing; the input string is only searched for a pattern from left to right.

The PyPI regex module does support this, however, through the regex.REVERSE flag or its inline variation, (?r):

import regex  # third-party module from PyPI

s = "foo bar AAAA foo2 AAAA bar2"
print(regex.search(r"(?r)\w+ AAAA \w+$", s).group())
# => foo2 AAAA bar2

With the re module, you can get to the end of the string quickly with the ^[\s\S]* construct and let backtracking find the pattern that you'd like to capture into a separate group. However, backtracking may gobble part of the match (it stops giving text back as soon as all subsequent patterns match), and if the text is very large and there is no match, backtracking may become catastrophic. Only use this trick if your input string always matches, or if it is short and the custom pattern does not rely much on backtracking:

print(re.search(r"(?:^[\s\S]*\W)?(\w+ AAAA \w+)$", s).group(1))
# => foo2 AAAA bar2

Here, (?:^[\s\S]*\W)? matches an optional sequence: the start of the string, then any 0 or more chars, followed by a non-word char (\W). The \W is needed to make backtracking go all the way back to a non-word char, and the whole group must be optional because the match might start at the start of the string.
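The following sketch illustrates both points: what "gobbling part of the match" looks like when \W is left out, and why the prefix has to be optional. The short string "x AAAA y" is a made-up input for the second point.

```python
import re

s = "foo bar AAAA foo2 AAAA bar2"

# Without \W, the greedy [\s\S]* gives back as little text as possible,
# so the leading \w+ is left with only the last character of "foo2".
print(re.search(r"^[\s\S]*(\w+ AAAA \w+)$", s).group(1))
# 2 AAAA bar2

# With \W, backtracking has to retreat to the space before "foo2".
print(re.search(r"(?:^[\s\S]*\W)?(\w+ AAAA \w+)$", s).group(1))
# foo2 AAAA bar2

# The prefix is optional, so a match that begins at the start of the string still works.
print(re.search(r"(?:^[\s\S]*\W)?(\w+ AAAA \w+)$", "x AAAA y").group(1))
# x AAAA y
```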

See the Python demo.


I wasn't sure if your original regex would give you what you wanted. So sorry if I'm late to the party. But others may find this useful too.

import re

p = r"AAAA(?=\s\w+)"  # revised per comment from @Jerry
p2 = r"\w+ AAAA \w+"
s = "foo bar AAAA foo2 AAAA bar2"
l = re.findall(p, s)
l2 = re.findall(p2, s)
print('l: {l}'.format(l=l))

# print(f'l: {l}') is nicer, but online interpreters sometimes don't support it.
# https://www.onlinegdb.com/online_python_interpreter
# I'm using Python 3.

print('l2: {l}'.format(l=l2))
for m in re.finditer(p, s):
    print(m.span())
    # A span of (n, m) covers characters n to m-1 (zero-based index).
    # So (8, 12) means:
    #   => indices 8 to 11 (zero-based)
    #   => the 9th to 12th characters (conventional one-based counting)
print(re.findall(p, s)[-1])

Outputs:

l: ['AAAA', 'AAAA']
l2: ['bar AAAA foo2']
(8, 12)
(18, 22)   
AAAA

The reason you get two results here instead of one in the original is the (?=) special sauce.

It's called a positive lookahead. It does not 'consume' text (i.e. advance the cursor) when it matches during regex evaluation, so the engine resumes from the position just before the lookahead.

Although positive lookaheads are written in parentheses, they do not capture anything, much like a non-capturing group.

So, although the whole pattern is matched, the results omit the text checked by the lookahead: the whitespace character matched by \s (which covers [ \t\n\r\f\v]) and the word characters matched by \w+.

So I only get back AAAA each time.

p2 here represents the original pattern from the code of @SDD, the person who asked the question.

That pattern consumes foo2, so the second AAAA cannot match: the cursor has already advanced past foo2 when the regex engine resumes for its next match.
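As a small illustration of the consumption issue described above, the sketch below wraps the original pattern in a zero-width lookahead with a capturing group. This variation (including the added \b) is my own and not part of the answer; it reports overlapping occurrences, but it still builds a list.

```python
import re

s = "foo bar AAAA foo2 AAAA bar2"

# Consuming pattern: "foo2" belongs to the first match and cannot start a second one.
print(re.findall(r"\w+ AAAA \w+", s))
# ['bar AAAA foo2']

# Zero-width lookahead with a capturing group: nothing is consumed, so each
# word boundary is tried as a possible start and overlapping hits are reported.
print(re.findall(r"(?=\b(\w+ AAAA \w+))", s))
# ['bar AAAA foo2', 'foo2 AAAA bar2']
```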


I recommend taking a look at Moondra's YouTube videos if you want to dig deeper.

He has done a very thorough 17-part series on Python regexes, beginning here


Here's a link to an online Python Interpreter.


Another fast way is to use search and group:

>>> re.search(r'\w+ AAAA \w+$', "foo bar AAAA foo2 AAAA bar2").group(0)
'foo2 AAAA bar2'

What it does:

  1. It uses the pattern \w+ AAAA \w+$, which matches 'AAAA' together with the words on either side of it (\w+, twice) and, because of $, only where that match ends at the end of the string.

  2. After matching, you call the group method of the returned match object (shown as re.Match in recent Python versions and as _sre.SRE_Match in older ones) to get the matched text. Group 0 is the whole match, and search returns only one match.
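One practical detail about step 2, sketched under the assumption that the input may not always contain the pattern: re.search returns None when nothing matches, so calling .group(0) directly would raise AttributeError. The helper name last_at_end is hypothetical.

```python
import re

def last_at_end(text):
    """Return the matched text if the pattern ends the string, else None."""
    m = re.search(r"\w+ AAAA \w+$", text)
    # re.search returns None when nothing matches, so guard before calling group().
    return m.group(0) if m else None

print(last_at_end("foo bar AAAA foo2 AAAA bar2"))  # foo2 AAAA bar2
print(last_at_end("no marker here"))               # None
```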

Here is the regex101 of it.

Here are the timings for all of the answers (except for JGFMK's answer, since it is hard to time):

>>> timeit.timeit(lambda: re.findall(r"\w+ AAAA \w+$", s),number=1000000) # SilentGhost
5.783595023876842
>>> timeit.timeit('import re\nfor match in re.finditer(r"\w+ AAAA \w+", "foo bar AAAA foo2 AAAA bar2"):pass',number=1000000) # tzot
5.329235373691631
>>> timeit.timeit(lambda: re.search('\w+ AAAA \w+$',"foo bar AAAA foo2 AAAA bar2").group(0),number=1000000) # mine (U9-Forward)
5.441731174121287
>>> 

I measured all the timings with the timeit module, using number=1000000 so that each measurement runs long enough to compare.
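The timings above use a very short string, while the question is about very long ones. Below is a rough sketch of how the list-based and iterator-based approaches could be compared on a longer input. The test string, the function names, and the repetition counts are all assumptions of mine, and no results are shown because they depend on the machine.

```python
import re
import timeit

pattern = re.compile(r"\w+ AAAA \w+")
# Hypothetical long input with many occurrences, so the findall list is large.
text = "foo bar AAAA foo2 " * 5000

def last_via_findall():
    # Builds the full list of matches, then takes the last element.
    return pattern.findall(text)[-1]

def last_via_finditer():
    # Walks the matches one by one and keeps only the most recent one.
    last = None
    for last in pattern.finditer(text):
        pass
    return last.group() if last else None

# Both strategies must agree before their timings are worth comparing.
assert last_via_findall() == last_via_finditer()

for func in (last_via_findall, last_via_finditer):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```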
