Can we really do without lazy quantifiers?

Developer (https://www.devze.com), 2023-04-11 01:25. Source: Internet
Many people say we can do without lazy quantifiers in regular expressions, but I've just run into a problem that I can't solve without them (I'm using sed here).

The string I want to process is composed of substrings separated by the word rate, for example:

anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate

I want to replace those substrings (apart from the word 'rate') with 3 dashes (---) each; the result should be:

---rate---rate---rate

From what I understand (I don't know Perl), it can be easily done using lazy quantifiers. In vim there are lazy quantifiers too; I did it using this command

:s/.\{-}rate/---rate/g

where \{-} tells vim to match as few as possible.

However, vim is a text editor, and I need to run the script on many machines, some of which have no Perl installed. It could also be solved if the regex could be told not to match a whole group, as in .*[^(rate)]rate, but that did not work.
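
To make the failure concrete, here is a small sketch of what both attempts do in sed, using the sample line from above. The variable names are only for illustration, and the results assume a sed with standard greedy (leftmost-longest) matching.

```shell
# Sample line copied from the question.
line='anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate'

# Greedy .* runs through to the last "rate", so only one replacement happens.
greedy=$(printf '%s\n' "$line" | sed 's/.*rate/---rate/g')

# [^(rate)] is a bracket expression: ONE character that is not (, r, a, t, e or ).
# Here it matches the space before the final "rate"; .* still swallows the rest.
bracket=$(printf '%s\n' "$line" | sed 's/.*[^(rate)]rate/---rate/g')

printf '%s\n%s\n' "$greedy" "$bracket"
```

Both commands collapse the whole line into a single ---rate, which is why neither works here.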

Any ideas how to achieve this using POSIX regex, or is it impossible?


In a case like this, I would use split():

perl -lne 'print join("rate", map { length($_) ? "---" : "" } split(/rate/, $_, -1))' [input-file]


Are there any characters that are guaranteed not to be in the input? For instance, if '!' can't occur, you could transform the input to substitute that unique character, and then do a global replace on the transformed input:

sed 's/rate/!/g' < input | sed -e 's/[^!][^!]*/---/g' -e 's/!/rate/g'

Another alternative is to use awk's split function in an analogous way to the perl suggestion above, assuming awk is any more reliably available than perl.

awk '
{   # Split on the delimiter; non-empty pieces become "---", empty ones stay empty.
    n = split($0, x, /rate/)
    ans = ""
    for (i = 1; i <= n; i++)
        ans = ans (i > 1 ? "rate" : "") (x[i] == "" ? "" : "---")
    print ans
}'


It's not easy without using lazy quantifiers or negative lookaheads (neither of which POSIX supports), but this seems to work.

([^r]*((r($|[^a]|a([^t]|$)|at([^e]|$))))?)+rate

I vaguely recall POSIX character classes being a bit persnickety. You may need to alter the character classes in that regex if they're not already POSIX-compliant.
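
As a rough sketch of how that expression could be plugged into sed: it uses extended-regex syntax, so this assumes a sed that accepts -E (GNU and BSD sed do; older POSIX editions did not list the option). The function name mask_with_ere is made up for illustration.

```shell
# mask_with_ere is a hypothetical helper name; it filters lines from stdin.
mask_with_ere() {
    # Prefix: any run of text in which no "r" starts the word "rate",
    # followed by the literal delimiter "rate".
    sed -E 's/([^r]*((r($|[^a]|a([^t]|$)|at([^e]|$))))?)+rate/---rate/g'
}

printf '%s\n' 'anfhwe9.<<76xnf9247 rate 7dh3_29snpq+074j rate 48jdhsn3gus8 rate' | mask_with_ere
```

On the sample line from the question this should print ---rate---rate---rate. Other inputs, especially ones with an r close to the delimiter, are worth testing before relying on it.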


The fact that you don't care about the contents of the substrings opens up a lot of options. For example, to add to Bob Lied's suggestion — even if '!' can occur in the input, you can start by changing it to something else:

sed -e 's/!/./g' -e 's/rate/!/g' -e 's/[^!]\+/---/g' -e 's/!/rate/g' <input >output
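
One portability note, since the question asks for POSIX: \+ is a GNU extension in basic regular expressions. Below is a sketch of the same pipeline using only POSIX BRE syntax, wrapped in a function whose name (mask_substrings) is just an illustration.

```shell
# mask_substrings is a hypothetical helper; "!" is the placeholder character.
mask_substrings() {
    # 1. free up the placeholder, 2. turn each delimiter into the placeholder,
    # 3. collapse every non-empty run of other characters, 4. restore the delimiter.
    sed -e 's/!/./g' \
        -e 's/rate/!/g' \
        -e 's/[^!][^!]*/---/g' \
        -e 's/!/rate/g'
}

printf '%s\n' 'a!b rate c rate' | mask_substrings
```

Here [^!][^!]* stands in for [^!]\+; the interval form [^!]\{1,\} would be an equivalent POSIX spelling.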


With awk:

awk -Frate '{ 
  for (i = 0; ++i <= NF;) 
    $i = (i == 1 || i == NF) && $i == x ? x : "---" 
  }1' OFS=rate infile   


Or:

awk 'BEGIN {OFS=FS="rate"} {for (i=1; i<=NF-1; i++) {$i = "---"}; print}'
