Regular expression to match Razor like expressions

Source: https://www.devze.com, 2023-04-12 20:04 (reposted from the web)
I've been trying to figure out how to match Razor-like embedded expressions. This isn't true Razor syntax, just something similar.

Example:

Given the following string:

This @ShouldMatch1 and this @ShouldMatch2 and this @((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this @(ShouldMatch4))

  • Match and Capture:
    • ShouldMatch1
    • ShouldMatch2
    • ShouldMatch3
    • ShouldMatch4

Basically, here are the requirements:

  • if it starts with @ and then [a-zA-Z]+[0-9]* then I want to match it.
  • if it starts with @(, then I only want to match if it's followed by [a-zA-Z]+[0-9]* and then a ).

Here's what I have as a start. It works for the most part, but it also matches ShouldNotMatch2:

\@[(]?([a-zA-Z]+[0-9]*[)]*)
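To see exactly where the attempt goes wrong, here is a small reproduction. It is only a sketch in Python (the question doesn't name an engine, so that choice is an assumption); the names TEXT, ATTEMPT and attempt_captures are made up for illustration. Because [(]? and [)]* are independent of each other, an opening parenthesis is never required to be closed, and any trailing parentheses end up inside the capture.

```python
import re

# Sample string from the question.
TEXT = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 "
    "and this @(ShouldMatch4))"
)

# The attempt from the question, unchanged.
ATTEMPT = re.compile(r"\@[(]?([a-zA-Z]+[0-9]*[)]*)")

# findall returns the contents of the single capturing group per match.
attempt_captures = ATTEMPT.findall(TEXT)
for capture in attempt_captures:
    print(repr(capture))
```

Besides ShouldNotMatch2, the captures for ShouldMatch3 and ShouldMatch4 carry their closing parentheses along, since [)]* sits inside the capturing group.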


If your regex engine supports conditionals:

@(\()?([A-Za-z]+[0-9]*)(?(1)\))

Explanation:

@           # Match @
(\()?       # Optionally match `(` and capture in group 1
(           # Match and capture in group 2
 [A-Za-z]+  # 1+ ASCII letters
 [0-9]*     # 0+ ASCII digits
)           # End of capturing group
(?(1)       # If group 1 participated in the match
 \)         # match a closing parenthesis
)           # End of conditional
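Python's re module is one engine that supports this conditional syntax, so the pattern can be tried there as-is. A minimal sketch, assuming Python; the names TEXT, CONDITIONAL and conditional_ids are illustrative only:

```python
import re

# Sample string from the question.
TEXT = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 "
    "and this @(ShouldMatch4))"
)

CONDITIONAL = re.compile(
    r"""
    @                    # literal @
    (\()?                # group 1: optional opening parenthesis
    ([A-Za-z]+[0-9]*)    # group 2: the identifier
    (?(1)\))             # if group 1 took part, require a closing parenthesis
    """,
    re.VERBOSE,
)

# Group 2 holds the identifier whether or not the parentheses were present.
conditional_ids = [m.group(2) for m in CONDITIONAL.finditer(TEXT)]
print(conditional_ids)
```

On the sample string this yields the four ShouldMatch identifiers and nothing else.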


This code:

#!/usr/bin/env perl

$_ = <<'LA VISTA, BABY';  # the Terminator, of course :)
    This @ShouldMatch1 and this @ShouldMatch2 and this @((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this @(ShouldMatch4))'
LA VISTA, BABY

print $+{id}, "\n" while m{
    @ (?: \(  (?<id> \pL+ \d* )  \)
        |     (?<id> \pL+ \d* )
      )
}gx;

When run will print out your desired output of:

ShouldMatch1
ShouldMatch2
ShouldMatch3
ShouldMatch4

EDIT

Here is a five-stage elaboration of the previous solution, from simpler to fancier. However, it is still unclear what the real rule for the identifier is, or should be. Some candidate rules:

  1. \pL+\d* (the original question): starts with letters and may end with digits, but doesn’t have to.
  2. \pL+\d+ makes the digits mandatory.
  3. \pL[\pL\d]* must start with a letter, but then allows letters and digits to intermix.
  4. \pL\w* adds underscore to the things that can come after the initial letter. Technically, \w according to UTS#18 is supposed to be all letters, all marks, the letter numbers (like Roman numerals), all decimal numbers, plus all connector punctuation.
  5. \w+ allows alphabetics, digits, or underscores throughout, without restriction. That’s what word characters are according to the standard.
  6. (?=\pL)\w+(?<=\d) adds the constraint that it must start with a letter and end with a digit, but otherwise can be any combination of word characters.

No matter which of those is actually needed — it’s quite unclear — it should be easy enough to update the code to use the appropriate variant, especially in the last two versions where the definition of what counts as these funny identifiers occurs in just one place in the code. That makes it easy to change it in just one place and gets rid of update-incoherency bugs. Programmers should always strive to factor out duplicate code, and this is true no matter their programming language, even regexes, because abstraction is fundamental to good design.
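For readers who want to try the candidate rules side by side, here is a rough sketch in Python rather than Perl. Several things in it are assumptions and not part of the answer: Python's built-in re has no \pL and no (?(DEFINE)...), so [^\W\d_] stands in as an approximation of "a letter" and plain string formatting plays the role of the single definition; re also rejects duplicate group names, so two numbered groups are used. ID_RULES, find_ids and the EXTRA sample are hypothetical names and data invented for the demonstration.

```python
import re

# Sample string from the question.
TEXT = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 "
    "and this @(ShouldMatch4))"
)
# Hypothetical extra sample (not from the post) that tells the rules apart.
EXTRA = "@abc @a1b2 @x_1 @(n7)"

LETTER = r"[^\W\d_]"  # approximation of \pL for Python's re module

# The six candidate identifier rules, numbered as in the list above.
ID_RULES = {
    1: LETTER + r"+\d*",
    2: LETTER + r"+\d+",
    3: LETTER + r"[^\W_]*",
    4: LETTER + r"\w*",
    5: r"\w+",
    6: r"(?=" + LETTER + r")\w+(?<=\d)",
}


def find_ids(id_pattern, text):
    """Match @(id) or @id, with the id rule supplied in exactly one place."""
    matcher = re.compile(r"@(?:\(({0})\)|({0}))".format(id_pattern))
    # Exactly one of the two groups takes part in each match.
    return [m.group(1) or m.group(2) for m in matcher.finditer(text)]


for number, rule in ID_RULES.items():
    print(number, find_ids(rule, TEXT), find_ids(rule, EXTRA))
```

On the question's own sample, only rule 5 behaves differently (it also accepts 1ShouldNotMatch3), which is part of why the sample alone cannot settle which rule is intended.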

Here then is the 5-way version:

#!/usr/bin/env perl

$_ = <<'LA VISTA, BABY';  # the Terminator, of course :)
    This @ShouldMatch1 and this @ShouldMatch2 and this @((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 and this @(ShouldMatch4))'
LA VISTA, BABY

$mask  = "Version %d: %s\n";
$verno = 0;

##########################################################
# Simplest version: nothing fancy
++$verno;
printf($mask, $verno, $+) while /\@(?:(\pL+\d*)|\((\pL+\d*)\))/g;
print "\n";

##########################################################
# More readable version: add /x for spacing out regex contents
++$verno;
printf($mask, $verno, $+) while / \@ (?: (\pL+\d*) | \( (\pL+\d*) \) ) /xg;
print "\n";

##########################################################
# Use vertical alignment for greatly improved legibility,
# plus named captures for convenience and self-documentation
++$verno;
printf($mask, $verno, $+{id}) while m{
    @ (?: \(  (?<id> \pL+ \d* )  \)
        |     (?<id> \pL+ \d* )
      )
}xg;
print "\n";

##########################################################
# Define the "id" pattern separately from executing it
# to avoid code duplication. Improves maintainability.
# Likely requires Perl 5.10 or better, or PCRE, or PHP.
++$verno;
printf($mask, $verno, $+)     while m{
    (?(DEFINE)  (?<id> \pL+ \d* )   )

    @ (?: \( ((?&id)) \)
        |    ((?&id))
      )
}xg;
print "\n";

##########################################################
# This time we use a named capture that is different from
# the named group used for the definition.
++$verno;
printf($mask, $verno, $+{id}) while m{
    (?(DEFINE)  (?<word> \pL+ \d* )   )

    @ (?: \( (?<id> (?&word) ) \)
        |    (?<id> (?&word) )
      )
}xg;

When run on Perl v5.10 or better, that duly produces:

Version 1: ShouldMatch1
Version 1: ShouldMatch2
Version 1: ShouldMatch3
Version 1: ShouldMatch4

Version 2: ShouldMatch1
Version 2: ShouldMatch2
Version 2: ShouldMatch3
Version 2: ShouldMatch4

Version 3: ShouldMatch1
Version 3: ShouldMatch2
Version 3: ShouldMatch3
Version 3: ShouldMatch4

Version 4: ShouldMatch1
Version 4: ShouldMatch2
Version 4: ShouldMatch3
Version 4: ShouldMatch4

Version 5: ShouldMatch1
Version 5: ShouldMatch2
Version 5: ShouldMatch3
Version 5: ShouldMatch4

It should be easy to update the definition of an id to match whatever is actually needed.

Note that some regex engines make it gratuitously cumbersome to specify properties. For example, they might require \p{L} instead of the normal \pL. That’s a Huffman‐encoding failure in their misdesign, because you always want the most commonly used form to be the shortest form. With \pL or \pN being just one character longer than \w or \d, people are much more inclined to reach for the improved versions, but things like \p{L} and \p{N} are now three characters longer than \w and \d, plus have unnecessary visual clutter to boot. You shouldn’t have to pay triple just to get the “normal” ᴀᴋᴀ most common case. :(

If you are going to put the ugly braces in, then you might as well write the thing out in full as \p{Letter} and \p{Number}. After all, “in for a penny, in for a pound,” as they say.
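Some engines go further and offer no property escapes at all. Python's built-in re module is one of them: \pL is rejected as a bad escape in current Python 3 versions. The sketch below, assuming Python, shows a common workaround: [^\W\d_] (word characters minus digits and underscore) as an approximation of \pL, compared with the ASCII-only class from the question. The sample word and the variable names are made up for illustration.

```python
import re

# Hypothetical sample with a non-ASCII letter (U+00EF, i with diaeresis).
SAMPLE = "@na\u00efve1"

# ASCII-only rule from the question versus a Unicode-aware approximation.
ascii_id = re.search(r"@([A-Za-z]+[0-9]*)", SAMPLE).group(1)
unicode_id = re.search(r"@([^\W\d_]+\d*)", SAMPLE).group(1)

# \pL itself is not available in the built-in re module.
try:
    re.compile(r"\pL")
    property_escape_supported = True
except re.error:
    property_escape_supported = False

print(ascii_id, unicode_id, property_escape_supported)
```

The ASCII class stops at the first non-ASCII letter, while the approximation keeps going to the end of the identifier.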


It's either @(foo1) or @foo1, so use alternation:

@([a-zA-Z]+[0-9]*|\([a-zA-Z]+[0-9]*\))

and get rid of the parentheses in a second step (simple string replace).
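A minimal sketch of this two-step approach, assuming Python and writing the digits as [0-9]* so that they stay optional, as the question requires. The names ALTERNATION, raw_captures and stripped_ids are illustrative only.

```python
import re

# Sample string from the question.
TEXT = (
    "This @ShouldMatch1 and this @ShouldMatch2 and this "
    "@((ShouldNotMatch1)andthis@ShouldMatch3) and this @(ShouldNotMatch2 "
    "and this @1ShouldNotMatch3 and this @((ShouldNotMatch4 "
    "and this @(ShouldMatch4))"
)

# Step 1: match either form; the parenthesized form keeps its parentheses.
ALTERNATION = re.compile(r"@([a-zA-Z]+[0-9]*|\([a-zA-Z]+[0-9]*\))")
raw_captures = ALTERNATION.findall(TEXT)

# Step 2: drop the surrounding parentheses with plain string handling.
stripped_ids = [capture.strip("()") for capture in raw_captures]
print(stripped_ids)
```

Stripping is safe here because the identifier itself can never contain a parenthesis under this rule.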
