Java regex mix two patterns

Developer https://www.devze.com 2023-04-11 13:51 Source: Internet

How can I get this pattern to work:

Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");

Basically, this will split my sentence on any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:

(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])

pattern explained here: Java regex patterns

which matches hyphenated words like this: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?


Unfortunately it seems like you can't merge both expressions, at least as far as I know.

However, maybe you can reformulate your problem.

If, for example, you want to split between words (which can contain hyphens), try this expression:

(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)

This should match any sequence of characters that are neither letters nor a minus, or any minus that is followed by one or more non-letter characters, or a minus that is the first or last character in the input.

Using this expression for a split should result in this:

input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}

The regex might need some additional optimization but it is a start.
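
To try the split described above, here is a minimal sketch. The class name HyphenSplitDemo and the method splitWords are hypothetical names chosen for illustration; only the regular expression and the sample input come from the answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex and the input come from the answer.
final class HyphenSplitDemo {
    // Separator: a run of characters that are neither letters nor '-',
    // or a '-' followed by non-letters, or a '-' at the start or end of the input.
    private static final Pattern SEPARATOR =
            Pattern.compile("(?>[^\\p{L}-]+|-[^\\p{L}]+|^-|-$)");

    static String[] splitWords(String input) {
        // Pattern.split drops trailing empty strings but keeps inner ones.
        return SEPARATOR.split(input);
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        System.out.println(Arrays.toString(splitWords(input)));
    }
}
```

The empty string between "match" and "foo" appears because "," and "--" are matched as two separate, adjacent separators.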

Edit: This expression should get rid of the empty string in the split:

(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)

The first part would now read as "any non-letter character that is not a minus, followed by any number of non-letter characters" and should match .-- as well.
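
A small sketch of the revised separator, assuming the same sample input as before. The class name HyphenSplitNoEmptyDemo and the method splitWords are hypothetical; the second call uses a made-up input "foo.--bar" to exercise the .-- case.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the answer.
final class HyphenSplitNoEmptyDemo {
    // First alternative now starts with one character that is neither a letter
    // nor '-', then consumes any further non-letters (including '-').
    private static final Pattern SEPARATOR =
            Pattern.compile("(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)");

    static String[] splitWords(String input) {
        return SEPARATOR.split(input);
    }

    public static void main(String[] args) {
        // ",--" is now consumed as a single separator, so no empty token remains.
        System.out.println(Arrays.toString(
                splitWords("aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo")));
        // Made-up input: ".--" is matched as one separator.
        System.out.println(Arrays.toString(splitWords("foo.--bar")));
    }
}
```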

Edit: in case you want to match words that could potentially contain hyphens, try this expression:

(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)

This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one letter ((?:-\p{L}+)*). That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus, or by the end of the input ((?>(?=[^-\p{L}])|$))".
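
For matching instead of splitting, a sketch along these lines could be used. The class name HyphenWordMatchDemo and the method findWords are hypothetical; the regular expression is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the answer.
final class HyphenWordMatchDemo {
    // A word (optionally with inner hyphens) that is not directly adjacent
    // to another letter or '-' on either side.
    private static final Pattern WORD = Pattern.compile(
            "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"));
        System.out.println(findWords("one two-three"));
    }
}
```

Note that with this expression, words that touch a stray minus (such as "no", "match" and "foo" in the sample input) are not matched at all, because the lookarounds reject a neighbouring minus.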
