开发者

java.lang.StackOverflowError while using a RegEx to Parse big strings

开发者 https://www.devze.com 2023-04-06 10:00 出处:网络
This is my Regex ((?:(?:\'[^\']*\')|[^;])*)[;] It tokenizes a string on semicolons. For example, Hello world; I am having a problem; using regex;

This is my Regex

((?:(?:'[^']*')|[^;])*)[;]

It tokenizes a string on semicolons. For example,

Hello world; I am having a problem; using regex;

Result is three strings

Hello world
I am having a problem
using regex

But when I use a large input string I get this error

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Patte开发者_C百科rn.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)

How is this caused and how can I solve it?


Unfortunately, Java's builtin regex support has problems with regexes containing repetitive alternative paths (that is, (A|B)*). This is compiled into a recursive call, which results in a StackOverflow error when used on a very large string.

A possible solution is to rewrite your regex to not use a repititive alternative, but if your goal is to tokenize a string on semicolons, you don't need a complex regex at all really, just use String.split() with a simple ";" as the argument.


If you really need to use a regex that overflows your stack, you can increase the size of your stack by passing something like -Xss40m to the JVM.


It might help to add a + after the [^;], so that you have fewer repetitions.

Isn't there also some construct that says “if the regular expression matched up to this point, don't backtrace”? Maybe that comes in handy, too. (Update: it is called possessive quantifiers).

A completely different alternative is to write a utility method called splitQuoted(char quote, char separator, CharSequence s) that explicitly iterates over the string and remembers whether it has seen an odd number of quotes. In that method you could also handle the case that the quote character might need to be unescaped when it appears in a quoted string.

'I'm what I am', said the fox; and he disappeared.
'I\'m what I am', said the fox; and he disappeared.
'I''m what I am', said the fox; and he disappeared.
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号