开发者

Regular expression to match content until multi-character string

开发者 https://www.devze.com 2023-03-13 13:21 Source: web

I've got defective input coming in that looks like this...

foo<p>bar</p>

And I want to normalize it to wrap the leading text in a p tag:

<p>foo</p><p>bar</p>

This is easy enough with a regex replace of /^([^<]+)/ with <p>$1</p>. The problem is that the leading chunk sometimes contains tags other than p, like so:

foo <b>bold</b><p>bar</p>

This should wrap the whole chunk in a new p:

<p>foo <b>bold</b></p><p>bar</p>

But since the simple regex looks only for <, it stops at <b> and spits out:

<p>foo </p><b>bold</b><p>bar</p> <!-- oops -->

So how do I rewrite the regex to match everything up to <p? Apparently the answer involves negative lookahead, but this is a bit too deep for me.

(And before the inevitable "you can't parse HTML with regexes!" comment, the input is not random HTML, but plain text annotated with only the tags <p>, <a>, <b> and <i>, and a/b/i may not be nested.)


I think you actually want positive lookahead. It's really not bad:

/^([^<]+)(?=<p)/

You just want to make sure that the matched chunk is immediately followed by <p, but you don't want to actually consume <p, so you use a lookahead.
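
To see that the lookahead checks for <p without consuming it, the match can be inspected directly. This is a minimal sketch; the variable names are illustrative and not from the original answer:

```javascript
// Same pattern, without the g flag so exec() always starts at index 0.
const lookahead = /^([^<]+)(?=<p)/;

// The lookahead requires "<p" to follow, but "<p" is not part of the match.
const hit = lookahead.exec('foo<p>bar</p>');
console.log(hit[0]);                               // "foo"
console.log('foo<p>bar</p>'.slice(hit[0].length)); // "<p>bar</p>"

// "[^<]+" stops at "<b", which is not "<p", so the whole pattern fails.
const miss = lookahead.exec('foo <b>bold</b><p>bar</p>');
console.log(miss);                                 // null
```

Because the lookahead is zero-width, the replacement only rewrites the leading text and leaves the following <p> untouched.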

Examples:

> var re = /^([^<]+)(?=<p)/g;

> 'foo<p>bar</p>'.replace(re, '<p>$1</p>');
  "<p>foo</p><p>bar</p>"

> 'foo <b>bold</b><p>bar</p>'.replace(re, '<p>$1</p>')
  "foo <b>bold</b><p>bar</p>"

Sorry, wasn't clear enough in my original posting: my expectation was that the "foo bold" bit would also get wrapped in a new p tag, and that's not happening.

Also, every now and then there's input with no p tags at all (just plain foo), and that should also map to <p>foo</p>.

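The easiest way I found to get this working is to use two separate regexps: /^(.+?(?=<p))/, which lazily matches everything up to the first <p, and /^([^<]+)/, which covers input with no <p at all by wrapping the text before the first tag.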

> var re1 = /^(.+?(?=<p))/g,
      re2 = /^([^<]+)/g,
      s = '<p>$1</p>';

> 'foo<p>bar</p>'.replace(re1, s).replace(re2, s);
  "<p>foo</p><p>bar</p>"

> 'foo'.replace(re1, s).replace(re2, s);
  "<p>foo</p>"

> 'foo <b>bold</b><p>bar</p>'.replace(re1, s).replace(re2, s);
  "<p>foo <b>bold</b></p><p>bar</p>"

It's possible to write a single, equivalent regexp by combining re1 and re2:
/^(.+?(?=<p)|[^<]+)/
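
If the combined pattern ends up in real code rather than the console, it could be wrapped in a small function. The sketch below is one way to do that; the name wrapLeadingText and the early return for input that already starts with <p are assumptions added here, not part of the answer:

```javascript
// Hypothetical helper; the name and the guard are illustrative assumptions.
function wrapLeadingText(input) {
  // Input that already starts with <p has no leading chunk to wrap.
  // Without this guard, ".+?" would swallow the first <p>...</p> block
  // whenever a second <p follows it.
  if (input.startsWith('<p')) return input;

  // First alternative: lazily take everything up to the first "<p".
  // Second alternative: no "<p" ahead, so take the text up to the first tag.
  return input.replace(/^(.+?(?=<p)|[^<]+)/, '<p>$1</p>');
}

console.log(wrapLeadingText('foo<p>bar</p>'));             // "<p>foo</p><p>bar</p>"
console.log(wrapLeadingText('foo'));                       // "<p>foo</p>"
console.log(wrapLeadingText('foo <b>bold</b><p>bar</p>')); // "<p>foo <b>bold</b></p><p>bar</p>"
console.log(wrapLeadingText('<p>a</p><p>b</p>'));          // "<p>a</p><p>b</p>"
```

Two limits carry over from the pattern itself. The dot does not match line breaks, so a leading chunk that contains both a line break and an inline tag is only wrapped up to that tag. Likewise, input with an inline tag but no <p at all (say foo <b>bold</b>) is only wrapped up to the first tag.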

> var re3 = /^(.+?(?=<p)|[^<]+)/g,
      s = '<p>$1</p>';

> 'foo<p>bar</p>'.replace(re3, s)
  "<p>foo</p><p>bar</p>"

> 'foo'.replace(re3, s)
  "<p>foo</p>"

> 'foo <b>bold</b><p>bar</p>'.replace(re3, s)
  "<p>foo <b>bold</b></p><p>bar</p>"