Java .split() with regex to match html <a> links

Developer (https://www.devze.com), 2023-04-12 14:19. Source: the web

Related topic: regex

I need to parse a string and escape all html tags except <a> links.

For example:

"Hello, this is <b>A BOLD</b> bit and this is <a href="www.google.com">a google</a> link"

When printed out in my JSP, I want to see the tags printed out as is (i.e. escaped so "A BOLD" is not actually in bold on the page) but the <a> link to be an actual link to Google on the page.

I have got a little method that splits the incoming string based on a regex to match <a> links in various formats (with whitespace, single or double quotes, etc). The regex is as follows:

myString.split("<a\\s[^>]*href\\s*=\\s*[\\\"\\|\\\'][^>]*[\\\"\\|\\\']\\s*>[^<\\/a>]*<\\/a>");

Yes, it's horrid and probably hopelessly inefficient, so I'm open to alternative suggestions, but it does work up to a point. Where it falls down is parsing the link text bit. I want it to accept zero or more occurrences of any characters other than the </a> closing tag, but it is parsing it as zero or more occurrences of any characters other than a "<" or "/" or "a" or ">", i.e. as individual characters rather than the complete </a> word. So it matches with any text that has an "e" in it for example.

The bit in question is: [^<\\/a>]*

How do I change this to match on the entire word, not its constituent characters? I've tried parentheses etc. but nothing works.
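To illustrate the difference described above: a negated character class excludes single characters, whereas a negative lookahead can exclude the whole `</a>` sequence. The following is only a minimal sketch, not a robust HTML parser. The class and method names are hypothetical, and the opening-tag part follows the regex in the question with the quote class simplified to `["']`.

```java
import java.util.regex.Pattern;

public class LinkSplitDemo {
    // Opening <a ... href=...> part, adapted from the question's regex
    // (assumption: the quote class is simplified to ["']).
    private static final String OPEN_TAG =
            "<a\\s[^>]*href\\s*=\\s*[\"'][^>]*[\"']\\s*>";

    // Original link-text part: a negated character class, which rejects the
    // single characters <, /, a and > wherever they occur in the link text.
    public static final Pattern CHAR_CLASS =
            Pattern.compile(OPEN_TAG + "[^</a>]*</a>");

    // Tempered dot: consume one character at a time, but only while the
    // literal sequence "</a>" does not start at the current position.
    public static final Pattern LOOKAHEAD =
            Pattern.compile(OPEN_TAG + "(?:(?!</a>).)*</a>", Pattern.DOTALL);

    // Split the input around every <a> link matched by LOOKAHEAD.
    public static String[] splitOnLinks(String input) {
        return LOOKAHEAD.split(input);
    }

    public static void main(String[] args) {
        String s = "Hello, this is <b>A BOLD</b> bit and this is "
                 + "<a href=\"www.google.com\">a google</a> link";
        // false: the link text "a google" contains the excluded character "a"
        System.out.println(CHAR_CLASS.matcher(s).find());
        for (String part : splitOnLinks(s)) {
            System.out.println("[" + part + "]");
        }
        // [Hello, this is <b>A BOLD</b> bit and this is ]
        // [ link]
    }
}
```

A reluctant quantifier (`.*?</a>`) would behave similarly here; the lookahead form just makes the "anything except the closing tag" intent explicit.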

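You can clean your HTML without ruining <a> tags by using the jsoup HTML Cleaner with a Whitelist. Note that the cleaner removes tags that are not whitelisted (keeping their text) rather than escaping them:

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // called Safelist in newer jsoup releases

String unsafe = 
    "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
// addTags() is an instance method, so start from the empty Whitelist.none()
String safe = Jsoup.clean(unsafe, Whitelist.none().addTags("a").addAttributes("a", "href"));
// now roughly: <a href="http://example.com/">Link</a> (the <p> tags and onclick are dropped)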

Although I agree with the consensus that regexes were not designed to parse XML or HTML, I feel that sometimes you just haven't the time to learn, practice and implement new concepts, and that a simple regex might well suffice in your case.

If you get enough time, learn XML parsers. Otherwise, here is an untested and maybe not userproof regex proposal for your problem (escape the backslashes for Java strings):

<\s*(?:[^aA]\b|[a-zA-Z0-9]{2,})[^>]*>

Which translates into:

<\s* # less-than character with optional space
(?:  # non capturing group of
  [^aA]\b         # one character other than a or A, then a word boundary
  |              # or
  [a-zA-Z0-9]{2,} # at least two alphanumeric characters
)
[^>]*> # ... anything until the first greater-than character
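As a rough sketch of how the pattern above might be applied in Java, the code below replaces each matched tag with an entity-escaped copy of itself. The class name, method name and the `SKIP_A_TAGS` variant are assumptions for illustration, not part of the original answer. Running it suggests one caveat: because `[^aA]` also accepts `/`, the pattern as posted matches the closing `</a>` too, so the variant uses a negative lookahead on the tag name instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagEscaper {
    // The pattern above, with backslashes doubled for a Java string literal.
    public static final Pattern AS_POSTED =
            Pattern.compile("<\\s*(?:[^aA]\\b|[a-zA-Z0-9]{2,})[^>]*>");

    // Assumed variant (not from the original answer): skip any tag whose
    // name is exactly "a", whether it is an opening or a closing tag.
    public static final Pattern SKIP_A_TAGS =
            Pattern.compile("<(?!\\s*/?\\s*[aA]\\b)[^>]*>");

    // Replace every tag matched by the pattern with an entity-escaped copy.
    public static String escapeTags(Pattern tag, String html) {
        Matcher m = tag.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String escaped = m.group().replace("<", "&lt;").replace(">", "&gt;");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "this is <b>A BOLD</b> bit and "
                 + "<a href=\"www.google.com\">a google</a> link";
        System.out.println(escapeTags(AS_POSTED, s));
        // this is &lt;b&gt;A BOLD&lt;/b&gt; bit and <a href="www.google.com">a google&lt;/a&gt; link
        System.out.println(escapeTags(SKIP_A_TAGS, s));
        // this is &lt;b&gt;A BOLD&lt;/b&gt; bit and <a href="www.google.com">a google</a> link
    }
}
```

Neither pattern inspects the attributes of the surviving <a> tags, and a stray "<" without a matching ">" is left untouched, so this stays in "maybe not userproof" territory.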