How to make a Regex Pattern for HTML Simple Text?

开发者 https://www.devze.com 2023-01-30 04:34 Source: Internet
I am trying to learn Regex patterns for a class. I am making a simple HTML Lexer/Parser. I know this is not the best or most efficient way to make a Lexer/Parser, but it is only to understand Regex patterns.

So my question is: how do I create a pattern that checks that a String contains neither HTML tags (e.g. <TAG>) nor HTML entities (e.g. &ENT;)?

This is what I could come up with so far but it still does not work:

.+?(^(?:&[A-Za-z0-9#]+;)^(?:<.*?>))

EDIT: The only problem is that I can't negate the final outcome. I need a complete pattern that accomplishes this task by itself, if that is possible, even though it might not be pretty. I did not mention it before, but the pattern is essentially supposed to match any simple text in an HTML page.


You could use the expression <.+?>|&.+?; to search for a match, and then negate the result.

  • <.+?> matches a <, then any characters (one or more, as few as possible), then a >
  • &.+?; matches a &, then any characters (one or more, as few as possible), then a ;

Here is a complete example (the original answer also linked to an ideone.com demo):

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        Pattern p = Pattern.compile("<.+?>|&.+?;");
        for (String test : tests) {
            Matcher m = p.matcher(test);
            if (m.find())
                System.out.printf("\"%s\" has HTML: %s%n", test, m.group());
            else
                System.out.printf("\"%s\" has no HTML%n", test);
        }
    }
}

Output:

"hello" has no HTML
"hello <b>world</b>!" has HTML: <b>
"Hello&nbsp;world" has HTML: &nbsp;
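
If negating in code is not an option, as the EDIT in the question suggests, one possible workaround is to fold the negation into the pattern itself with a negative lookahead. The following is only a minimal sketch, not a definitive solution: the class name SimpleTextPattern and the helper isSimpleText are made up for illustration, and the DOTALL flag is an assumption so that input containing line breaks is handled too.

```java
import java.util.regex.Pattern;

public class SimpleTextPattern {
    // (?!.*(?:<.+?>|&.+?;)) fails if a tag-like or entity-like substring
    // occurs anywhere ahead; the trailing .* then consumes the whole string.
    // Assumption: DOTALL, so that '.' also matches line terminators.
    private static final Pattern SIMPLE_TEXT =
            Pattern.compile("(?!.*(?:<.+?>|&.+?;)).*", Pattern.DOTALL);

    // Hypothetical helper: true when the whole string is simple text.
    public static boolean isSimpleText(String s) {
        return SIMPLE_TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        for (String test : tests) {
            System.out.printf("\"%s\" is simple text: %b%n", test, isSimpleText(test));
        }
    }
}
```

Because matches() has to consume the entire string starting at position 0, the lookahead there rejects the string as soon as a tag or entity (as defined by the expression above) appears anywhere in it. For the three test strings this should report true, false and false.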


If you're looking to match strings that do NOT follow a pattern, the simplest thing to do is to match the pattern and then negate the result of the test.

<[^>]+>|&[^;]+;

Any string that matches this pattern will have AT LEAST ONE tag (as you've defined it) or entity (as you've defined it). So the strings you want are strings that DO NOT match this pattern (they will have NO tags or entities).
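
The "match, then negate" idea can be sketched in Java roughly like this. The class name PlainTextCheck and the helper isPlainText are hypothetical names chosen for the example; only the pattern itself comes from this answer.

```java
import java.util.regex.Pattern;

public class PlainTextCheck {
    // Pattern from the answer: a tag-like or an entity-like substring.
    private static final Pattern HTML = Pattern.compile("<[^>]+>|&[^;]+;");

    // Hypothetical helper: true when no tag or entity is found anywhere in s.
    public static boolean isPlainText(String s) {
        // find() looks for the pattern anywhere in s; negating it gives the answer.
        return !HTML.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        for (String test : tests) {
            System.out.println(test + " -> " + isPlainText(test));
        }
    }
}
```

Note that find() is used rather than matches(), since the tag or entity may sit anywhere inside the string.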
