Is this not a suitable scenario for an Html parser?

Developer https://www.devze.com 2023-03-22 18:52 Source: web

I have to deal with malformed Html and Html tags inside Html attributes:

<p class="<sometag attr="something"></sometag>">   
    <a href="<someothertag></someothertag">Link</a>
</p>

I tried using HtmlAgilityPack to parse out the content but when you load the above code into an HtmlDocument, the OuterHtml outputs:

<p class="<sometag attr=" something"="">">
    <a href="<someothertag></someothertag">Link</a>
</p>

The p tag becomes malformed and the someothertag inside the href attribute of the a tag is not recognized as a node (although it's really text inside an attribute, I would like it to be recognized as a tag).

Is there something else I can use to help me parse bad Html like this?

It's not valid HTML, so I don't think you can rely on an HTML parser to parse it.

You may be asking a lot of a parser since this is probably a rare case. You may need to solve this on your own.

The major problem I see is that there are sets of double quotes within the attribute value. Is it guaranteed that the markup will always have a matching closing character for every opening? In other words, for every < will there be a > and for every opening " or ', a matching closing mark?

If that's the case, my suggestion would be to take the source of an HTML parser such as Html Agility Pack and add some functionality to the attribute parsing. Use a stack: for every opening character, push it, then read until you find another opening or closing character. If it's an opening character, push it; if it's a closing character, pop it. The tag ends when the stack is empty again.
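To make the stack idea concrete, here is a minimal sketch in Python rather than a patch to Html Agility Pack. The function name `find_tag_end` is made up for illustration, and the rule for telling an opening quote from a closing one (a quote closes only when the same quote is on top of the stack) is my own assumption, since the quote character is the same in both roles.

```python
def find_tag_end(markup, start=0):
    """Return the index just past the '>' that closes the tag opening at
    `start`, or None if the delimiters never balance.

    Hypothetical helper: it only illustrates the push/pop bookkeeping."""
    if markup[start] != "<":
        raise ValueError("expected '<' at position %d" % start)
    stack = []
    for i in range(start, len(markup)):
        ch = markup[i]
        top = stack[-1] if stack else None
        if ch == "<":
            stack.append(ch)              # a tag opens, possibly inside a quote
        elif ch == ">":
            if top == "<":
                stack.pop()               # closes the innermost open tag
                if not stack:
                    return i + 1          # outermost tag is complete
            # a '>' directly inside quotes is treated as plain text
        elif ch in "\"'":
            if top == ch:
                stack.pop()               # closes the innermost open quote
            elif top == "<":
                stack.append(ch)          # opens an attribute value
            # a different quote character inside quotes is plain text
    return None


sample = '<p class="<sometag attr="something"></sometag>">'
end = find_tag_end(sample)
print(sample[:end])  # prints the whole opening <p ...> tag
```

This only works under the guarantee asked about above. For the href example, where `</someothertag` is missing its `>`, the stack never empties and the function returns None.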

Alternatively, you could add detection for the less-than and greater-than characters in the attribute value and not recognize the end of the attribute value until all the contained tags are closed.
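A rough Python sketch of that second idea, assuming the angle brackets inside the value are balanced. The name `read_attr_value` is hypothetical and is not part of any parser's API; it just counts `<` and `>` and ignores quotes while a nested tag is still open.

```python
def read_attr_value(markup, quote_pos):
    """Given the index of an attribute's opening quote, return a tuple
    (value, index_of_closing_quote), or None if no closing quote is found
    while all nested tags are closed.

    Hypothetical helper for illustration only."""
    quote = markup[quote_pos]
    if quote not in "\"'":
        raise ValueError("expected a quote at position %d" % quote_pos)
    depth = 0  # number of nested tags currently open inside the value
    for i in range(quote_pos + 1, len(markup)):
        ch = markup[i]
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        elif ch == quote and depth == 0:
            return markup[quote_pos + 1:i], i
    return None


tag = '<p class="<sometag attr="something"></sometag>">'
result = read_attr_value(tag, tag.index('"'))
print(result[0] if result else "unbalanced")
```

The quotes around `something` are skipped because they sit inside the still-open `<sometag ...` tag, so the value ends at the quote just before the final `>`.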

One other possible solution is to modify the source markup before passing it to the parser, replacing the illegal characters in the attribute values with character entities (the ampersand-semicolon form, e.g. &lt;, &gt; and &quot;). Unfortunately, this would require doing some preliminary parsing on your part.
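As a sketch of what that preliminary parsing might look like, the Python below escapes each attribute value before the markup goes to a real parser. The regex `ATTR_OPEN` and the function `escape_attr_values` are assumptions of mine, not anything from Html Agility Pack. The regex is naive (it would also match `name="..."` patterns in ordinary text), and the code gives up at the first value whose brackets do not balance.

```python
import html
import re

# Hypothetical pattern: whitespace, an attribute name, '=', an opening quote.
ATTR_OPEN = re.compile(r"""\s[\w:-]+\s*=\s*(["'])""")


def escape_attr_values(markup):
    """Return `markup` with <, >, & and quotes inside attribute values
    replaced by character entities. Assumes nested tags are balanced."""
    out, pos = [], 0
    while True:
        m = ATTR_OPEN.search(markup, pos)
        if m is None:
            break
        quote, start = m.group(1), m.end()
        depth, end = 0, None
        for i in range(start, len(markup)):
            ch = markup[i]
            if ch == "<":
                depth += 1
            elif ch == ">" and depth > 0:
                depth -= 1
            elif ch == quote and depth == 0:
                end = i                    # closing quote of this value
                break
        if end is None:
            break                          # unbalanced: leave the rest as-is
        out.append(markup[pos:start])
        out.append(html.escape(markup[start:end], quote=True))
        pos = end                          # the closing quote is copied later
    out.append(markup[pos:])
    return "".join(out)


print(escape_attr_values('<p class="<sometag attr="something"></sometag>">'))
```

A parser that decodes entities in attribute values should then hand back the original inner markup as the value of `class`, which you could parse a second time if you want `sometag` as a node.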