libxml2 fails to handle CDATA in HTML correctly

Developer https://www.devze.com 2023-02-01 10:43 Source: web
I'm using libxml2 2.7.3 to parse HTML pages and I'm having difficulty getting it to work correctly with CDATA in HTML. Here's the code:

xmlDocPtr doc = htmlReadMemory(data, length, "", NULL, 0);
xmlBufferPtr buffer = xmlBufferCreate();
xmlNodeDump(buffer, doc, doc->children, 0, 0);
printf("%s", (char*)buffer->content);

and the HTML data:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
  <div>
    <script type="text/javascript"> 
    //<![CDATA[
      document.write('</div>');
    //]]>
    </script>
  </div>
</body></html>

The parser erroneously treats the </div> inside the quotes as a real HTML tag and prints the following error messages:

:8: HTML parser error : Unexpected end tag : script
    </script>
             ^
:9: HTML parser error : Unexpected end tag : div
  </div>
        ^

The printed result, and what I see when debugging, also indicate that parsing went wrong:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>
  <div>
    <script type="text/javascript"><![CDATA[ 
    //<![CDATA[
      document.write(']]></script></div>');
    //]]>


</body></html>

So the question is: is this a bug in libxml2, or am I doing something wrong?

Any insight or advice would be greatly appreciated. Thanks!


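In HTML, the content of a <script> element is CDATA by definition, so the <![CDATA[ marker has no effect. Under HTML 4, that content ends at the first "</" followed by a letter, so the </div> inside the string literal is read as an end tag that terminates the script.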

In short, the source document is broken.
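
To illustrate why the parser stops where it does, here is a minimal C sketch of the HTML 4 rule that CDATA element content ends at the first "</" followed by an ASCII letter. The function name cdata_content_end is hypothetical and this is not libxml2's actual implementation, only an approximation of the rule for plain NUL-terminated input.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2.
 * Returns the offset of the first "</" that is followed by an ASCII
 * letter, which is where HTML 4 CDATA content (e.g. a script body)
 * is considered to end. Returns -1 if no such sequence exists. */
static long cdata_content_end(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == '<' && s[i + 1] == '/') {
            char c = s[i + 2];
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                return (long)i;
            }
        }
    }
    return -1;
}
```

Applied to the line document.write('</div>'); this returns the offset of the "<" inside the quotes, while the escaped form '<\/div>' yields -1 because "<" is no longer directly followed by "/".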

That section would be more properly written with the slash escaped, so that no "</" sequence appears inside the script:

<script type="text/javascript"> 
  document.write('<\/div>');
</script>
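
If the script text is generated by a program, the same escaping can be applied before it is embedded in the page. The following C sketch rewrites every "</" as "<\/". The function name escape_end_tags is hypothetical, and the sketch assumes that "</" only ever occurs inside JavaScript string literals, where "\/" is equivalent to "/".

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of js in which
 * every "</" is rewritten as "<\/". The caller must free() the result.
 * Returns NULL if allocation fails. */
static char *escape_end_tags(const char *js)
{
    size_t len = strlen(js);
    /* Each "</" pair adds one byte and pairs cannot overlap,
     * so at most len / 2 extra bytes are needed. */
    char *out = malloc(len + len / 2 + 1);
    if (out == NULL) {
        return NULL;
    }
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        out[j++] = js[i];
        /* js[i + 1] is at worst the terminating NUL, so this is safe. */
        if (js[i] == '<' && js[i + 1] == '/') {
            out[j++] = '\\';
        }
    }
    out[j] = '\0';
    return out;
}
```

Passing document.write('</div>'); through this function produces document.write('<\/div>'); which matches the corrected script above.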