Does lxml parse HTML contextually?

Developer https://www.devze.com 2023-03-30 19:48 Source: web
I'm using lxml to parse HTML:

>>> from lxml.html import fromstring, tostring

It parses trailing whitespace correctly in some cases:

>>> html = """<div>some <i>text</i> </div>"""
>>> html == tostring(fromstring(html))
True

But it seems to break when encountering unknown tags (such as the blah tag below).

>>> html = """<div>some <blah>text</blah> </div>"""
>>> html == tostring(fromstring(html))
False

How can I fix it to include trailing whitespace for all tags?


This appears to be due to the behavior of libxml2 (I've removed some error messages from the version below):

>>> import libxml2
>>> print(libxml2.htmlParseDoc("""<div>some <blah>text</blah> </div>""", "UTF-8"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>some <blah>text</blah></div></body></html>

>>> print(libxml2.htmlParseDoc("""<div>some <i>text</i> </div>""", "UTF-8"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>some <i>text</i> </div></body></html>

I am still looking for a workaround. libxml2's XML parser doesn't exhibit this behavior, but I suspect it would cope much worse with broken HTML.
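
To make the comparison concrete, here is a minimal sketch (Python 3, assuming lxml is installed) that round-trips the same snippet through lxml's HTML parser and its XML parser. Whether the HTML parser keeps the space after the unknown tag may depend on the libxml2 version lxml was built against, so the script reports the result rather than assuming it. The variable names are only for illustration.

```python
from lxml import etree
from lxml.html import fromstring, tostring

html = "<div>some <blah>text</blah> </div>"

# HTML parser: the space after </blah> may or may not survive, depending on
# the underlying libxml2 build, so print the result instead of asserting it.
html_roundtrip = tostring(fromstring(html), encoding="unicode")
print("HTML parser:", repr(html_roundtrip), html_roundtrip == html)

# XML parser: the space is stored as the tail of <blah> and serialized back.
xml_roundtrip = etree.tostring(etree.fromstring(html), encoding="unicode")
print("XML parser: ", repr(xml_roundtrip), xml_roundtrip == html)
```

The XML route only works when the input is well-formed, which is the trade-off mentioned above: real-world HTML with unclosed or void tags will generally raise an error or need a more lenient parser.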


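You need to set a flag on the parser itself to remove whitespace. I've done this when parsing XML like this:

from lxml import etree

# remove_blank_text=True drops whitespace-only text nodes between elements
parser = etree.XMLParser(remove_blank_text=True)

# "input.xml" is a placeholder path; etree.parse accepts a filename directly
data = etree.parse("input.xml", parser)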
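
As a rough illustration of what that flag does, the sketch below parses the same small XML string with and without `remove_blank_text=True`. The sample document and variable names are made up for the example. Note that the option discards whitespace-only text between elements rather than preserving it, so it is likely the opposite of what the question asks for.

```python
from lxml import etree

xml = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>"

# Default parser: indentation between elements is kept as text/tail nodes.
default_tree = etree.fromstring(xml)

# remove_blank_text=True: whitespace-only nodes between elements are dropped.
stripped_tree = etree.fromstring(xml, etree.XMLParser(remove_blank_text=True))

default_out = etree.tostring(default_tree, encoding="unicode")
stripped_out = etree.tostring(stripped_tree, encoding="unicode")

print(repr(default_out))
print(repr(stripped_out))
```

With the default parser the serialized output should match the input; with the flag set, the indentation between the elements disappears.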