python xml.dom parsing problems

Developer https://www.devze.com 2023-03-31 00:06 Source: web

I am writing a program whose first step takes a URL and opens the page. It then passes the content to the xml.dom.minidom parser:

import urllib2  # Python 2; page_url is assumed to be defined elsewhere
from xml.dom.minidom import parse

page = urllib2.urlopen(page_url)
parser = parse(page)

The problem is that many pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error if a page contains <br> instead of <br />...

I tried this:

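import urllib2
from xml.dom.minidom import parseString

page = urllib2.urlopen(page_url)
data = ""
for line in page.readlines():
    # patch up one known problem at a time
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
parser = parseString(data)  # parseString, not parse: data is a string, not a file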

But this is just not a good solution.

So, is there any library that is less sensitive to mismatched tags and other errors in HTML code?
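The strictness described in the question can be reproduced without fetching any page. The following is a minimal sketch using only the standard library. The helper name is_well_formed and the sample strings are made up for illustration. It shows that minidom rejects an unclosed <br> and an HTML-only entity such as &nbsp;, while accepting the same content written as well-formed XML.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        parseString(markup)
    except ExpatError:
        return False
    return True


# Typical HTML: <br> is never closed, so the XML parser sees a mismatched tag.
html_style = "<p>first line<br>second line</p>"
# The same content written as well-formed XML.
xml_style = "<p>first line<br />second line</p>"
# &nbsp; is defined by HTML, not by XML, so it counts as an undefined entity.
html_entity = "<p>first&nbsp;line</p>"

for sample in (html_style, xml_style, html_entity):
    print(sample, "->", is_well_formed(sample))
```

Since real pages can break well-formedness in many more ways than these, string replacement before parsing tends to turn into an open-ended list of special cases.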

I prefer lxml.html. It is very robust, and lxml in general is quite fast and has very nice capabilities, including XPath support.

import lxml.html

doc = lxml.html.parse('http://example.com')
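To illustrate the robustness and XPath support mentioned above without depending on a live URL, here is a small sketch that parses a deliberately broken string with lxml.html.fromstring. The sample markup and variable names are made up for illustration, and lxml is a third-party package that has to be installed separately.

```python
import lxml.html  # third-party package: pip install lxml

# Deliberately sloppy HTML: an unclosed <br> and two <p> tags that are never closed.
broken = "<div><p>first line<br>second line<p>unclosed paragraph</div>"

# fromstring repairs the markup instead of raising an error.
root = lxml.html.fromstring(broken)

# XPath queries run against the repaired tree.
paragraphs = root.xpath(".//p")
line_breaks = root.xpath(".//br")
texts = [p.text_content() for p in paragraphs]

print(root.tag, len(paragraphs), len(line_breaks))
print(texts)
```

The parser closes the first paragraph when it meets the second <p>, so both paragraphs end up as siblings inside the <div>.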
