python xml.dom parsing problems

Developer https://www.devze.com 2023-03-31 00:06 Source: the web

I am writing a program whose first step takes a URL and opens the page. It then passes the content to the xml.dom.minidom parser:

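import urllib2  # Python 2; on Python 3 the equivalent is urllib.request
from xml.dom.minidom import parse

page = urllib2.urlopen(page_url)  # page_url holds the address to fetch
parser = parse(page)  # fails on markup that is not well-formed XML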

The problem is that many pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error on any <br> that is not written as <br />...
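The failure described above can be reproduced without any network access. This is a minimal sketch, assuming Python 3 and two made-up sample strings (WELL_FORMED and HTML_STYLE are illustrative, as is the try_parse helper); it only shows that minidom accepts the self-closed tag and rejects the bare one:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical sample markup, not taken from any real page.
WELL_FORMED = "<p>first line<br />second line</p>"
HTML_STYLE = "<p>first line<br>second line</p>"


def try_parse(markup):
    """Return True if minidom accepts the markup, False if expat rejects it."""
    try:
        parseString(markup)
        return True
    except ExpatError:
        # minidom is a strict XML parser: an unclosed <br> is a mismatched tag.
        return False


print(try_parse(WELL_FORMED))
print(try_parse(HTML_STYLE))
```

Because minidom delegates to expat, which only accepts well-formed XML, ordinary HTML from the web will keep tripping it no matter how many replace calls are chained together.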

I tried this:

import urllib2
from xml.dom.minidom import parseString

page = urllib2.urlopen(page_url)
data = ""
for line in page.readlines():
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
parser = parseString(data)  # parseString, not parse: data is a string, not a file

But this is just not a good solution.

So, is there any library that is less sensitive to mismatched tags and other errors in HTML code?


I prefer lxml.html. It's very robust, and lxml in general is quite fast and has very nice capabilities, including XPath support.

import lxml.html

doc = lxml.html.parse('http://example.com')
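To illustrate the robustness and the XPath support together, here is a small sketch that parses a deliberately broken string instead of a live URL. It assumes Python 3 with lxml installed; BROKEN_HTML is a made-up sample, and fromstring is used rather than parse because the input is a string:

```python
import lxml.html

# Hypothetical malformed markup: a bare <br>, unclosed <p> and <a>,
# and <b>/<i> tags closed in the wrong order.
BROKEN_HTML = """
<html><body>
<p>first line<br>second line
<p>a <b>bold <i>and italic</b> mess</i>
<a href="/next">next page
</body></html>
"""

# fromstring repairs the tree instead of raising on the mismatched tags.
doc = lxml.html.fromstring(BROKEN_HTML)

# XPath queries work on the repaired tree.
links = doc.xpath("//a/@href")
line_breaks = doc.xpath("//br")

print(links)
print(len(line_breaks))
```

No manual replace calls are needed here. The parser decides how to close the dangling tags, so the resulting tree may differ slightly from what a browser would build.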