I am writing a program whose first step takes a URL and opens the page. It then passes the content to the xml.dom.minidom parser:
import urllib2
from xml.dom.minidom import parse
page = urllib2.urlopen(page_url)
parser = parse(page)
The problem is that many pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error if a page contains <br> instead of <br />.
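To make the failure concrete, here is a minimal sketch using only the standard library. The helper name is_well_formed and the sample strings are made up for illustration; they are not part of the original program.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False if expat rejects it."""
    try:
        parseString(markup)
        return True
    except ExpatError:
        # minidom uses the strict expat XML parser, which refuses
        # unclosed tags such as a bare <br>.
        return False


# Hypothetical sample inputs: the same paragraph with and without a closed <br>.
strict_ok = is_well_formed("<p>one<br />two</p>")
strict_bad = is_well_formed("<p>one<br>two</p>")
```

The first string is well-formed XML and parses. The second is ordinary HTML that is not well-formed XML, so the strict parser rejects it.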
I tried the following:
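import urllib2
from xml.dom.minidom import parseString
page = urllib2.urlopen(page_url)
data = ""
# Patch up each known problem by hand before parsing
for line in page.readlines():
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
# The input is now a string, so use parseString rather than parse
parser = parseString(data)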
But this is just not a good solution.
So, is there any library that is less sensitive to mismatched tags and other errors in HTML code?
I prefer lxml.html. It's very robust, and lxml in general is quite fast and has very nice capabilities, including XPath support.
import lxml.html
doc = lxml.html.parse('http://example.com')
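As a rough sketch of how this copes with the kind of markup described in the question, the snippet below parses a deliberately broken string instead of fetching a URL. The sample HTML and variable names are invented for illustration, and lxml must be installed separately.

```python
import lxml.html

# Hypothetical sample: a bare <br> and <p> tags that are never closed.
broken_html = "<html><body><p>first<br>second<p>third &amp; fourth</body></html>"

# fromstring() uses a forgiving HTML parser, so the mismatched
# tags are repaired instead of raising an error.
doc = lxml.html.fromstring(broken_html)

# XPath works on the repaired tree.
paragraphs = doc.xpath("//p")
texts = [p.text_content() for p in paragraphs]
line_breaks = doc.xpath("//br")
```

No string replacement is needed beforehand; the parser closes the open elements itself, and the result can be queried with XPath right away.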