I am writing a program whose first step takes a URL
and opens the page. It then passes the content to the xml.dom.minidom
parser:
import urllib2
from xml.dom.minidom import parse
page = urllib2.urlopen(page_url)
parser = parse(page)
The problem is that a lot of pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error if there is any <br> that is not written as <br />.
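The failure described above can be reproduced with a minimal sketch. The helper name `try_parse` and the two sample strings are made up for illustration; only `parseString` and `ExpatError` come from the standard library.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return None if minidom accepts the markup, else the error message."""
    try:
        parseString(markup)
        return None
    except ExpatError as exc:
        # minidom is a strict XML parser, so HTML-style void tags fail here
        return str(exc)


well_formed = "<p>line<br />break</p>"  # self-closing tag: valid XML
sloppy = "<p>line<br>break</p>"         # unclosed <br>: valid HTML, invalid XML

print(try_parse(well_formed))
print(try_parse(sloppy))
```

The first call should succeed and the second should report a parse error, which is why string replacement on individual tags does not scale to real pages.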
...
I tried something like this:
import urllib2
from xml.dom.minidom import parseString
page = urllib2.urlopen(page_url)
data = ""
for line in page.readlines():
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
parser = parseString(data)
But this is just not a good solution.
So, is there any library that is less sensitive to mismatched tags and other errors in HTML code?
I prefer lxml.html; it's very robust, and lxml in general is quite fast and has very nice capabilities, including XPath support.
import lxml.html
doc = lxml.html.parse('http://example.com')
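To illustrate the robustness and XPath support mentioned above, here is a small sketch that parses deliberately broken markup from a string instead of a URL. The sample markup and variable names are invented for the example, and it assumes the third-party lxml package is installed.

```python
import lxml.html

# Deliberately sloppy HTML: a bare <br>, an unclosed <p> and an unclosed <b>
broken = "<html><body><p>first<br>second</p><p>unclosed <b>bold</body></html>"

# fromstring() parses markup held in a string and repairs the tree as it goes
doc = lxml.html.fromstring(broken)

# XPath queries work on the repaired tree
paragraphs = doc.xpath("//p")
breaks = doc.xpath("//br")

print(len(paragraphs), len(breaks))
print(doc.text_content())
```

Unlike minidom, this should not raise on the bare `<br>` or the missing closing tags; the parser closes the open elements itself.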