Getting unicode from a urllib request

Developer https://www.devze.com 2023-03-28 00:25 Source: the web
I am running the following code trying to find particular information in some HTML. I am having an encoding/decoding problem, however, that I cannot resolve.

import urllib
req = urllib.urlopen('http://securities.stanford.edu/1046/AAI00_01/')
html = req.read()
type(html)
#   <type 'str'>
html.upper().find('HTML')
#   -1
print html[0:20]
#   ??<HTML><HE
html[0:10]
#   '\xff\xfe<\x00H\x00T\x00M\x00'
req.headers['content-type']
#   'text/html'
html = html.encode('utf-8')
#   Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

What is the solution to this problem? All I need to do is scrape some information from the page using .find and regular expressions.

I am using Mac OS X and running Python 2.6.1 from within Terminal.


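If you're trying to convert the str you have into a unicode, you want html.decode, not html.encode. In Python 2, calling encode on a str first decodes it implicitly with the ascii codec, which is why you get a UnicodeDecodeError from a call to encode.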

Older, bad advice: Also, since you seem to have a BOM at the beginning there, you probably want to use 'utf_8_sig' as the encoding, which will strip the BOM on decode.
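A quick check of why 'utf_8_sig' does not help with these particular bytes. This is only a sketch: raw is a hypothetical sample copied from the html[0:10] output in the question, not the full page, and decoded_ok is a name made up for illustration.

```python
# Hypothetical sample: the leading bytes shown in the question's html[0:10] output.
raw = b'\xff\xfe<\x00H\x00T\x00M\x00'

try:
    raw.decode('utf_8_sig')
    decoded_ok = True
except UnicodeDecodeError:
    # 0xff is never a valid byte in UTF-8, so decoding fails at position 0.
    decoded_ok = False
```

The 'utf_8_sig' codec only strips the UTF-8 BOM (\xef\xbb\xbf). The \xff\xfe at the start of this data is a different BOM, so the codec falls through to plain UTF-8 decoding and fails.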

New, better advice: Given all those \x00 bytes in the output along with the \xff\xfe BOM, the encoding looks like UTF-16 (little-endian), not UTF-8. So html.decode('utf-16') should be the way to go. The 'utf-16' codec reads the BOM to pick the byte order and strips it from the result.
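A minimal sketch of the UTF-16 approach. It works on a hypothetical sample instead of the live page: raw is the byte string from the question, extended by a few bytes so that it spells out "<HTML>". The regular expression is an assumption too, there only to show that .find and re work as expected once the data is decoded.

```python
import codecs
import re

# Hypothetical sample: the bytes from the question, extended to spell "<HTML>".
raw = b'\xff\xfe<\x00H\x00T\x00M\x00L\x00>\x00'

# The data starts with the UTF-16 little-endian byte order mark.
is_utf16_le = raw.startswith(codecs.BOM_UTF16_LE)

# 'utf-16' uses the BOM to choose the byte order and drops it from the result.
text = raw.decode('utf-16')

# String searches now behave as expected.
found_at = text.upper().find('HTML')   # 1

# Regular expressions work on the decoded text as well.
match = re.search(r'<(\w+)>', text)
tag = match.group(1)                   # 'HTML'
```

With the real page, the same decode call would be applied to the result of req.read() before any searching.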
