How to get a Cyrillic string from a document

Developer (https://www.devze.com), 2023-02-13 02:11, source: web
I have the following code:

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen("http://habrahabr.ru/")
soup = BeautifulSoup(page.read())
for topic in soup.findAll(True, 'topic'):
    print topic
    print
raw_input()

There are Cyrillic words on the site, but Python displays the wrong characters.

I would be very grateful for any help with this issue.

PS.

I changed

soup = BeautifulSoup(page.read()) 

to

soup = BeautifulSoup(page.read(), fromEncoding="utf-8") 

and still no results...


The data on the HTML page is encoded in UTF-8. It appears that you are printing it to your console, where sys.stdout.encoding is cp1251. That accounts for the rubbish that you are seeing.

Here are the results of inspecting the first 8 bytes of the first topic, using IDLE:

>>> raw = '\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe'
>>> print raw.decode('utf8')
Алго
>>> print raw.decode('cp1251')
РђР»РіРѕ
>>> 
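
For anyone reading this on Python 3, here is a small sketch of the same experiment. It assumes nothing beyond the eight bytes quoted above; the variable names are mine, and `ascii()` is used only so the output does not depend on your own console encoding.

```python
# Python 3 sketch: the same eight bytes, decoded two different ways.
raw = b"\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe"

# Correct interpretation: the page is UTF-8, so every two bytes form one letter.
as_utf8 = raw.decode("utf-8")

# Wrong interpretation: cp1251 is a single-byte encoding, so every byte
# becomes its own character and the text turns into rubbish.
as_cp1251 = raw.decode("cp1251")

print(len(as_utf8), ascii(as_utf8))      # 4 characters
print(len(as_cp1251), ascii(as_cp1251))  # 8 characters

# The damage is reversible as long as no byte was lost along the way.
repaired = as_cp1251.encode("cp1251").decode("utf-8")
print(repaired == as_utf8)
```

The last step only works because every byte in this sample happens to be defined in cp1251; that is not guaranteed for arbitrary UTF-8 data.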


Thanks for the help.

I solved the issue with this code:

print str(topic).decode('utf8')
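
As far as I can tell, this works in Python 2 because `str(topic)` gives UTF-8 bytes and `.decode('utf8')` turns them into a unicode object, which `print` then encodes for the console. A rough Python 3 sketch of the same idea follows. `bytes_to_console_text` is a made-up helper name, not part of BeautifulSoup or the standard library, and the sample bytes stand in for real page content.

```python
import sys


def bytes_to_console_text(data, source_encoding="utf-8", console_encoding=None):
    """Decode bytes to text, then replace anything the console cannot show.

    Hypothetical helper for illustration only.
    """
    # Fall back to the real console encoding, or UTF-8 if it is unknown.
    console_encoding = console_encoding or sys.stdout.encoding or "utf-8"
    text = data.decode(source_encoding)
    # Round-trip through the console encoding; characters that do not exist
    # there become "?" instead of raising UnicodeEncodeError.
    return text.encode(console_encoding, errors="replace").decode(console_encoding)


page_bytes = "Алго".encode("utf-8")  # stand-in for bytes read from the page

# A cp1251 console can show Cyrillic, so the text survives unchanged.
print(ascii(bytes_to_console_text(page_bytes, console_encoding="cp1251")))

# A pure ASCII console cannot, so every letter is replaced.
print(ascii(bytes_to_console_text(page_bytes, console_encoding="ascii")))
```

The key point is the order: decode with the encoding of the data first, and only then worry about what the console can display.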


In Django, I solved it this way:

from django.utils.encoding import force_unicode
print ("%s" % force_unicode(topic, encoding='utf-8', strings_only=False, errors='strict'))

So you can borrow this function from Django.
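
If you would rather not pull in Django for one function, a minimal stand-in might look like the sketch below. It is written for Python 3, `to_text` is a hypothetical name of my own, and it only approximates what `force_unicode` does (it ignores `strings_only`, for example). On recent Django versions the comparable helper is, I believe, `force_str`.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as text, decoding bytes with the given encoding.

    Hypothetical helper; a rough approximation, not Django's implementation.
    """
    if isinstance(value, str):
        # Already text, nothing to do.
        return value
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes: decode them explicitly instead of guessing.
        return bytes(value).decode(encoding, errors)
    # Any other object: fall back to its string representation.
    return str(value)


print(ascii(to_text(b"\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe")))
print(to_text(42))
```

With `errors='strict'` bad input raises `UnicodeDecodeError`, which is usually what you want while tracking down an encoding problem.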
