How to get a Cyrillic string from a document?
I have the following code:
import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen("http://habrahabr.ru/")
soup = BeautifulSoup(page.read())

for topic in soup.findAll(True, 'topic'):
    print topic
    print

raw_input()
There are Cyrillic words on the site, but Python displays the wrong characters.
I would be very grateful for any help with this issue.
PS.
I changed
soup = BeautifulSoup(page.read())
to
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
and still no results...
The data on the HTML page is encoded in UTF-8. It appears that you are printing it to your console, where sys.stdout.encoding is cp1251. That accounts for the rubbish that you are seeing.
Here are the results of inspecting the first 8 bytes of the first topic, using IDLE:
>>> raw = '\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe'
>>> print raw.decode('utf8')
Алго
>>> print raw.decode('cp1251')
РђР»РіРѕ
>>>
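To make the mismatch concrete, here is a small sketch that decodes the same 8 bytes both ways and compares the results. The variable names are my own, and the bytes literal is written so that it should run on Python 2.6+ as well as Python 3; it prints lengths rather than the text itself so that the console encoding does not get in the way.

```python
# -*- coding: utf-8 -*-
# The first 8 bytes of the first topic, exactly as they arrive from the page.
raw = b'\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe'

# Correct: read the bytes as UTF-8, where each Cyrillic letter takes 2 bytes.
text = raw.decode('utf-8')

# Wrong: read each byte as one cp1251 character, which produces the rubbish.
mojibake = raw.decode('cp1251')

print(len(text))      # number of characters when decoded as UTF-8
print(len(mojibake))  # number of characters when decoded as cp1251

# The damage is reversible here: re-encoding the rubbish as cp1251
# recovers the original bytes, which can then be decoded properly.
recovered = mojibake.encode('cp1251').decode('utf-8')
print(recovered == text)
```

Twice as many characters in the second result is a typical sign that UTF-8 data is being read with a single-byte codec.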
Thanks for the help.
I solved the issue with this code:
print str(topic).decode('utf8')
In Django, I solved it this way:
from django.utils.encoding import force_unicode
print ("%s" % force_unicode(topic, encoding='utf-8', strings_only=False, errors='strict'))
So you can borrow this function from Django.
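If pulling in Django just for this feels heavy, a rough stand-in could look like the sketch below. The name to_text is hypothetical and this is not Django's implementation (force_unicode handles more cases, such as arbitrary objects); it only covers the byte-string case discussed in this thread.

```python
def to_text(value, encoding='utf-8', errors='strict'):
    """Return value as unicode text.

    Byte strings are decoded with the given encoding; anything else
    (assumed to be unicode text already) is returned unchanged.
    """
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding, errors)
    return value


# Usage: byte strings are decoded, text passes through untouched.
decoded = to_text(b'\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe')
unchanged = to_text(decoded)
print(decoded == unchanged)
```

With the code from the question, this would be called as to_text(str(topic)), which does the same thing as str(topic).decode('utf8').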