How to get a Cyrillic string from a document


Developer https://www.devze.com 2023-02-13 02:11 Source: Internet


Related topics: parsing python

How do I get a Cyrillic string from a document?

I have the following code:

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen("http://habrahabr.ru/")
soup = BeautifulSoup(page.read())
for topic in soup.findAll(True, 'topic'):
    print topic
    print
raw_input()

There are Cyrillic words on the site, but Python displays the wrong characters.

I would be very grateful for any help with this issue.

PS.

I changed

soup = BeautifulSoup(page.read())

to

soup = BeautifulSoup(page.read(), fromEncoding="utf-8")

and still got no results...

The data on the HTML page is encoded in UTF-8. It appears that you are printing it to your console, where sys.stdout.encoding is cp1251. That accounts for the rubbish that you are seeing.

Here are the results of inspecting the first 8 bytes of the first topic, using IDLE:

>>> raw = '\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe'
>>> print raw.decode('utf8')
Алго
>>> print raw.decode('cp1251')
РђР»РіРѕ
>>>
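
The same check can be sketched in Python 3, where byte strings and text are separate types. This is only an illustration and not part of the original answer; it assumes the eight bytes shown above and the two codecs named there.

```python
# Python 3 sketch: the same eight bytes decoded with two different codecs.
raw = b"\xd0\x90\xd0\xbb\xd0\xb3\xd0\xbe"

as_utf8 = raw.decode("utf-8")      # the encoding the page actually uses
as_cp1251 = raw.decode("cp1251")   # the misreading that produces the rubbish

print(as_utf8)
print(as_cp1251)

# The bytes themselves are intact: re-encoding the misread text as cp1251
# gives back the original bytes, which then decode correctly as UTF-8.
repaired = as_cp1251.encode("cp1251").decode("utf-8")
print(repaired == as_utf8)
```

The round trip at the end suggests the data is not damaged, only interpreted with the wrong codec.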

Thanks for the help.

I solved the issue with this code:

print str(topic).decode('utf8')
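
In Python 3 the equivalent step would be to decode the downloaded bytes explicitly before printing. The following is a minimal sketch, assuming the page is UTF-8 as stated above; `decode_page` and `printable` are hypothetical helper names, and the sample bytes stand in for the result of `page.read()`.

```python
import sys


def decode_page(data: bytes, encoding: str = "utf-8") -> str:
    """Decode raw page bytes to text (hypothetical helper)."""
    return data.decode(encoding)


def printable(text: str, console_encoding: str) -> str:
    """Replace characters that the console encoding cannot represent."""
    encoded = text.encode(console_encoding, errors="replace")
    return encoded.decode(console_encoding)


# Stand-in for page.read(); a real page would be fetched over the network.
sample = "Алгоритмы".encode("utf-8")

text = decode_page(sample)
# sys.stdout.encoding can be None when output is redirected, hence the fallback.
print(printable(text, sys.stdout.encoding or "utf-8"))
```

Passing the text through `printable` is optional; it only guards against a console whose encoding cannot represent Cyrillic at all.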

In Django I solved it this way:

from django.utils.encoding import force_unicode
print ("%s" % force_unicode(topic, encoding='utf-8', strings_only=False, errors='strict'))

So you can borrow this function from Django.
