Encoding/decoding works in browser, but not in terminal

Developer https://www.devze.com 2023-03-16 17:51 Source: Internet

Here's my code:

import urllib

print urllib.urlopen('http://www.indianexpress.com/news/heart-of-the-deal/811626/').read().decode('iso-8859-1')

When I view the page in Firefox, the text is displayed correctly. However, in the terminal, I see issues with character encoding.

Here are some malformed output examples:

long-term  in
Indias
no-go areas

How can I fix this?

Try this, which drops any characters that cannot be encoded as ASCII:

import urllib
url = 'http://www.indianexpress.com/news/heart-of-the-deal/811626/'
print urllib.urlopen(url).read().decode('iso-8859-1').encode('ascii','ignore')

You need to use the actual charset sent by the server instead of always assuming it's ISO 8859-1. Using a capable HTML parser such as Beautiful Soup can help.
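As a rough illustration of using the server's declared charset, here is a minimal sketch in Python 3 (the code in the question is Python 2). It only parses a Content-Type header value and decodes bytes accordingly; the function names and the fallback default are assumptions made for this example, not part of any answer above.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (an assumption for this sketch) when the
    header does not declare a charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(raw_bytes, header_value):
    """Decode a response body using the charset declared in the header."""
    charset = charset_from_content_type(header_value)
    # errors="replace" keeps going if the declared charset is wrong.
    return raw_bytes.decode(charset, errors="replace")


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

In Python 3, the response object returned by urllib.request.urlopen exposes the same lookup as response.headers.get_content_charset(). Note that this only helps when the server's declaration is accurate; a page that declares one charset while using another still needs to be handled separately.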

The web-page lies; it is encoded in cp1252 aka windows-1252, NOT in ISO-8859-1.

>>> import urllib
>>> guff = urllib.urlopen('http://www.indianexpress.com/news/heart-of-the-deal/811626/').read()
>>> uguff = guff.decode('latin1')
>>> baddies = set(c for c in uguff if u'\x80' <= c < u'\xa0')
>>> baddies
set([u'\x93', u'\x92', u'\x94', u'\x97'])
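The four bytes found above are unassigned control codes in ISO-8859-1 but ordinary punctuation in cp1252 (curly quotes and a long dash). The following Python 3 sketch shows the difference on a made-up byte string; the sample text is hypothetical, modeled on the malformed snippets in the question rather than copied from the page.

```python
# Hypothetical sample bytes, modeled on the malformed output in the question;
# they are not taken from the actual page.
raw = b"India\x92s \x93long-term\x94 plan \x97 no-go areas"

as_latin1 = raw.decode("latin-1")  # what the question's code effectively does
as_cp1252 = raw.decode("cp1252")   # what the page is really encoded in

# Characters that latin-1 places in the C1 control range (U+0080..U+009F),
# which terminals typically do not display.
c1_controls = sorted(c for c in set(as_latin1) if "\x80" <= c < "\xa0")

# ascii() is used so the output is printable on any terminal encoding.
print(ascii(c1_controls))
print(ascii(as_cp1252))
```

Decoded as cp1252, the same bytes become U+2019, U+201C, U+201D and U+2014, so replacing 'iso-8859-1' with 'cp1252' in the original decode call is the likely fix, assuming the terminal can display those characters.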