Python urllib2 decode chunked encoding - Q&A - Developer

Developer https://www.devze.com 2023-03-31 13:12 Source: Internet

I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the most standard way to read data over HTTP. However, when the response uses chunked Transfer-Encoding, the body starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This happens because of the chunked encoding mentioned above, and as a result my XML data becomes corrupted.

So I wonder how I can get rid of all meta-data related to the chunked encoding?

I ended up with custom xml stripping, like this:

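    end_tag = '</mytag>'
    xml_start = html_data.find('<?xml')
    if xml_start > 0:
        # Log the raw response, then drop everything before the XML declaration.
        log_user_action(req.get_host(), 'chunked data', html_data, {})
        html_data = html_data[xml_start:]
    # Search for the closing tag after trimming so the index matches the new string.
    xml_end = html_data.rfind(end_tag)
    if xml_end != -1:
        # Keep the whole closing tag and drop anything after it.
        html_data = html_data[:xml_end + len(end_tag)]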

I can't find any simpler solution.

1eb0\r\n2625\r\n are chunk-size lines: in chunked transfer encoding, each chunk is preceded by its length in hexadecimal (0x1eb0 = 7856 bytes, 0x2625 = 9765 bytes) followed by \r\n.
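For context, httplib (which urllib2 uses underneath) normally undoes chunked framing on its own, so size lines showing up in read() suggest the body was framed in a way the library did not see, for example encoded twice. That is only a guess; the post does not confirm the cause. If the raw framing really is present in the data, it can be removed by walking the size lines instead of searching for `<?xml`. The following is a minimal sketch, not a library API: the name `decode_chunked` is hypothetical, it assumes the whole body is already in memory as a byte string, and it ignores any trailer headers.

```python
def decode_chunked(raw):
    """Decode a chunked-transfer-encoded HTTP body held in a byte string.

    Hypothetical helper for illustration; assumes `raw` holds the complete body.
    """
    parts = []
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing CRLF after chunk size")
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)  # raises ValueError if not valid hex
        pos = line_end + 2
        if size == 0:
            # Zero-length chunk marks the end; trailers (if any) are ignored.
            break
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        if raw[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("missing CRLF after chunk data")
        parts.append(chunk)
        # Move past the chunk data and its terminating CRLF.
        pos += size + 2
    return b"".join(parts)


# Usage: two chunks ("Hello" and ", world") followed by the terminating chunk.
sample = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
decoded = decode_chunked(sample)
```

If the data was framed twice, as the two consecutive size lines might indicate, the function would need to be applied to its own output; whether that matches this particular server is an assumption to verify against the real response.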

You can remove everything before the <?xml declaration:

html_data = html_data[html_data.find('<?xml'):]
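One caveat with this one-liner: `find` returns -1 when the marker is absent, and slicing with `[-1:]` then keeps only the last character of the response. A small guarded variant is sketched below; the function name `strip_before_xml_decl` is hypothetical. It still removes only the leading size lines, so any chunk boundaries later in the body stay embedded in the data.

```python
def strip_before_xml_decl(data, marker=b"<?xml"):
    """Drop everything before the XML declaration; leave data as-is if absent.

    Hypothetical helper for illustration only.
    """
    idx = data.find(marker)
    if idx == -1:
        # No declaration found: return the input unchanged instead of data[-1:].
        return data
    return data[idx:]


# Usage with the prefix shown in the question.
cleaned = strip_before_xml_decl(b'1eb0\r\n2625\r\n<?xml version="1.0"?><a/>')
```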
