Python urllib2 decode chunked encoding

Developer https://www.devze.com 2023-03-31 13:12 Source: web
I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the most standard way to read data from HTTP. However, when the response uses chunked transfer encoding, the data starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This happens because of the chunked encoding mentioned above, and as a result my XML data becomes corrupted.

So I wonder how I can get rid of all meta-data related to the chunked encoding?


I ended up with custom xml stripping, like this:

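    end_tag = '</mytag>'
    xml_start = html_data.find('<?xml')
    if xml_start > 0:
        # Something (e.g. chunk-size lines) precedes the XML declaration.
        log_user_action(req.get_host(), 'chunked data', html_data, {})
        html_data = html_data[xml_start:]
    # Look for the closing tag only after trimming, so the index stays valid.
    xml_end = html_data.rfind(end_tag)
    if xml_end != -1:
        # Keep the whole closing tag and drop anything that follows it.
        html_data = html_data[:xml_end + len(end_tag)]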

Can't find any simple solution.


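1eb0\r\n2625\r\n look like chunk-size lines: in chunked transfer encoding each chunk is preceded by its length in hexadecimal followed by \r\n (here 0x1eb0 = 7856 and 0x2625 = 9765 bytes), and the body ends with a zero-length chunk.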
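
The chunk framing can also be undone by hand rather than by searching for XML markers. Below is a minimal sketch, assuming the complete raw body is already in memory as bytes and is well-formed chunked data. `decode_chunked` is a hypothetical helper name, not part of urllib2, and trailer headers after the last chunk are simply ignored.

```python
def decode_chunked(raw):
    """Reassemble a body that uses HTTP chunked transfer encoding."""
    parts = []
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # The size may be followed by ";extension" parameters; ignore them.
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)  # chunk sizes are hexadecimal
        if size == 0:
            break  # zero-length chunk marks the end of the body
        start = line_end + 2
        end = start + size
        if end > len(raw):
            raise ValueError("truncated chunk")
        parts.append(raw[start:end])
        pos = end + 2  # skip the CRLF that terminates the chunk data
    return b"".join(parts)


sample = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(decode_chunked(sample))
```

If the data really carries two layers of framing, as the two consecutive size lines in the question might suggest, a second pass over the result may be needed; that is a guess, not something the post confirms.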


You can remove everything before <?xml:

html_data = html_data[html_data.find('<?xml'):]
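
One caveat with the one-liner: `find` returns -1 when the marker is absent, and slicing from -1 keeps only the last character. A slightly more defensive sketch follows, with `trim_to_xml` as a hypothetical helper name. It only drops what precedes the XML declaration, so any chunk-size lines embedded later in the body would still be there.

```python
def trim_to_xml(data, marker=b"<?xml"):
    """Return data starting at the XML declaration, or unchanged if absent."""
    start = data.find(marker)
    # find() returns -1 when the marker is missing; leave the data alone then.
    return data if start == -1 else data[start:]


raw = b'1eb0\r\n2625\r\n<?xml version="1.0" encoding="UTF-8"?><mytag/>'
print(trim_to_xml(raw))
```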