Universal "HTTP GET html page content and recode to UTF-8" procedure

Developer https://www.devze.com 2023-03-30 10:39 Source: Internet
For some time I have been trying to solve a fairly common problem that consists of three basic steps:

  1. fetch the HTML page at a specified URL and store its content in a String
  2. detect the content encoding from either the HTML meta information or the HTTP header
  3. recode the content into UTF-8 for further processing
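The three steps can be sketched with nothing but the Ruby standard library. This is a minimal illustration, not a full solution: the helper names `fetch`, `detect_charset` and `to_utf8`, the 4096-byte scan window and the ISO-8859-1 fallback are all assumptions made for the example, and redirects, cookies and retries are left out.

```ruby
require 'net/http'
require 'uri'

# Step 1 (hypothetical helper): fetch the page and return
# [raw_body, content_type_header]. No redirects, retries or cookies.
def fetch(url)
  response = Net::HTTP.get_response(URI(url))
  [response.body, response['content-type']]
end

# Step 2 (hypothetical helper): look for a charset in the HTML meta
# information first, then in the HTTP Content-Type header.
# Returns nil when neither names one.
def detect_charset(body, content_type)
  # Scan only the first bytes, as binary, so invalid sequences cannot raise.
  head = body.byteslice(0, 4096).force_encoding(Encoding::BINARY)
  meta = head[/<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)/i, 1]
  header = content_type.to_s[/charset\s*=\s*["']?([A-Za-z0-9_.:-]+)/i, 1]
  meta || header
end

# Step 3 (hypothetical helper): recode to UTF-8, replacing bytes that
# cannot be converted. Missing or unknown charset names fall back to
# `fallback`, which is an assumption of this sketch.
def to_utf8(body, charset, fallback: 'ISO-8859-1')
  source = begin
    Encoding.find(charset || fallback)
  rescue ArgumentError
    Encoding.find(fallback)
  end
  body.dup.force_encoding(source)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
      .scrub
end
```

Chained together, this would read `body, type = fetch(url)` followed by `to_utf8(body, detect_charset(body, type))`. Pages that declare one charset and are actually served in another are still recoded incorrectly by this approach.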

In real usage, the first step is extended a little with features such as a "user-agent" instance with a cookie jar, a configurable timeout and number of GET attempts, a configurable limit on the number of requests per time frame, etc.
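As a rough sketch of the timeout and attempt-count part, again using only the standard library rather than rest-client or curb: the names `with_retries` and `get_with_timeout`, their defaults and the list of retried error classes are assumptions chosen for this example. Cookie handling and rate limiting are not covered.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: runs the block up to `attempts` times, sleeping
# `pause` seconds after each failure, and re-raises the last error once
# all attempts are used up. The block receives the attempt number.
def with_retries(attempts: 3, pause: 1.0, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts
    sleep pause
    retry
  end
end

# Hypothetical helper: GET with explicit connect and read timeouts,
# retried on timeouts and low-level network errors.
def get_with_timeout(url, timeout: 10, attempts: 3)
  uri = URI(url)
  retryable = [Timeout::Error, SystemCallError, SocketError]
  with_retries(attempts: attempts, on: retryable) do
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: timeout,
                    read_timeout: timeout) do |http|
      http.get(uri.request_uri)
    end
  end
end
```

Keeping the retry loop separate from the HTTP call means the same wrapper can be placed around a curb or rest-client request.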

I implemented a rest-client wrapper, but I ran into several problems:

  • class-global RestClient.proxy settings conflict with other libraries, e.g. couchrest (which uses rest-client itself)
  • freezing: sometimes a timeout causes the process to freeze. AFAIK several of my friends have run into the same problem with rest-client
  • redirect Location URI parsing: rest-client fails to fetch "http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte" correctly, complaining about the invalid URI '/indexnew.aspx?cidade=4,Belo Horizonte' in the Location header of the 302 response, whereas curb follows the redirect all the way through to the target page. I'm about to reimplement the wrapper using curb
  • recoding problems in the third step: I attempted to detect the encoding from the HTML page meta information and the HTTP header (in this order), but for some pages this still does not work
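For the redirect problem, one possible workaround when following redirects by hand is to percent-encode the unsafe characters in the Location value before parsing it. The sketch below uses only Ruby's `uri` library; `escape_unsafe` and `resolve_location` are hypothetical helper names, and this is not a description of how rest-client or curb handle the header internally.

```ruby
require 'uri'

# Hypothetical helper: percent-encodes spaces, control characters and
# non-ASCII bytes. Every other character, including '%', is left alone,
# so escapes that are already present are not encoded twice.
def escape_unsafe(location)
  location.b.gsub(/[^!-~]/) { |byte| format('%%%02X', byte.ord) }
          .force_encoding(Encoding::UTF_8)
end

# Hypothetical helper: resolves a (possibly relative, possibly unescaped)
# Location header value against the URL that produced the redirect.
def resolve_location(base_url, location)
  URI.join(base_url, escape_unsafe(location)).to_s
end

base = 'http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte'
puts resolve_location(base, '/indexnew.aspx?cidade=4,Belo Horizonte')
# prints http://www.ofertacarioca.com.br/indexnew.aspx?cidade=4,Belo%20Horizonte
```

The space in the Location value is what makes a strict URI parser reject it. After escaping, the relative reference resolves against the original URL as usual.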

I would love to hear of some cool gem out there that handles such needs, or to get hints toward some intriguing universal solution, if there is one.


As nobody has answered, I had to implement the curb-based solution myself: curburger

Perhaps somebody will find it useful.
