Parsing HTML from a web page

Developer, https://www.devze.com, 2023-02-06 04:57. Source: the web
I have to extract some information from a web page, and reformat it for the user.

Since the web page is fairly regular, I currently use HttpClient to retrieve the HTML as a string, and I extract substrings at fixed locations to get the relevant data.

Anyhow, I'm wondering if there is a better way, maybe an HTML-aware one. How would you do it?

Cheers


Ideally, you should use a real HTML-parser. I've used Jsoup successfully in the past on Android:

http://jsoup.org/
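
To give an idea of what the HTML-aware approach looks like, here is a minimal sketch with jsoup. It assumes the jsoup jar is on the classpath. The class name `JsoupSketch`, the helper `extractText`, and the sample markup and selector are made up for illustration and would need to be adapted to the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupSketch {

    // Hypothetical helper: parses an HTML string and returns the text of
    // every element that matches the given CSS selector.
    static List<String> extractText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption); in practice this would be the
        // string already retrieved with HttpClient.
        String html = "<html><body>"
                + "<div class=\"item\"><span class=\"price\">42 EUR</span></div>"
                + "<div class=\"item\"><span class=\"price\">7 EUR</span></div>"
                + "</body></html>";

        for (String price : extractText(html, "div.item span.price")) {
            System.out.println(price);
        }
    }
}
```

Selecting by tag and class keeps working when whitespace or surrounding markup shifts, which is where fixed substring offsets tend to break. jsoup can also fetch and parse in one step with `Jsoup.connect(url).get()`, if you prefer not to download the page yourself.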


I personally like to use the Jericho parser: http://jericho.htmlparser.net/docs/index.html

It is easy to use, has plenty of examples on the project's page, and handles real-world HTML well (unclosed tags, etc.).


We've used HttpUnit to do this in the past.


jsoup is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).
