Classifying websites

Developer https://www.devze.com 2023-02-14 00:19 Source: web
I need to scrape a thousand websites that share the same structure: they all have a menu, a title, some text and a rating, much like a blog. Unfortunately, they are also coded very differently, and some are written by hand, so I cannot reuse CSS selectors, and perhaps cannot even rely on them.

I wonder how I can automatically classify them and save what is left of my hair. My first guess is to use lynx, or some other text browser, to get blocks of text and classify them according to their size.
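To make the block-size idea concrete, here is a minimal sketch that uses Python's standard `html.parser` in place of a text browser. The set of tags treated as block boundaries and the helper names (`BlockExtractor`, `text_blocks`, `longest_block`) are assumptions made for illustration, not something the sites are known to follow.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption; tune for the real sites).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "h1", "h2", "h3",
              "td", "tr", "table", "nav", "section", "article", "br"}


class BlockExtractor(HTMLParser):
    """Collects visible text, split at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Collapse whitespace and keep the block only if it is non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def text_blocks(html):
    """Return the page's text blocks in document order."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def longest_block(html):
    """Guess the main text as the longest block (a crude size heuristic)."""
    blocks = text_blocks(html)
    return max(blocks, key=len) if blocks else ""
```

Short blocks near the top of the list are likely menu entries and the longest block is likely the body text, but that is only a heuristic and will need checking against real pages.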

Do you know of a better or more sophisticated approach?

Thank you!


Look into http://code.google.com/p/boilerpipe/ to disassemble pages.

For classification, perhaps look at mahout.apache.org.


My suggestion is to split the problem into two main parts.

Write the classification part as if all of the web sites were coded identically, with all the same structure.

Then write the scraping part so that it finds the actual structure of each web site, and maps that structure to your ideal structure from the classification part.
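The two-part split could be sketched roughly as follows. Everything here is hypothetical: the `Page` fields simply mirror the menu/title/text/rating structure from the question, the rating pattern assumes ratings look like "4/5", and the mapping heuristic (longest block is the text, the block before it is the title) is just one possible guess at how to find each site's actual structure.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """The ideal structure that every site is mapped onto."""
    menu: List[str]
    title: str
    text: str
    rating: str


# Part 1: classification only ever sees the ideal structure.
def classify(page: Page, long_words: int = 300) -> str:
    # Hypothetical rule, purely for illustration.
    return "long-read" if len(page.text.split()) > long_words else "short-read"


# Assumed rating format such as "4/5" or "3.5 / 5".
RATING_RE = re.compile(r"^\d+(\.\d+)?\s*/\s*\d+$")


# Part 2: map one site's text blocks (in document order) to the ideal structure.
def map_to_page(blocks: List[str]) -> Page:
    rating = next((b for b in blocks if RATING_RE.match(b)), "")
    rest = [b for b in blocks if b != rating]
    if not rest:
        return Page(menu=[], title="", text="", rating=rating)
    text = max(rest, key=len)          # longest block is taken as the body
    i = rest.index(text)
    title = rest[i - 1] if i > 0 else ""   # block just before the body
    menu = rest[:i - 1] if i > 0 else []   # everything before the title
    return Page(menu=menu, title=title, text=text, rating=rating)
```

With this split, a site that breaks the heuristic only needs a different mapping function; `classify` stays untouched because it depends on `Page` alone.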
