Classifying websites

Developer https://www.devze.com 2023-02-14 00:19 Source: Internet

I need to scrape a thousand websites that share the same structure: they all have a menu, a title, some text and a rating, much like a blog. Unfortunately, they are also coded very differently, and some are hand-coded, so I cannot reuse CSS selectors, and perhaps cannot even rely on them.

I wonder how I can automatically classify them and save what is left of my hair. My first guess is to use lynx, or some other text browser, to get blocks of text and classify them according to their size.
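A minimal sketch of that first guess, using Python's standard `html.parser` in place of a text browser. The tag sets, the word-count thresholds and the names `BlockExtractor`, `classify_block` and `text_blocks` are illustrative assumptions, not tuned values:

```python
from html.parser import HTMLParser

# Assumption: these tags start or end a visual "block" of text.
BLOCK_TAGS = {"p", "div", "li", "ul", "h1", "h2", "h3", "td", "nav", "article", "section"}
# Content of these tags is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collects the visible text of a page as a list of blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and keep the block only if it has text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def classify_block(text, short=5, long=25):
    """Label a block by its word count (thresholds are guesses)."""
    n = len(text.split())
    if n <= short:
        return "short"   # likely a menu entry, title or rating
    if n >= long:
        return "long"    # likely body text
    return "medium"


def text_blocks(html):
    """Return (label, text) pairs for every text block in the page."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return [(classify_block(b), b) for b in parser.blocks]
```

Size alone cannot tell a title from a menu entry, so a second signal such as the block's position on the page would probably be needed.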

Do you know of a better or more sophisticated approach?

Thank you!

Look into http://code.google.com/p/boilerpipe/ to disassemble pages.

For classification, perhaps look at Apache Mahout (mahout.apache.org).

My suggestion is to split the problem into two main parts.

Write the classification part as if all of the web sites were coded identically, with all the same structure.

Then write the scraping part so that it finds the actual structure of each web site, and maps that structure to your ideal structure from the classification part.
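The split described above could be sketched roughly as follows. Everything here is hypothetical: the `Page` fields mirror the menu/title/text/rating structure from the question, the site names and regular expressions in `PROFILES` are invented for illustration, and the rule inside `classify` is only a placeholder for whatever classification is actually needed.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """The ideal structure that every site is mapped onto."""
    menu: list
    title: str
    text: str
    rating: float


# Hypothetical per-site profiles: field name -> regex with one capture group.
# In practice these would be discovered or written per site.
PROFILES = {
    "site-a.example": {
        "menu": r'<li class="nav">(.*?)</li>',
        "title": r"<h1>(.*?)</h1>",
        "text": r'<div id="post">(.*?)</div>',
        "rating": r'<span class="stars">(\d+(?:\.\d+)?)</span>',
    },
    "site-b.example": {
        "menu": r'<a class="menu-item"[^>]*>(.*?)</a>',
        "title": r'<td class="headline">(.*?)</td>',
        "text": r'<td class="body">(.*?)</td>',
        "rating": r"Rating:\s*(\d+(?:\.\d+)?)",
    },
}


def scrape(html, profile):
    """Scraping part: map one site's actual markup to the ideal Page."""
    def first(pattern):
        match = re.search(pattern, html, re.S)
        return match.group(1).strip() if match else ""

    rating = first(profile["rating"])
    return Page(
        menu=[m.strip() for m in re.findall(profile["menu"], html, re.S)],
        title=first(profile["title"]),
        text=first(profile["text"]),
        rating=float(rating) if rating else 0.0,
    )


def classify(page):
    """Classification part: sees only Page, never site-specific markup.

    Placeholder rule; the 4.0 cutoff is an arbitrary assumption.
    """
    return "recommended" if page.rating >= 4.0 else "other"
```

With this layout, supporting a new site means adding one profile entry, and `classify` stays unchanged.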
