What techniques are there to extract a navigational menu from a web page?

Developer https://www.devze.com 2023-03-01 15:22 Source: Internet

I'm looking for a method to extract a menu used for navigation from a web page heavy with links (and probably text). The pages I'm interested in are quite plain, valid XHTML, and it's a safe assumption that the menu is somewhere at the beginning or the end of the page. But a good, general method to find exactly where it is has eluded me so far, and I hope you'll be able to help me with this.

A quick note: I'm not looking for something like Readability, which finds the main article and strips everything else, but for something to specifically find the menu. Also, the naive method of "find an element that has a lot of links as successors" doesn't work very well, as the pages I work with tend to contain pretty long lists of links.

EDIT: I need the menu to get the content of the pages linked in it (I'm building a web scraper of sorts for an Information Extraction project). Some example pages I work with:

  • http://p2.cs.berkeley.edu/
  • http://www.cs.cornell.edu/bigreddata/maybms/ (note: here I need the menu which points to publications/downloads not the sidebar navigation, but getting rid of the side bar navigation is easier using something like Readability).


I would compute the ratio of {sum of lengths of child element text in links} over {sum of lengths of child element text out of links}. If the ratio is above some threshold, and the absolute number of links is above some threshold, then you can assume that element contains a menu.
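A minimal sketch of that ratio test, assuming the page parses as well-formed XML (as valid XHTML without undeclared entities should). The function names, the sample markup, and the thresholds `min_ratio=3.0` and `min_links=3` are illustrative assumptions, not values from this answer; they would need tuning on real pages.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1].lower()

def text_lengths(elem, inside_link=False):
    """Return (chars inside <a>, chars outside <a>) for elem's subtree."""
    inside_link = inside_link or local(elem.tag) == "a"
    in_link = out_link = 0
    pieces = [elem.text]
    for child in elem:
        child_in, child_out = text_lengths(child, inside_link)
        in_link += child_in
        out_link += child_out
        pieces.append(child.tail)  # tail text belongs to elem, not to child
    own = sum(len(p.strip()) for p in pieces if p)
    if inside_link:
        in_link += own
    else:
        out_link += own
    return in_link, out_link

def find_menu_candidates(root, min_ratio=3.0, min_links=3):
    """Return (element, ratio, link_count) for elements passing both thresholds."""
    found = []
    for elem in root.iter():
        if local(elem.tag) == "a":
            continue
        n_links = sum(1 for e in elem.iter() if local(e.tag) == "a")
        if n_links < min_links:
            continue
        in_link, out_link = text_lengths(elem)
        ratio = in_link / max(out_link, 1)  # avoid division by zero
        if ratio >= min_ratio:
            found.append((elem, ratio, n_links))
    return found

# Made-up sample page: a three-link menu followed by ordinary prose.
SAMPLE = """<html><body>
<ul id="nav"><li><a href="/pubs">Publications</a></li>
<li><a href="/dl">Downloads</a></li><li><a href="/people">People</a></li></ul>
<div id="main"><p>This project studies declarative networking in some depth.
See the <a href="/paper">paper</a>.</p></div>
</body></html>"""

candidates = find_menu_candidates(ET.fromstring(SAMPLE))
candidate_ids = [elem.get("id") for elem, _, _ in candidates]
```

On this sample only the `nav` list passes both thresholds. Note that on a page dominated by links, enclosing elements such as `body` can pass as well, so in practice you would probably keep only the innermost matches.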

If that isn't enough you'd have to render the page (in a browser, or headless using a webkit library for example) to get the position on the page of the rendered elements.


As Drag0nR3b0rn mentioned, you should use the link / non-link text ratio plus common menu words as features, fed into a hand-written or trained decision tree. For crawling I would recommend HtmlUnit.
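A rough sketch of the feature-based approach with a hand-written decision tree, again assuming well-formed XHTML. The `MENU_WORDS` list, the function names, and every threshold are assumptions made up for illustration; a trained tree would learn the splits from labelled pages instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of words that often label navigation links.
MENU_WORDS = {"home", "about", "publications", "downloads", "people",
              "contact", "news", "documentation", "overview"}

def _local(tag):
    """Strip an XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1].lower()

def _chars(elem):
    """Number of non-whitespace-padded text characters in elem's subtree."""
    return sum(len(t.strip()) for t in elem.itertext())

def menu_features(elem):
    """Compute the three features for one candidate element."""
    links = [e for e in elem.iter() if _local(e.tag) == "a"]
    link_chars = sum(_chars(a) for a in links)
    non_link_chars = _chars(elem) - link_chars
    hits = sum(1 for a in links
               if "".join(a.itertext()).strip().lower() in MENU_WORDS)
    return {
        "n_links": len(links),
        "link_ratio": link_chars / max(non_link_chars, 1),
        "menu_word_hits": hits,
    }

def looks_like_menu(f):
    """Hand-written decision tree; all split values are guesses."""
    if f["n_links"] < 3:
        return False
    if f["link_ratio"] < 2.0:
        return False
    return f["menu_word_hits"] >= 2

# Made-up fragments: a real menu and a plain list of paper links.
nav = ET.fromstring(
    '<ul><li><a href="/pubs">Publications</a></li>'
    '<li><a href="/dl">Downloads</a></li>'
    '<li><a href="/people">People</a></li></ul>')
papers = ET.fromstring(
    '<ul><li><a href="/p1">Declarative Networking</a></li>'
    '<li><a href="/p2">Implementing Declarative Overlays</a></li>'
    '<li><a href="/p3">Provenance-aware Secure Networks</a></li></ul>')

nav_is_menu = looks_like_menu(menu_features(nav))
papers_is_menu = looks_like_menu(menu_features(papers))
```

The menu-word feature is what separates the two fragments here: both are almost pure link text, but only the first uses typical navigation labels, which addresses the long-lists-of-links problem the question mentions.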
