Given a URL, the URL of the webpage it appears on, the DOM of that webpage, and a list of the other URLs on the page, how can I reliably determine whether the URL is in the header or footer of the page, or in neither?
I'm using C#/.NET.
I know that no solution is perfect, since webpages are not semantically marked up and some websites deliberately obfuscate their pages, but I would like to build some logic that works for, say, 75% of webpages.
Also, are there other pieces of information that would be helpful to determine the location of the URL in the page?
I think the creative task here is to define "header" and "footer", for example "content less than x units away from the top" or "the last 200 characters on the page". Once you have settled on those definitions, you can parse the page based on those rules.
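As a rough illustration of that idea, here is a minimal sketch that classifies a URL by the character offset of its first occurrence in the raw markup. The names `PageRegion`, `LinkLocator`, `Classify`, `headerChars`, and `footerChars` are made up for this example, and the 2000-character defaults are arbitrary assumptions you would need to tune; this is one possible rule, not a definitive one.

```csharp
using System;

public enum PageRegion { Header, Footer, Neither, NotFound }

public static class LinkLocator
{
    // Classifies a URL by where its first occurrence falls in the raw markup.
    // headerChars and footerChars are hypothetical thresholds (assumptions),
    // measured in characters from the start and end of the document.
    public static PageRegion Classify(string html, string url,
                                      int headerChars = 2000, int footerChars = 2000)
    {
        // Only the first occurrence is considered; a URL that appears in both
        // the header and the footer will be reported as Header.
        int index = html.IndexOf(url, StringComparison.OrdinalIgnoreCase);
        if (index < 0) return PageRegion.NotFound;

        // On very short pages the two zones can overlap; the header rule wins.
        if (index < headerChars) return PageRegion.Header;
        if (index >= html.Length - footerChars) return PageRegion.Footer;
        return PageRegion.Neither;
    }
}
```

You would call it as `LinkLocator.Classify(pageHtml, "/terms")`, where `pageHtml` is the page source you already have. Because it is a plain substring search, it can also match a URL that is merely a prefix of a longer one, so treat the result as a first guess and tune the thresholds against pages you know.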