
Hierarchy in sites

I'm not sure if this question will have a single answer, or even a concise one-size-fits-all answer, but I thought I would ask nonetheless. The problem isn't language-specific either, but it may have some sort of pseudo-algorithm as an answer.

Basically I'm trying to learn how spiders work, and from what I can tell, no spider I've found manages hierarchy. They just list the content or the links, with no ordering.

My question is this: we can look at a site and easily determine visually which links are navigational, content-related, or external to the site. How could we automate this? How could we programmatically help a spider determine parent and child pages?

Of course the first answer would be to use the URL's directory structure. E.g. in www.stackoverflow.com/questions/spiders, spiders is a child of questions, questions is a child of the base site, and so on. But nowadays hierarchy is usually flat, with IDs being referenced in the URL.
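For the simple case where the path really does reflect hierarchy, that baseline just means stripping the last path segment to get a page's parent. A minimal sketch in Ruby (the helper name is made up for illustration):

```ruby
# Directory-structure heuristic: a page's parent is the same URL with the
# last path segment removed. Purely a sketch; it breaks down for the flat,
# ID-based URLs mentioned above.
require 'uri'

def parent_url(url)
  uri = URI.parse(url)
  segments = uri.path.split('/').reject(&:empty?)
  return nil if segments.empty?            # already at the site root
  uri.path = '/' + segments[0..-2].join('/')
  uri.to_s
end

parent_url('https://www.stackoverflow.com/questions/spiders')
# => "https://www.stackoverflow.com/questions"
parent_url('https://www.stackoverflow.com/questions')
# => "https://www.stackoverflow.com/"
```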

So far I have 2 answers to this question and would love some feedback.

1: Occurrence.

The links that occur the most across all pages would be dubbed navigational. This seems like the most promising design; I can see issues popping up with dynamic links and the like, but they seem minuscule.
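Here is a rough sketch of how the occurrence idea could look, assuming the pages have already been fetched and the nokogiri gem is used for parsing; the 80% threshold and all names are illustrative, not taken from any existing crawler:

```ruby
# Idea 1 (occurrence): links that show up on most crawled pages are treated as
# navigational chrome (menus, footers); the rest are content or external links.
require 'nokogiri'
require 'uri'

def classify_links(pages, base_url, nav_threshold: 0.8)
  site_host = URI.parse(base_url).host
  counts    = Hash.new(0)

  pages.each do |html|
    hrefs = Nokogiri::HTML(html).css('a[href]').map do |a|
      URI.join(base_url, a['href']).to_s rescue nil   # skip malformed hrefs
    end
    hrefs.compact.uniq.each { |href| counts[href] += 1 }
  end

  counts.each_with_object({}) do |(href, count), labels|
    labels[href] =
      if URI.parse(href).host != site_host
        :external
      elsif count >= nav_threshold * pages.size
        :navigational     # appears on (nearly) every page
      else
        :content
      end
  end
end
```

The dynamic-link issue would show up in the counting step: session IDs or tracking parameters make the same link look different on every page, so URLs would need normalizing before being tallied.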

2: Depth.

For example: how many times do I need to click on a site to get to a certain page? This seems doable, but if some information is advertised on the home page that actually lives at the bottom level, it would be judged a top-level page or node.
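As a rough sketch of how depth could be computed once a crawl has produced a link graph (here just a hash from page URL to the internal links found on it; everything is illustrative), a breadth-first search from the home page assigns each URL its minimum click distance. The example data also shows the caveat above: a deep page linked from the home page comes out at depth 1.

```ruby
# Idea 2 (depth): breadth-first search from the home page; a page's level in
# the hierarchy is the minimum number of clicks needed to reach it.
def page_depths(link_graph, home_url)
  depths = { home_url => 0 }
  queue  = [home_url]

  until queue.empty?
    url = queue.shift
    (link_graph[url] || []).each do |child|
      next if depths.key?(child)        # already reached by a shorter path
      depths[child] = depths[url] + 1
      queue << child
    end
  end
  depths
end

graph = {
  'https://example.com/'          => ['https://example.com/questions',
                                      'https://example.com/items/42'],
  'https://example.com/questions' => ['https://example.com/items/42']
}
page_depths(graph, 'https://example.com/')
# => {"https://example.com/"=>0, "https://example.com/questions"=>1,
#     "https://example.com/items/42"=>1}
# Note: items/42 gets depth 1 simply because the home page advertises it,
# which is exactly the distortion described above.
```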

So, has anyone got any thoughts or constructive criticism on how to make a spider judge hierarchy in links?


(If anyone is really curious, the back-end part of the spider will most likely be Ruby on Rails.)


What is your goal? If you want to crawl a smaller number of websites and extract useful data for some kind of aggregator, it's best to build a focused crawler (write a crawler for every site).

If you want to crawl millions of pages... well, then you must be very familiar with some advanced concepts from AI.

You can start with this article: http://www-ai.ijs.si/SasoDzeroski/ECEMEAML04/presentations/076-Znidarsic.pdf
