Best approach for building a custom web-crawler for finding sites with some arbitrary text in the URL?

开发者 https://www.devze.com 2023-01-17 21:08 Source: Web
I would like to find all the sites that have the keyword 'surfing waves' somewhere in their address. Very simple! But without using ANY search engine, which means writing a pure web-crawler.

The problems I will face are, I guess:

  1. It will, obviously, never stop running...
  2. It will come across lots of "garbage" sites before it even hits something that I want.
  3. It will probably run for ages until it finds the first 2000 sites...

Am I right? Or, in other words, should I even try to do it this way? I don't want to use search engines because they limit the number of results.
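The idea in the question can be made concrete with a minimal sketch. This is an assumption-laden illustration, not a production crawler: the function name `crawl` is hypothetical, `fetch` is a pluggable callable (so politeness, retries, and robots.txt handling can live elsewhere), and the `max_pages` cap is one way to address problem 1 (it never stops on its own):

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, keyword, fetch, max_pages=1000):
    """Breadth-first crawl from seed_urls, collecting URLs that contain keyword.

    fetch(url) is any callable returning the page's HTML as a string.
    max_pages bounds the run so the crawler eventually stops.
    """
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid revisits
    matches = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        if keyword in url:
            matches.append(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable or non-HTML page: skip it
        # Naive href extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href=[\'"]?([^\'" >]+)', html):
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return matches
```

Note how the sketch also exposes problems 2 and 3: every fetched page is "garbage" work unless its URL happens to match, and the match rate depends entirely on how well the seed URLs point toward the topic.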


Search engines limit the results in what sense? They exist specifically for this purpose, to find things, and you should use them. Even if you end up writing your own crawler, that crawler will need some starting points (seed URLs) to begin crawling. Maybe you can use search results from Google for that, but then again you won't end up with a better result: most of the time (and after a pretty long time) you will hit the same URLs/addresses that are already part of the search results.


Web-crawlers are resource intensive for both parties: the site being crawled and the web-crawler host itself. What you are trying to achieve is an inventory of sites whose addresses contain certain keywords, so you are really just interested in the results of a search engine. That is a very narrow use of a web-crawler's abilities.

A better approach would be to use the first few hundred pages of search results to seed your web-crawler.
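Seeding works best if the harvested result URLs are cleaned up first, so duplicates and non-HTTP links don't waste crawl budget. A small sketch of that step (the helper name `normalize_seeds` is hypothetical; the URLs shown are placeholders, not real search results):

```python
from urllib.parse import urlparse, urlunparse

def normalize_seeds(raw_urls):
    """Deduplicate and normalize seed URLs before handing them to a crawler.

    Lowercases the host, drops fragments, and skips non-HTTP(S) schemes.
    """
    seen = set()
    out = []
    for u in raw_urls:
        p = urlparse(u.strip())
        if p.scheme.lower() not in ("http", "https"):
            continue  # skip ftp:, mailto:, javascript:, etc.
        norm = urlunparse(
            (p.scheme.lower(), p.netloc.lower(), p.path or "/", "", p.query, "")
        )
        if norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out
```

Feeding the crawler a normalized, deduplicated seed list keeps the frontier from starting with trivially redundant entries.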

