Prevent Custom Web Crawler from being blocked

I am creating a new web crawler using C# to crawl some specific websites. Everything works fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried putting delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp (see the sketch after this question)
  • using multiple IP addresses (even fake IP addresses) as the crawler's client IP

Any solution would help.
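
For the first bullet, a minimal C# sketch (the URLs and delay bounds are placeholders, not from the question): send a Googlebot-style User-Agent and randomize the pause between requests. Be aware that impersonating Googlebot can itself get you blocked by checks like the ones shown further down this page.

   using System;
   using System.Net.Http;
   using System.Threading.Tasks;

   class CrawlerSketch
   {
       static async Task Main()
       {
           using var client = new HttpClient();

           // Claim a Googlebot-style identity instead of the default .NET User-Agent.
           client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
               "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

           var urls = new[] { "https://example.com/a", "https://example.com/b" }; // placeholders
           var rng = new Random();

           foreach (var url in urls)
           {
               var html = await client.GetStringAsync(url);
               Console.WriteLine($"{url}: {html.Length} bytes");

               // Randomized 2-8 second pause so requests do not arrive at a fixed rate.
               await Task.Delay(TimeSpan.FromSeconds(2 + rng.NextDouble() * 6));
           }
       }
   }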


If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through them. Then your crawler will have a randomly changing IP address.

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
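
Applied to a C# crawler, the Tor/Privoxy route is just a proxy setting on the message handler. A minimal sketch, assuming Privoxy runs locally on its default port 8118 and is chained to Tor's SOCKS listener (the target URL is a placeholder):

   using System;
   using System.Net;
   using System.Net.Http;
   using System.Threading.Tasks;

   class TorCrawlerSketch
   {
       static async Task Main()
       {
           // Privoxy listens on 127.0.0.1:8118 by default and forwards to Tor,
           // so the crawler only needs to know the local HTTP proxy endpoint.
           var handler = new HttpClientHandler
           {
               Proxy = new WebProxy("http://127.0.0.1:8118"),
               UseProxy = true
           };

           using var client = new HttpClient(handler);
           var body = await client.GetStringAsync("https://example.com/");
           Console.WriteLine($"Fetched {body.Length} bytes through Tor");
       }
   }

On .NET 6 or newer, WebProxy also accepts a socks5:// address, so you can point the handler straight at Tor's SOCKS port (typically 9050) and drop Privoxy from the chain.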


And this is how you block the fakers (just in case someone found this page while searching for how to block them).

Block that trick in Apache. The rule below returns 403 Forbidden when a request claims to be Googlebot but does not come from Google's 66.249.64.0/19 crawl range:

# Block fake Googlebots: requests that present the Googlebot User-Agent
# but do not come from Google's IP ranges. [F] => 403 Forbidden.
# Note: checking X-Forwarded-For assumes a fronting proxy sets it;
# when serving clients directly, match %{REMOTE_ADDR} instead.
RewriteCond %{HTTP:X-FORWARDED-FOR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC]
RewriteRule .* - [F,L]
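
A quick way to test the rule from a non-Google machine (example.com stands in for your own host): spoof the Googlebot User-Agent with curl and expect a 403:

   curl -sI -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://example.com/
   # HTTP/1.1 403 Forbidden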

Or the equivalent block in nginx, for completeness' sake:

   map_hash_bucket_size  1024;
   map_hash_max_size     102400;

   # Generic bot detection by User-Agent.
   map $http_user_agent $is_bot {
      default 0;
      ~(crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser) 1;
   }

   # Googlebot crawls from 66.249.64.0/19 (the same range the Apache
   # rule above matches); any other client address is "not Google".
   geo $not_google {
      default           1;
      66.249.64.0/19    0;
   }

   # $bots is 1 only when the User-Agent claims Googlebot
   # but the client address falls outside Google's range.
   map $http_user_agent $bots {
      default          0;
      ~(?i)googlebot   $not_google;
   }
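
The maps above only classify the request; nothing is blocked until $bots is actually checked. A minimal sketch of the enforcement side (server_name is a placeholder), assuming the maps sit in the http context:

   server {
      listen 80;
      server_name example.com;

      # Reject requests that claim to be Googlebot from a non-Google address.
      if ($bots) {
         return 403;
      }
   }

nginx treats both an empty string and "0" as false in an if condition, so the default 0 map value lets normal traffic through.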
