
How to ignore web crawlers?

Developer https://www.devze.com 2023-03-25 15:11 Source: web
I have a page that counts how many times it is visited by users (registered, guest, every kind of user).

So I update a field in the database every time the page is viewed. Yes, that includes quick refreshes, but I don't mind that.

Of course, when bots or crawlers scan my website they also increment this value, and I'd like to avoid that. So, is there a list of IP addresses to ignore? Or some mechanism that can help me do it?


Another way to do it is with Ajax. Most crawlers don't execute JavaScript, so a count that is triggered from a script will mostly reflect real browsers.
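
As a rough illustration of the server side of this idea, here is a minimal sketch, assuming a plain WSGI app. The `/count` endpoint, the `view_count` dictionary (a stand-in for the database field) and the inline `fetch` call are all made up for the example. Serving the page does not touch the counter; only the POST sent by the script does.

```python
# Hypothetical stand-in for the database field that stores the view count.
view_count = {"page": 0}

# The page carries a small script that reports the view with a POST.
# Clients that do not run JavaScript never send that request.
PAGE = b"""<html><body>Hello
<script>fetch("/count", {method: "POST"});</script>
</body></html>"""


def app(environ, start_response):
    """Minimal WSGI app: GET / serves the page, POST /count increments the counter."""
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "/")

    if method == "POST" and path == "/count":
        view_count["page"] += 1
        start_response("204 No Content", [])
        return [b""]

    if method == "GET" and path == "/":
        # Serving the page itself leaves the counter unchanged.
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [PAGE]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

A crawler that does run scripts would still be counted, so this works as a filter for simple bots rather than as a guarantee.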


IP addresses can change, so they are not the best way to detect whether a visitor is a bot. Instead, I suggest looking at the User-Agent string in the HTTP request headers.

Here's a list of user-agent strings: http://www.user-agents.org/ . Look specifically under the type R for "robots, crawler, spider".
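
A check along these lines might look like the sketch below. The substrings in `BOT_PATTERN` are only illustrative guesses, not taken from the list above, and `is_probable_bot` and `record_view` are hypothetical helper names. A real filter would be built from a maintained list of robot user agents.

```python
import re

# Illustrative substrings only (an assumption); replace with entries
# from a maintained list of robot user agents.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def is_probable_bot(user_agent):
    """Return True if the User-Agent value looks like a crawler or is missing."""
    if not user_agent:
        # Assumption: a request without a User-Agent is not a normal browser.
        return True
    return bool(BOT_PATTERN.search(user_agent))


def record_view(user_agent, counter):
    """Increment counter["views"] unless the visitor looks like a bot."""
    if not is_probable_bot(user_agent):
        counter["views"] += 1
    return counter["views"]
```

Here `counter` stands in for the database field. In a real handler the string would come from the request's `User-Agent` header.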


Most people don't have a static IP address. Have you set up a robots.txt to deny access to crawlers/bots? You could periodically query your log files to try to identify those that don't respect robots.txt, though the user agent is easily spoofed or changed.
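
One way to run that kind of log query is sketched here, assuming the server writes the common "combined" access log format and that the paths disallowed in robots.txt are passed in by hand. The function name `robots_violators` and the regular expression are made up for the example. It counts requests to disallowed paths per (IP, user agent) pair.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip - user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_violators(log_lines, disallowed_prefixes):
    """Count hits on disallowed paths, keyed by (ip, user_agent)."""
    prefixes = tuple(disallowed_prefixes)
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match.group("path").startswith(prefixes):
            hits[(match.group("ip"), match.group("agent"))] += 1
    return hits
```

Ordinary visitors may also follow a link into a disallowed path, so the result is a list of candidates to review rather than proof that a client is a bot.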
