Can page scraping be detected?

Developer https://www.devze.com 2023-03-24 22:42 Source: Internet
So I just created an application that does page scraping for me, and ran it. It worked fine. I was wondering whether someone would be able to figure out that the page was being scraped, whether or not they had written code for that purpose?

I wrote the code in Java, and it pretty much just checks for one line of the HTML.

I thought I'd get some insight on that before I add any more code to this program. I mean, it's useful and all, but it's almost like a hack.

It seems like the worst-case scenario from this page scraper isn't too bad, as I can just use another device later and the IP will be different. Also, it might not matter in a month. The website seems to be getting quite a lot of web traffic at the moment anyway. Whoever edits the page is probably asleep now, and the scraper really hasn't accomplished anything at this point, so this could go unnoticed.

Thanks for such fast responses. I think it might have gone unnoticed. All I did was copy a header, so just text. I guess that is probably similar to how browser copy-paste works. The page was just edited this morning, including the text I was trying to get. If they did notice anything, they haven't announced it, so all is good.


It is a hack. :)

There's no way to programmatically determine from a single request that a page is being scraped. But if your scraper becomes popular or you use it too heavily, it's quite possible to detect scraping statistically. If you see one IP grab the same page or pages at the same time every day, you can make an educated guess. The same goes for requests that arrive on any other fixed timer.

You should try to obey the robots.txt file if you can, and rate limit yourself, to be polite.
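As a rough illustration of "obey robots.txt and rate limit yourself", here is a minimal Java sketch. The class and method names are made up for this example, and the parser is deliberately simplified: it only reads `Disallow` lines in groups that apply to `User-agent: *` and ignores `Allow`, wildcards and `Crawl-delay`. The minimum gap between requests is whatever value you choose to be polite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsSketch {

    /** Collects the Disallow prefixes from groups that apply to "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Strip trailing comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                if (!lastWasUserAgent) inWildcardGroup = false;
                if (value.equals("*")) inWildcardGroup = true;
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }

    /** Simplified check: a path is blocked if it starts with any Disallow prefix. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    /** Milliseconds still to wait so that requests are at least minGapMillis apart. */
    static long waitMillis(long lastRequestMillis, long nowMillis, long minGapMillis) {
        return Math.max(0, minGapMillis - (nowMillis - lastRequestMillis));
    }
}
```

Before each fetch you would call `isAllowed` on the path and then `Thread.sleep(waitMillis(...))` with the time of your previous request.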


As a sysadmin myself, yes I'd probably notice but ONLY based on the behavior of the client. If a client had a weird user agent, I'd be suspicious. If a client browsed the site too quickly or in very predictable intervals, I'd be suspicious. If certain support files were never requested (favicon.ico, various linked in CSS and JS files), I'd be suspicious. If the client were accessing odd (not directly accessible) pages, I'd be suspicious.
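To make the "very predictable intervals" signal concrete, here is a small Java sketch of one way it could be measured from a log. It is only an assumption about how such a check might look, not what any particular server does: it takes the sorted request timestamps of one client and computes how much the gaps between them vary. The class name and the threshold are hypothetical.

```java
import java.util.List;

class IntervalRegularity {

    /**
     * Coefficient of variation (standard deviation / mean) of the gaps between
     * consecutive timestamps. Values near 0 mean almost constant spacing.
     * Expects timestamps in ascending order.
     */
    static double gapVariation(List<Long> timestampsMillis) {
        int gaps = timestampsMillis.size() - 1;
        if (gaps < 2) {
            throw new IllegalArgumentException("need at least 3 timestamps");
        }
        double sum = 0;
        for (int i = 1; i <= gaps; i++) {
            sum += timestampsMillis.get(i) - timestampsMillis.get(i - 1);
        }
        double mean = sum / gaps;
        double squared = 0;
        for (int i = 1; i <= gaps; i++) {
            double diff = (timestampsMillis.get(i) - timestampsMillis.get(i - 1)) - mean;
            squared += diff * diff;
        }
        double stdDev = Math.sqrt(squared / gaps);
        return mean == 0 ? 0 : stdDev / mean;
    }

    /** Flags a client whose request spacing varies less than the chosen threshold. */
    static boolean looksTimed(List<Long> timestampsMillis, double threshold) {
        return gapVariation(timestampsMillis) < threshold;
    }
}
```

A client hitting the page exactly once a minute gives a variation of 0, while a person clicking around gives gaps that differ widely, so the value ends up well above a small threshold such as 0.1.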

Then again I'd have to actually be looking at my logs. And this week Slashdot has been particularly interesting, so no I probably wouldn't notice.


It depends on how you have implemented this and how smart the detection tools are.

First, take care of the User-Agent. If you do not set it explicitly, it will be something like "Java/1.6.0". Browsers send their own characteristic user agents, so you can mimic browser behavior by sending the User-Agent of MSIE or Firefox, for example.
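For example, with the JDK's own `HttpURLConnection` the header can be set through `setRequestProperty`. This is just a sketch: the class name is made up, and the User-Agent string is an illustrative Firefox-style value, so copy a real one from your own browser's request headers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class UserAgentExample {

    // Illustrative value only; take a current string from your own browser.
    static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0";

    /** Prepares (but does not yet send) a request that carries a browser-like User-Agent. */
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Replaces the default "Java/<version>" agent for this request.
            conn.setRequestProperty("User-Agent", BROWSER_USER_AGENT);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing goes over the network until you call something like `getInputStream()` on the returned connection.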

Second, check the other HTTP headers. Browsers send further headers of their own. Capture a real browser request as an example and follow it, i.e. add the same headers to your requests (even if you do not need them).

A human user acts relatively slowly. A robot may act very quickly, i.e. retrieve the page and then immediately "click" a link, i.e. perform yet another HTTP GET. Put a random sleep between these operations.
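A random sleep of this kind could look like the following sketch. The class name and the bounds are assumptions; pick whatever range resembles a person reading the page.

```java
import java.util.concurrent.ThreadLocalRandom;

class RandomPause {

    /** Picks a delay between minMillis and maxMillis, both inclusive. */
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Sleeps for a random time in the given range before the next request. */
    static void pause(long minMillis, long maxMillis) throws InterruptedException {
        Thread.sleep(randomDelayMillis(minMillis, maxMillis));
    }
}
```

Calling `RandomPause.pause(2000, 8000)` between two GETs waits somewhere from 2 to 8 seconds, so the requests no longer arrive on a fixed timer.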

A browser retrieves more than the main HTML. It then downloads images and other linked resources. If you really do not want to be detected, you have to parse the HTML and download these resources too, i.e. actually behave like a browser.

And the last point. It is obviously not your case, but it is almost impossible to implement a robot that passes a CAPTCHA. This is yet another way to detect a robot.
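A rough sketch of the "parse the HTML and find the extra resources" step is below. It uses a regular expression only to keep the example short, which is an assumption made for illustration; a real HTML parser would be more robust. The class name is made up, and only `img`, `script` and `link` tags with quoted attributes are handled.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkedResources {

    // src="..." on img/script tags, or href="..." on link tags.
    private static final Pattern RESOURCE = Pattern.compile(
            "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']"
            + "|<link\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URIs of the resources a browser would also request. */
    static List<URI> find(URI pageUri, String html) {
        List<URI> found = new ArrayList<>();
        Matcher m = RESOURCE.matcher(html);
        while (m.find()) {
            String ref = m.group(1) != null ? m.group(1) : m.group(2);
            // Resolve relative references against the page's own address.
            found.add(pageUri.resolve(ref));
        }
        return found;
    }
}
```

Each returned URI can then be fetched after the main page, the way a browser would. Ordinary `<a href>` links are deliberately not included, since a browser does not follow those on its own.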

Happy hacking!


If your scraper acts like a human, there is hardly any chance of it being detected as a scraper. But if your scraper acts like a robot, it is not difficult to detect.

To act like a human you will need to:

  1. Look at what a browser sends in the HTTP headers and simulate them.

  2. Look at what a browser requests when accessing the page and request the same resources with the scraper.

  3. Time your scraper to access pages at the speed of a normal user.

  4. Send requests at random intervals instead of at fixed intervals.

  5. If possible, make requests from a dynamic IP rather than a static one.


Assuming you wrote the page scraper in the normal manner, i.e., it fetches the whole page and then does pattern recognition to extract what you want from it, all someone might be able to tell is that the page was fetched by a robot rather than a normal browser. All their logs will show is that the entire page was fetched; they can't tell what you do with it once it's in your RAM.


To the server serving the page, there's no difference between downloading a page into the browser and downloading a page and screen scraping it. Both actions just require an HTTP request; whatever you do with the resulting HTML on your end is none of the server's business.

Having said that, a sophisticated server could conceivably detect activity that doesn't look like a normal browser. For example, a browser should request any additional resources linked to from the page, something that usually doesn't happen when screen scraping. Or requests with an unusual frequency coming from a particular address. Or simply the HTTP User-Agent header.
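As a sketch of the first check, a server-side script could go through the access log and list clients that fetched pages but never any linked resource. Everything here is an assumption for illustration: the `Hit` type, the list of file extensions counted as resources, and the idea that the log has already been parsed into address and path pairs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class MissingAssetCheck {

    /** One parsed access-log entry (hypothetical shape). */
    static final class Hit {
        final String clientIp;
        final String path;

        Hit(String clientIp, String path) {
            this.clientIp = clientIp;
            this.path = path;
        }
    }

    /** Treats a few common extensions as "additional resources". */
    static boolean isAsset(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        return p.endsWith(".css") || p.endsWith(".js") || p.endsWith(".ico")
                || p.endsWith(".png") || p.endsWith(".jpg") || p.endsWith(".gif");
    }

    /** Returns the clients that requested something but never a single asset. */
    static Set<String> pageOnlyClients(List<Hit> hits) {
        Set<String> suspects = new TreeSet<>();
        Set<String> fetchedAsset = new HashSet<>();
        for (Hit hit : hits) {
            suspects.add(hit.clientIp);
            if (isAsset(hit.path)) fetchedAsset.add(hit.clientIp);
        }
        suspects.removeAll(fetchedAsset);
        return suspects;
    }
}
```

This is only a hint, not proof: a browser with a warm cache or with images disabled can look the same in the log.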

Whether a server tries to detect these things or not depends on the server; most don't.


I'd like to put my two cents in for others that may be reading this. In the past couple of years web scraping has been frowned upon more and more by the court system. I've cited a lot of examples in a blog post I recently wrote.

You should definitely abide by the robots.txt, but also look at the website's T&Cs to make sure you are not in violation. There are definitely ways that people can identify that you are web scraping, and there could be consequences for doing so. In the event that web scraping is not disallowed by the website's Terms and Conditions, then have fun, but make sure to still be conscientious. Don't destroy a web server with an out-of-control bot; throttle yourself to make sure you don't impact the server!

For full disclosure, I am a co-founder of Distil Networks and we help companies identify and stop web scrapers and bots.
