Protecting website content from crawlers

The contents of a commerce website (ASP.NET MVC) are regularly crawled by the competition. These people are programmers and use sophisticated methods to crawl the site, so identifying them by IP is not possible. Unfortunately, replacing values with images is not an option, because the site must remain readable by screen readers (JAWS).

My personal idea is to use robots.txt to prohibit crawlers from accessing one particular URL on the page. The link would be disguised as a normal item-detail link but hidden from normal users: valid items use IDs of 128 and above (e.g. http://example.com?itemId=1234), while the trap link uses an ID under 128 (e.g. http://example.com?itemId=123). If a client requests the prohibited link, show a CAPTCHA before serving anything else. A normal user would never follow the link because it is not visible, and Google does not have to crawl it because it is bogus. The problem is that a screen reader would still announce the link, and I am not sure the scheme would be effective enough to be worth implementing.
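A rough server-side sketch of that trap, assuming a hypothetical ItemController and CaptchaRequired view (neither is from the question). Note that robots.txt cannot express a numeric range such as "IDs under 128", so each trap ID would need its own Disallow line, or the bogus links would need a recognisable URL pattern:

using System.Web.Mvc;

public class ItemController : Controller
{
    public ActionResult Detail(int itemId)
    {
        // IDs below 128 exist only as hidden honeypot links, so any
        // client requesting one has ignored robots.txt: log it and
        // demand a CAPTCHA before serving more (plumbing omitted).
        if (itemId < 128)
            return View("CaptchaRequired"); // hypothetical view name

        return View(); // normal item detail page
    }
}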


Your idea could work against a few basic crawlers but would be very easy to work around. They would just need to use a proxy and do a GET on each link from a new IP.

If you allow anonymous access to your website, you can never fully protect your data. Even if you manage to block crawlers with a lot of time and effort, your competitors could simply have a human browse the site and capture the content with something like Fiddler. The best way to keep your data from competitors is not to put it on a public part of your website.

Forcing users to log in might help matters; at least then you could identify who is crawling your site and ban them. A sketch of that idea follows.
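For example, once every request carries an account, an action filter can flag accounts that request pages faster than a human plausibly could. This is only a rough sketch under assumed names (CrawlThrottleAttribute, an in-memory counter, a fixed limit of 120 requests per minute); a real implementation would persist the counts and ban repeat offenders rather than just refusing requests:

using System;
using System.Collections.Concurrent;
using System.Net;
using System.Web.Mvc;

public class CrawlThrottleAttribute : ActionFilterAttribute
{
    private const int MaxRequestsPerMinute = 120; // arbitrary threshold

    // user name -> (current window start, requests seen in that window)
    private static readonly ConcurrentDictionary<string, Tuple<DateTime, int>> Counters =
        new ConcurrentDictionary<string, Tuple<DateTime, int>>();

    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var user = filterContext.HttpContext.User;
        if (user == null || !user.Identity.IsAuthenticated)
            return; // anonymous traffic is out of scope here

        DateTime now = DateTime.UtcNow;
        var entry = Counters.AddOrUpdate(
            user.Identity.Name,
            _ => Tuple.Create(now, 1),
            (_, old) => now - old.Item1 > TimeSpan.FromMinutes(1)
                ? Tuple.Create(now, 1)               // start a new window
                : Tuple.Create(old.Item1, old.Item2 + 1));

        if (entry.Item2 > MaxRequestsPerMinute)
            filterContext.Result = new HttpStatusCodeResult(
                HttpStatusCode.Forbidden, "Request rate exceeded");
    }
}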


As mentioned, it is not really possible to hide publicly accessible data from a determined user. However, since these are automated crawlers, you could make life harder for them by altering the layout of your page regularly.

It should be possible to use different master pages to produce the same (or similar) layouts and to swap the master page in on a random basis; this would make writing an automated crawler that bit more difficult.
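One hedged sketch of that rotation, assuming a few interchangeable Razor layouts exist (the _Site1 to _Site3 names are placeholders): an action filter picks one at random per request, so the markup structure varies while the content stays the same.

using System;
using System.Web.Mvc;

public class RandomLayoutAttribute : ActionFilterAttribute
{
    // Placeholder layout paths; each renders identical content
    // with a different HTML structure.
    private static readonly string[] Layouts =
    {
        "~/Views/Shared/_Site1.cshtml",
        "~/Views/Shared/_Site2.cshtml",
        "~/Views/Shared/_Site3.cshtml"
    };

    private static readonly Random Rng = new Random();

    public override void OnResultExecuting(ResultExecutingContext filterContext)
    {
        var view = filterContext.Result as ViewResult;
        if (view != null)
        {
            // Random is not thread-safe; locking is good enough for a sketch.
            lock (Rng)
            {
                view.MasterName = Layouts[Rng.Next(Layouts.Length)];
            }
        }
    }
}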


I am about to reach the phase of protecting my own content from crawlers as well.

I am thinking of limiting what an anonymous user can see of the website and requiring registration for full functionality.

example:

public ActionResult Index()
{
    // Registered users are redirected to the full listing.
    if (User.Identity.IsAuthenticated)
        return RedirectToAction("IndexAll");

    // Show only limited content to anonymous visitors.
    return View();
}

[Authorize(Roles = "Users")]
public ActionResult IndexAll()
{
    // Show everything to authenticated users.
    return View();
}

Since you now know who your users are, you can ban any account that turns out to be a crawler.
