How Do I Make Webpage Content Private To Humans But Public To Search Engines?

When you click on my client's search result in Google (or any other search engine) you're taken to the URL you were seeking but the content presented is a standard 'Terms of Use' page.

A human visitor has to accept the 'Terms of Use' by clicking a JavaScript 'OK' link, which sets a cookie; only then are they shown the actual page content.
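The mechanism is roughly this (a simplified sketch with made-up element ids, not the site's actual code):

    // Simplified sketch of the gate: the 'OK' link sets a cookie, then reveals the content.
    function acceptTerms(): void {
      // Remember acceptance for a year.
      document.cookie = "touAccepted=1; path=/; max-age=" + 60 * 60 * 24 * 365;
      document.getElementById("tou-overlay")?.remove();                   // hide the Terms page
      document.getElementById("real-content")?.removeAttribute("hidden"); // show the real content
    }

    // Returning visitors who already accepted skip the gate.
    if (document.cookie.split("; ").includes("touAccepted=1")) {
      acceptTerms();
    }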

The problem is that this makes the page's real content private, so the search engines end up indexing the 'Terms of Use' text instead.

I'm looking for some sort of compromise that will satisfy the legal eagles and my client's SEO needs.

I'm not a developer but what I've come up with so far is ...

They could set a cookie for requests coming from known search engines (using http://www.user-agents.org/index.shtml and/or www.iplists.com/nw/) and allow those requests to access the content.
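Something like this, I imagine (a rough sketch as hypothetical Express middleware; the crawler pattern here is deliberately short, and a real list would come from the sites above):

    import express from "express";

    const app = express();

    // Deliberately short illustrative pattern; use a maintained crawler list in practice.
    const CRAWLER_UA = /googlebot|bingbot|slurp|duckduckbot/i;

    app.use((req, res, next) => {
      const ua = req.get("user-agent") ?? "";
      const accepted = (req.headers.cookie ?? "").includes("touAccepted=1");

      if (CRAWLER_UA.test(ua) || accepted) {
        return next(); // known crawler, or a human who clicked 'OK': serve the real content
      }
      res.redirect("/terms-of-use"); // everyone else gets the Terms page first
    });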

This would make the private content public to crawlers, so they'd also need to mark those pages noarchive so people can't skip accepting the 'Terms of Use' and just read the content from Google's cache.
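The noarchive part can be a meta tag in the page head or, for any content type, an X-Robots-Tag response header. Continuing the sketch above:

    // Ask engines to index these pages but not offer a cached copy.
    app.use((req, res, next) => {
      // Header equivalent of <meta name="robots" content="noarchive">.
      res.setHeader("X-Robots-Tag", "noarchive");
      next();
    });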

I believe this would let the search engines evaluate the page's content and rank it accordingly, while still requiring humans to accept the site's 'Terms of Use'. Is that right?

First time I've come across this issue ... any advice on how to implement / better alternative solutions / live examples appreciated.

[There is a vaguely similar question but I'm looking for something a bit more specific please.]

Many thanks in anticipation!


A sufficiently smart human can just masquerade as Googlebot... anything you present to a bot can be seen by a human. This was great fun with Experts Exchange - the answers were behind a paywall, but if you clicked on Google's cached link you could see them all.

So in short: it won't work.


First of all, there is no way to securely identify that a request is coming from a search engine, so anything you let a search engine see can be seen by any enterprising web surfer. The very first thing you must do is make sure the client understands this. You can build something that works for the "default" user who isn't trying to bypass your controls, but if you let a search engine see content without authentication, then regular users will be able to follow that path too (with a little ingenuity).

Second, it is not wise to assume that a search crawler will support cookies at all. If you're only targeting one particular search engine, you could test whether it does, but from what I've read most do not - it's extra housekeeping on their end, and they want to index what is freely available anyway. So you can't use a cookie to keep track of a search engine's requests.

The only way I know of to let search engines in while keeping regular viewers out is to sniff the User-Agent string of the request. Each search engine identifies itself uniquely, so you can inspect that header on each request and decide whether it may bypass the normal restrictions. But, just so you and your client know, any regular user can configure their browser to send that same User-Agent string and you'd let them right in - you can't tell the difference. Safari, for example, ships with a control for changing the User-Agent string (it's in there to help web developers with their own testing, but it can be used in other ways too).
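To illustrate how little that proves, any HTTP client can send a crawler's User-Agent string (hypothetical URL):

    // Any client can claim to be Googlebot; the server can't tell from the UA alone.
    const res = await fetch("https://example.com/private-page", {
      headers: {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      },
    });
    console.log(await res.text()); // the supposedly search-engine-only content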

In certain cases you may be able to check the requesting IP address against what you would expect from a search engine, but unless the engine publishes the IP ranges it uses and guarantees it will stick to them, this is a risky thing to rely on.
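Google is one engine that does document a stronger check: reverse-DNS the requesting IP, confirm the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch using Node's dns module:

    import { promises as dns } from "node:dns";

    // Verify a claimed Googlebot with a reverse lookup plus a forward confirmation.
    async function isRealGooglebot(ip: string): Promise<boolean> {
      try {
        const [host] = await dns.reverse(ip); // e.g. "crawl-66-249-66-1.googlebot.com"
        if (!/\.(googlebot|google)\.com$/.test(host)) return false;
        const { address } = await dns.lookup(host); // forward-confirm the hostname
        return address === ip;
      } catch {
        return false; // no PTR record or lookup failure: treat as not verified
      }
    }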


An alternative might be to scrap the "Terms of Use" landing page altogether, and do what most sites do -- have a site usage warning:

    By continuing to use this site, you agree to the
    <a href="ToU.htm">Terms of Use</a>

If it has to be really prominent, you could style it like the Stack Overflow notification bar at the top of the page.
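A minimal version of such a bar (illustrative markup, ids, and cookie name), dismissible and remembered across visits:

    // Minimal dismissible notice bar; element ids and cookie name are made up for illustration.
    function showTermsBar(): void {
      if (document.cookie.includes("touNoticeDismissed=1")) return; // already dismissed

      const bar = document.createElement("div");
      bar.id = "tou-bar";
      bar.innerHTML =
        'By continuing to use this site, you agree to the ' +
        '<a href="ToU.htm">Terms of Use</a>. <button id="tou-dismiss">OK</button>';
      document.body.prepend(bar);

      document.getElementById("tou-dismiss")?.addEventListener("click", () => {
        document.cookie = "touNoticeDismissed=1; path=/; max-age=" + 60 * 60 * 24 * 365;
        bar.remove();
      });
    }

    showTermsBar();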

