
How does hashing the entire content of a web page work?

I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions are able to hash an entire web page (which may be at least two pages long), so that two copies have the same hash output value? What is the size of a typical hash output value?

Are such hash functions able to put two similar web pages (with slight typos, etc.) in the same bucket?

Thanks,


Any hash function, given two inputs x and y such that x = y, will by definition return the same value for both. But if you want to do this kind of duplicate detection properly, you will need either:

  • a cryptographically strong hash function such as MD5, SHA-1, or SHA-512 (with 128-, 160-, and 512-bit outputs respectively), which will practically never map two different pages to the same value, so you can assume that an equal hash value means equal input, or
  • a locality-sensitive hash function if you want to detect near-duplicates.

Which one to use really depends on your needs; crypto hashes are useless for near-duplicate detection, since they are designed so that even a tiny change in the input produces a completely different output.
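
For the exact-duplicate case, here is a minimal sketch in Python using the standard-library hashlib module. The whitespace normalization is my own assumption about the pipeline; real crawlers typically also strip tags, ads, timestamps, and other boilerplate before hashing:

    import hashlib

    def page_fingerprint(html: str) -> str:
        """Return a digest that is identical for identical page content.

        Collapsing whitespace is an assumed normalization step; adjust
        (strip tags, lowercase, etc.) to match your crawler.
        """
        normalized = " ".join(html.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    # Detect duplicates by keeping a map from fingerprint to first URL seen.
    seen = {}
    for url, html in [("a.example/x", "<p>Hello</p>"), ("b.example/y", "<p>Hello</p>")]:
        fp = page_fingerprint(html)
        if fp in seen:
            print(f"{url} is a duplicate of {seen[fp]}")
        else:
            seen[fp] = url

Note that a SHA-256 digest is always 32 bytes (64 hex characters) no matter how long the page is, which answers the output-size part of the question.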


I think you’re looking for fuzzy hashing, where only portions of the document are hashed instead of the whole document at once.
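
As a minimal illustration of that idea (my own sketch, not from the answer above): hash every k-word window ("shingle") of each page and compare the resulting sets with Jaccard similarity. Production systems build SimHash or MinHash on top of the same principle:

    def shingles(text: str, k: int = 3) -> set:
        """Hash every k-word shingle of the text.

        Python's built-in hash() is used for brevity; it is only stable
        within a single process, so a real system would use a fixed hash.
        """
        words = text.lower().split()
        return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

    def jaccard(a: set, b: set) -> float:
        """Fraction of shingles the two pages share."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    page1 = "the quick brown fox jumps over the lazy dog"
    page2 = "the quick brown fox jumps over the lazy cat"
    print(f"similarity: {jaccard(shingles(page1), shingles(page2)):.2f}")  # 0.75

Two pages that differ by a single word still share most of their shingles, so their similarity stays high, whereas a cryptographic hash of the full text would differ completely.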
