I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions are able to hash an entire web page (which may be several pages long when printed), so that two copies have the same hash output value? What is the size of a typical hash output value?
Are such hash functions able to put two similar web pages (with slight typos, etc.) in the same bucket?
Thanks,
Any hash function, given two inputs x and y such that x = y, will by definition return the same value for them. But if you want to do this kind of duplicate detection properly, you will need either:
- a cryptographically strong hash function such as MD5 (128-bit output), SHA-1 (160-bit) or SHA-512 (512-bit), which will practically never map two different pages to the same value, so you can assume an equal hash value means equal input, or
- a locality sensitive hash function if you want to detect near-duplicates.
Which one to use really depends on your needs; cryptographic hashes are useless for near-duplicate detection, since they are designed to map even slightly different inputs to very different values.
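For the exact-duplicate case, a minimal sketch using Python's standard `hashlib` (the page strings and the whitespace normalization step are my own illustration, not something from a particular crawler):

```python
import hashlib

def page_fingerprint(html: str) -> str:
    # Normalize whitespace so trivial formatting differences
    # don't change the fingerprint.
    normalized = " ".join(html.split())
    # SHA-256 produces a 256-bit (64 hex character) digest.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

page_a = "<html><body>Hello world</body></html>"
page_b = "<html><body>Hello   world</body></html>"  # same content, extra spaces
page_c = "<html><body>Hello worlds</body></html>"   # one-character difference

print(page_fingerprint(page_a) == page_fingerprint(page_b))  # True
print(page_fingerprint(page_a) == page_fingerprint(page_c))  # False
```

Note how the one-character difference in `page_c` yields a completely different digest; this is exactly why a cryptographic hash can only catch exact duplicates.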
I think you're looking for fuzzy hashing, where only portions of the document are hashed instead of the whole document at once.
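One well-known locality-sensitive approach is SimHash, where each token votes on the bits of a fixed-size fingerprint, so similar documents get fingerprints that differ in only a few bits. A toy sketch (the documents and the choice of MD5 as the per-token hash are my own illustration):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: each token votes +1/-1 on every bit position."""
    votes = [0] * bits
    for token in text.lower().split():
        # Hash each token to a 64-bit integer (MD5 used here for convenience).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 iff the tokens' votes for it are positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    # Number of bit positions where the two fingerprints differ.
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"  # typo-level change
doc3 = "completely different content about web crawlers and indexing"

print(hamming(simhash(doc1), simhash(doc2)))  # small distance
print(hamming(simhash(doc1), simhash(doc3)))  # much larger distance
```

Near-duplicates can then be bucketed by treating fingerprints within some small Hamming distance as the same page; a production system would tokenize more carefully (shingles, term weights) than this sketch does.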