Hashing - What Does It Do?_问答_开发者_运维开发者技术经验分享

So I've been reading up on Hashing for my final exam, and I just cannot seem to grasp what is happening. Can someone explain Hashing to me the best way they understand it?

Sorry for th开发者_如何学运维e vague question, but I was hoping you guys would just be able to say "what hashing is" so I at least have a start, and if anyone knows any helpful ways to understand it, that would be helpful also.

Hashing is a fast heuristic for finding an object's equivalence class.

In smaller words:

Hashing is useful because it is computationally cheap. The cost is independent of the size of the equivalence class. http://en.wikipedia.org/wiki/Time_complexity#Constant_time

An equivalence class is a set of items that are equivalent. Think about string representations of numbers. You might say that "042", "42", "42.0", "84/2", "41.9..." are equivalent representations of the same underlying abstract concept. They would be in the same equivalence class. http://en.wikipedia.org/wiki/Equivalence_class

If I want to know whether "042" and "84/2" are probably equivalent, I can compute hashcodes for each (a cheap operation) and only if the hash codes are equal, then I try the more expensive check. If I want to divide representations of numbers into buckets, so that representations of the same number are in the buckets, I can choose bucket by hash code.

Hashing is heuristic, i.e. it does not always produce a perfect result, but its imperfections can be mitigated for by an algorithm designer who is aware of them. Hashing produces a hash code. Two different objects (not in the same equivalence class) can produce the same hash code but usually don't, but two objects in the same equivalence class must produce the same hash code. http://en.wikipedia.org/wiki/Heuristic#Computer_science

Hashing is summarizing.

The hash of the sequence of numbers (2,3,4,5,6) is a summary of those numbers. 20, for example, is one kind of summary that doesn't include all available bits in the original data very well. It isn't a very good summary, but it's a summary.

When the value involves more than a few bytes of data, some bits must get rejected. If you use sum and mod (to keep the sum under 2billion, for example) you tend to keep a lot of right-most bits and lose all the left-most bits.

So a good hash is fair -- it keeps and rejects bits equitably. That tends to prevent collisions.

Our simplistic "sum hash", for example, will have collisions between other sequences of numbers that also happen to have the same sum.

Firstly we should say about the problem to be solved with Hashing algorithm.

Suppose you have some data (maybe an array, or tree, or database entries). You want to find concrete element in this datastore (for example in array) as much as faster. How to do it?

When you are built this datastore, you can calculate for every item you put special value (it named HashValue). The way to calculate this value may be different. But all methods should satisfy special condition: calculated value should be unique for every item.

So, now you have an array of items and for every item you have this HashValue. How to use it? Consider you have an array of N elements. Let's put your items to this array according to their HashHalues.

Suppose, you are to answer for this question: Is the item "it1" exists in this array? To answer to it you can simply find the HashValue for "it1" (let's call it f("it1")) and look to the Array at the f("it1") position. If the element at this position is not null (and equals to our "it1" item), our answer is true. Otherwise answer is false.

Also there exist collisions problem: how to find such coolest function, which will give unique HashValues for all different elements. Actually, such function doesn't exist. There are a lot of good functions, which can give you good values.

Some example for better understanding:

Suppose, you have an array of Strings: A = {"aaa","bgb","eccc","dddsp",...}. And you are to answer for the question: does this array contain String S?

Firstle, we are to choose function for calculating HashValues. Let's take the function f, which has this meaning - for a given string it returns the length of this string (actually, it's very bad function. But I took it for easy understanding).

So, f("aaa") = 3, f("qwerty") = 6, and so on...

So now we are to calculate HashValues for every element in array A: f("aaa")=3, f("eccc")=4,...

Let's take an array for holding this items (it also named HashTable) - let's call it H (an array of strings). So, now we put our elements to this array according to their HashValues:

H[3] = "aaa", H[4] = "eccc",...

And finally, how to find given String in this array?

Suppose, you are given a String s = "eccc". f("eccc") = 4. So, if H[4] == "eccc", our answer will be true, otherwise it fill be false.

But how to avoid situations, when to elements has same HashValues? There are a lot of ways to it. One of this: each element in HashTable will contain a list of items. So, H[4] will contain all items, which HashValue equals to 4. And How to find concrete element? It's very easy: calculate fo this item HashValue and look to the list of items in HashTable[HashValue]. If one of this items equals to our searching element, answer is true, owherwise answer is false.

You take some data and deterministically, one-way calculate some fixed-length data from it that totally changes when you change the input a little bit.

a hash function applied to some data generates some new data. it is always the same for the same data. thats about it.

another constraint that is often put on it, which i think is not really true, is that the hash function requires that you cannot conclude to the original data from the hash. for me this is an own category called cryptographic or one way hashing.

there are a lot of demands on certain kinds of hash f unctions

for example that the hash is always the same length.

or that hashes are distributet randomly for any given sequence of input data.

the only important point is that its deterministic (always the same hash for the same data).

so you can use it for eample verify data integrity, validate passwords, etc.

read all about it here

http://en.wikipedia.org/wiki/Hash_function

You should read the wikipedia article first. Then come with questions on the topics you don't understand.

To put it short, quoting the article, to hash means:

to chop and mix

That is, given a value, you get another (usually) shorter value from it (chop), but that obtained value should change even if a small part of the original value changes (mix).

Lets take x % 9 as an example hashing function.

345 % 9 = 3
355 % 9 = 4
344 % 9 = 2
2345 % 9 = 5

You can see that this hashing method takes into account all parts of the input and changes if any of the digits change. That makes it a good hashing function.

On the other hand if we would take x%10. We would get

345 % 10 = 5
355 % 10 = 5
344 % 10 = 4
2345 % 10 = 5

As you can see most of the hashed values are 5. This tells us that x%10 is a worse hashing function than x%9.

Note that x%10 is still a hashing function. The identity function could be considered a hash function as well.

I'd say linut's answer is pretty good, but I'll amplify it a little. Computers are very good at accessing things in arrays. If I know that an item is in MyArray[19], I can access it directly. A hash function is a means of mapping lookup keys to array subscripts. If I have 193,372 different strings stored in an array, and I have a function which will return 0 for one of the strings, 1 for another, 2 for another, etc. up to 193,371 for the last one, I can see if any given string is in the array by running that function and then seeing if the given string matches the one in that spot in the array. Nice and easy.

Unfortunately, in practice, things are seldom so nice and tidy. While it's often possible to write a function which will map inputs to unique integers in a nice easy range (if nothing else:

  if (inputstring == thefirststring) return 0;
  if (inputstring == thesecondstring) return 1;
  if (inputstring == thethirdstring) return 1;
... up to the the193371ndstring

in many cases, a 'perfect' function would take so much effort to compute that it wouldn't be worth the effort.

What is done instead is to design a system where a hash function says where one should start looking for the data, and then some other means is used to search for the data from there. A few common approaches are:

Linear hashing -- If two items map to the same hash value, store one of them in the array slot following the one indicated by the hash code. When looking for an item, search in the indicated slot, and then next one, then the next, etc. until the item is found or one hits an empty slot. Linear hashing is simple, but it works poorly unless the table is much bigger than the number of items in it (leaving lots of empty slots). Note also that deleting items from such a hash table can be difficult, since the existence of an item may have prevented some other item from going into its indicated spot.
Double hashing -- If two items map to the same value, compute a different hash value for the second one added, and shove the second item that many slots away (if that slot is full, keep stepping by that increment until a vacant slot is found). If the hash values are independent, this approach can work well with a more-dense table. It's even harder to delete items from such a table, though, than with a linear hash table, since there's no nice way to find items which were displaced by the item to be deleted.
Nested hashing -- Each slot in the hash table contains a hash table using a different function from the main table. This can work well if the two hash functions are independent, but is apt to work very poorly if they aren't.
Chain-bucket hashing -- Each slot in the hash table holds a list of things that map to that hash value. If N things map to a particular slot, finding one of them will take time O(N). If the hash function is decent, however, most non-empty slots will contain only one item, most of those with more than that will contain only two, etc. so no slot will hold very many items.

When dealing with a fixed data set (e.g. a compiler's set of keywords), linear hashing is often good; in cases where it works badly, one can tweak the hash function so it will work well. When dealing with an unknown data set, chain bucket hashing is often the best approach. The overhead of dealing with extra lists may make it more expensive than double hashing, but it's far less likely to perform really horribly.