Hashing floating point values_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2023-04-04 18:28 出处：网络

Recently, I was curious how hash algorithms for floating points worked, so I looked at the source code for boost::hash_value.It turns out to be fairly complicated.The actual implementation loops over

Recently, I was curious how hash algorithms for floating points worked, so I looked at the source code for boost::hash_value. It turns out to be fairly complicated. The actual implementation loops over each digit in the radix and accumulates a hash value. Compared to the integer hash functions, it's much more involved.

My question is: why should a floating-point hash algorithm be any more complicated? Why not just hash the binary representation of the floating point value as i开发者_开发百科f it was an integer?

Like:

std::size_t hash_value(float f)
{
  return hash_value(*(reinterpret_cast<int*>(&f)));
}

I realize that float is not guaranteed to be the same size as int on all systems, but that sort of thing could be handled with a few template meta-programs to deduce an integral type that is the same size as float. So what is the advantage of introducing an entirely different hash function that specifically operates on floating point types?

Take a look at https://svn.boost.org/trac/boost/ticket/4038

In essence it boils down to two things:

Portability: when you take the binary representation of a float, then on some platform it could be possible that a float with a same value has multiple representations in binary. I don't know if there is actually a platform where such an issue exists, but with the complication of denormelized numbers, I'm not sure if this might actually happen.
the second issue is what you proposed, it might be that sizeof(float) does not equal sizeof(int).

I did not find anyone mentioning that the boost hash indeed avoids fewer collisions. Although I assume that separating the mantissa from the exponent might help, but the above link does not suggest that this was the driving design decision.

One reason not to just use the bit pattern is that some different bit patterns must be considered equals and thus have the same hashcode, namely

positive and negative zero
possibly denormalized numbers (I don't think this can occur with IEEE 754, but C allows other float representations).
possibly NANs (there are many, at least in IEEE 754. It actually requires NAN patterns to be considered unequal to themselves, which arguably means the cannot be meaningfully used in a hashtable)

Why are you wanting to hash floating point values? For the same reason that comparing floating point values for equality has a number of pitfalls, hashing them can have similar (negative) consequences.

However given that you really do want to do this, I suspect that the boost algorithm is complicated because when you take into account denormalized numbers different bit patterns can represent the same number (and should probably have the same hash). In IEEE 754 there are also both positive and negative 0 values that compare equal but have different bit patterns.

This probably wouldn't come up in the hashing if it wouldn't have come up otherwise in your algorithm but you still need to take care about signaling NaN values.

Additionally what would be the meaning of hashing +/- infinity and/or NaN? Specifically NaN can have many representations, should they all result in the same hash? Infinity seems to have just two representations so it seems like it would work out ok.

I imagine it's so that two machines with incompatible floating-point formats hash to the same value.