Converting a unique seed string into a random, yet deterministic, float value in Ruby_问答_开发者

I'm having a hard time with this, conceptually.

Basically, I need to accept some arbitrary unique string, and be able to convert that to a normalized float value. What the output float value is doesn't really matter, so long as the same string input always results in the same normalized float output.

So this is a hashing algorithm r开发者_如何学JAVAight? I'm familiar with SHA1 or MD5, and this seems similar to password hashing where the result is the same for the correct password. But those methods output strings of characters, I believe. And what I'm not getting is how I would turn the result of a SHA1 or MD5 into a consistent float value.

# Goal
def string_to_float(seed_string)
  # ...
end

string_to_float('abc-123') #=> 0.15789
string_to_float('abc-123') #=> 0.15789

string_to_float('def-456') #=> 0.57654
string_to_float('def-456') #=> 0.57654

So what kind of approach in Ruby can I take that would turn an arbitrary string into a random but consistent float value?

The key part that you want is a way of converting a SHA1 or MD5 hash output into a float that is both deterministic and 1-1. Here's a simple solution based on md5. This could be used as integers too.

require 'digest/md5'

class String
  def float_hash
    (Digest::MD5.hexdigest(self).to_i(16)).to_f
  end
end

puts "example_string".float_hash  # returns 1.3084281619666243e+38

This generates a hexadecimal hash, then converts it to an integer, then converts that to a float. Each step is deterministic.

Note: as pointed out by @emboss, this reduces collision resistance because a double is 8 bytes and the hash is 16 bytes. It shouldn't be a big deal though by the sounds of your application.

If security is no issue, what you are describing is in my opinion not a hash function. A hash function is a one-way function, meaning computing the hash is easy but reverting it is "hard" or, ideally, impossible.

Your requirements instead describe an injective function Given any x1, x2 in your domain X the following holds:

For all x1, x2 element of X, x1 != x2  => f(x1) != f(x2)

f(x) = x is such a function, f(x) = x² is not. In plain English: you want to have different results if your inputs are different, same results only if the inputs are the same. It is true that this also is true for secure hashes, but they additionally provide the one-way characteristics such as the property of not being able (easily) to find x if you are only given f(x), among others. As far as I understood, you don't need these security properties.

Trivially, such an injective mapping from String to Float would be given by simply interpreting the "String bytes" as "Float bytes" from now on, i.e. you interpret the bytes differently (think C:

unsigned char *bytes = "...";
double d = (double)bytes;

). But, there is as downside to this - the real trouble is that Float has a maximum precision, so you will run into an overflow situation if your strings are too long (Floats are internally represented as double values, that's 8 bytes on a 32 bit machine). So not enough space for almost any use case. Even MD5-ing your strings first doesn't solve the problem - MD5 output is already 16 bytes long.

So this could be a real problem, depending on your exact requirements. Although MD5 (or any other hash) will mess sufficiently with the input to make it as random as possible, you still cut the range of possible values from 16 bytes to effectively 8 bytes. (Note: Truncating random 16 byte output at 8 bytes is generally considered "secure" in terms of preserving the randomness. Elliptic Curve Cryptography does something similar. But as far as I know, nobody can really prove it, but neither could someone prove the contrary so far). So a collision is much more likely with your restricted Float range. By the birthday paradox, finding a collision takes sqrt(number of values in a finite range) tries. For MD5 this is 2^64, but for your scheme it's only 2^32. That's still very, very unlikely to produce a collision. It's probably something in the order of winning the lottery while at the same time being hit by a lightning. If you could live with this minimal possibility, go for it:

def string_to_float(str)
  Digest::MD5.new.digest(str).unpack('D')
end

If uniqueness is of absolute priority I would recommend to move from floats to integers. Ruby has built-in support for large integers that are not restricted by the internal constraints of a long value (that's what a Fixnum boils down to). So any arbitrary hash output could be represented as a large integer number.

Yes, you are describing a hashing algorithm. You could use a MD5 or SHA1 digest (since they just produce random bits) to generate a floating point number simply by using the String#unpack method with an argument of "G" (double-precision float, network byte order) from a digest:

require 'digest/sha1'

def string_to_float(str)
  Digest::SHA1.digest(str).unpack("G")[0]
end

string_to_float("abc-123") # => -2.86011943713676e-154
string_to_float("def-456") # => -1.13232994606094e+214
string_to_float("abc-123") # => -2.86011943713676e-154 OK!
string_to_float("def-456") # => -1.13232994606094e+214 OK!

Note that if you want the resulting floats to be in a particular range then you'll need to do some massaging.

Also note that the unpacked number doesn't use all of the bits from the digest so you might want to combine into the number of bytes for a double floating point number (although you'll have to be careful not to decrease the entropy of the hash function, if you care about that sort of thing), e.g.:

def str2float(s)
  d = Digest::SHA1.digest(s)
  x, y = d[0..9], d[10..19]
   # XOR the 1st (x) and 2nd (y) halves to use all bits.
  (0..9).map {|i| x[i] ^ y[i]}.pack("c*").unpack("G")[0]
end