开发者

Can you explain theese disturbing anomalies of md5 and modulo?

开发者 https://www.devze.com 2023-02-21 02:49 出处:网络
Okay, the title is really very subjective. But thats just what the problem is to me. The background is that I want to distribute hits of static web contents evenly about a defined number of caching s

Okay, the title is really very subjective. But thats just what the problem is to me.

The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers. Also the delivery to clients should speed up because several domains are in use and requests are not blocking each other. I also don't need a classic load balancer but generate the right links right away in my html code.

I also want to ensure that the same url always gets served by the same server.

So I just defined a little function that returns the host to use by hashing the request url and calculates the modulo by the number of servers in use:

function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg
 return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url;
}

I first had something like hex decoding and substring to prevent overflow in place, but found out that it just works fine the way above.

However my problem is that if I run the following test script:

for($i=0;$i<100000;$i++) {
  $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
  $result[$md5%2]++;
}

I expected an even distribution. meaning that $result[0] would be near the value of $result[1];

This was not the case.

Okay, this is nothing special sofar. I would have just accepted the fact that md5 is not as evenly distributed as i thought and would have gone vor another hashing algorithm like sha1 or s开发者_如何学Goomething.

But I tried to reproduce the findings and found a pattern that I cannot explain.

The ratio was always about 2/1. In fact it was the ratio was always something like 1/2.16 to 1/2.17

Sample output of some runs of the script above:

output was generated by: echo "ratio: ".$result[0]/$result[1]."\n";

ratio: 2.1757121534504
ratio: 2.1729411578062
ratio: 2.1726559360393
ratio: 2.1676895664225
ratio: 2.1667416128848
ratio: 2.1667115284133
ratio: 2.1677791605385
ratio: 2.1658969579688
ratio: 2.1668508131769
ratio: 2.1689292821741

Now the weird thing was that the ratio of sums % 2 equaling 1 and sums % 2 equaling 0 sometimes alternated!

for($j = 0; $j<100;$j++) {
    for($i=0;$i<100000;$i++) {
      $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
      $result[$md5%2]++;
    }
var_dump($result);
}

I ran the script from the command line two sperate times and aborted it after 3 runs and it produced theese two outputs:

joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [0]=>
  int(68223)
  [1]=>
  int(31777)
}
array(2) {
  [0]=>
  int(136384)
  [1]=>
  int(63616)
}
array(2) {
  [0]=>
  int(204498)
  [1]=>
  int(95502)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [1]=>
  int(31612)
  [0]=>
  int(68388)
}
array(2) {
  [1]=>
  int(63318)
  [0]=>
  int(136682)
}
array(2) {
  [1]=>
  int(94954)
  [0]=>
  int(205046)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ 

As you can see in the first one the first entry of results is always higher, in the second one its the other way round. same script.

Strange thing is that i can ONLY reproduce this behaviour when i run the script several times.

I wrote this small script to reproduce the "swapping" and generate enough measure data:

for($j = 0; $j<100;$j++) {
  for($i=0;$i<rand(1000,10000);$i++) {
    $md5 = md5(uniqid($i).microtime().rand(1,99999999));
    $result[$md5%2]++;
    }
    #var_dump($result);
    echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n";
    sleep(rand(2,5));
}

But here It only prints b, never A's. Which made me think there might be a semantic error in the script, but i didnt find any.

I am really stuck and this bothers me a lot.

So my questions:

  • Can you recommend any literature / weblinks were i could read about md5 a little bit deeper including distributions etc

  • Can you explain / reproduce the behaviour? Do I have an error here? (in fact thats very likely but i cant find it)

  • Can you recommend any other algorithm that would besuitable for my use case? It needs not be cryptographic or strong but fast, deterministic and evenly distributed.


The md5() function returns a string, not an integer.

Which means that this string will be type-casted to an integer to do the modulo ; and as this string will contain characters in the 0-9A-F range, casted to an integer, you have :

  • 1 chance out of 16 of getting a 0
  • 9 chances out of 16 of getting between 1 and 9
  • 6 chances out of 16 of getting between A and F -- which will be casted to a 0


For example, this :

$a = md5('plop1');
var_dump($a, (int)$a);

$a = md5('plop2');
var_dump($a, (int)$a);

$a = md5('plop5');
var_dump($a, (int)$a);

Will get you the following output :

string 'ac4bf0e466417336599b72a8b2f595da' (length=32)
int 0

string 'ed91c463402dd797d0718350f5bd0acd' (length=32)
int 0

string '85782b3afb04072c1bf172a6a7e6bb5e' (length=32)
int 85782

I'll let you guess the possible impact this can have on the result of the modulo operator ;-)


for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url1'));
    echo rand().'<br />';
}
for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url2'));
    echo rand().'<br />';
}

add a range to the rand function and you have your server value.


"The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers."

Many, many load balancers already do that. Squid, nginx, Varnish, HAProxy...

Please do not write your own load balancer in PHP. Please.

0

精彩评论

暂无评论...
验证码 换一张
取 消