
check 1 billion cell-phone numbers for duplicates

开发者 https://www.devze.com 2023-04-11 14:00 Source: Internet

It's an interview question:

There are 1 billion cell-phone numbers, each of which has 11 digits, and they are stored randomly in a file, for example 12345678910; the first digit has to be 1. Go through these numbers to see whether a duplicate exists: if a duplicate is found, return True, otherwise return False. Only 10 MB of memory is allowed.

Here is my solution:

Hash all the numbers into 1000 files using hash(num) % 1000; duplicates will then fall into the same file.

After the hashing, I get 1000 small files, each of which contains 1 million numbers at most, right? I'm not sure about this; I simply computed 1 billion / 1000 = 1 million.

Then, for each file, build a hash table that stores each number together with a flag representing its occurrence.

I guess it will take 5 B to represent a number: 4 B for the lower 8 digits and 1 B for the upper 3 digits. Actually 1 bit would suffice for the flag, because I only need to find out whether a duplicate exists, not how many times a number occurs. But how can I attach a 1-bit flag to each number? I'm stumped, so I chose bool for the flag, which takes 1 B. So finally, each entry in the hash table takes 5 B (number) + 1 B (flag) + 4 B (next pointer) = 10 B, and each file's hash table takes 10 MB.
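For concreteness, the two phases above can be sketched in Python. This is an illustrative toy, not the interview-scale implementation: it assumes one number per line in a text file, the file names and bucket count are arbitrary, and a real run with 1000 buckets needs the OS to allow that many simultaneously open file handles.

```python
import os
import tempfile

def has_duplicate_hashed(path, buckets=1000):
    # Phase 1: partition the numbers into bucket files by hash(num) % buckets,
    # so any pair of duplicates necessarily lands in the same bucket file.
    # (Python salts str hashes per process, but the bucketing is consistent
    # within one run, which is all we need.)
    tmpdir = tempfile.mkdtemp()
    files = [open(os.path.join(tmpdir, str(i)), "w") for i in range(buckets)]
    with open(path) as f:
        for line in f:
            num = line.strip()
            if num:
                files[hash(num) % buckets].write(num + "\n")
    for fh in files:
        fh.close()
    # Phase 2: each bucket is small enough for an in-memory set.
    try:
        for i in range(buckets):
            seen = set()
            with open(os.path.join(tmpdir, str(i))) as f:
                for line in f:
                    num = line.strip()
                    if num in seen:
                        return True
                    seen.add(num)
        return False
    finally:
        for i in range(buckets):
            os.remove(os.path.join(tmpdir, str(i)))
        os.rmdir(tmpdir)
```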

That's my stupid solution. Please give me a better one.

Thanks.

FOLLOW UP:

If there are no duplicates among these 1 billion phone numbers, then, given one phone number, how do you find out whether it is among the 1 billion? Use as little memory as possible.

I came up with 2 solutions:

  1. The phone number can be represented in 5 B as I said above. Scan through the file, reading one number at a time, and XOR the given number with the one read from the file; if the result is 0, the given number is in the file. This takes O(n) time, right?

  2. Partition the numbers into 2 smaller files according to the leading bit: numbers with a leading 1-bit go to one file, those with a leading 0-bit go to another, while counting how many numbers land in each file. If the given number falls into the 1-bit file and that file's count is not full, partition the 1-bit file again according to the second bit and check the given number recursively; if the 1-bit file is full, the given number must be in it. This takes O(log n) time, right?
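Both ideas can be sketched in Python. These are illustrative helpers, not the interview-scale implementation: the first scans a numbers file (one number per line, names hypothetical), and the second is shown on an in-memory list of distinct numbers rather than on partitioned files.

```python
def contains_xor(path, target):
    # Solution 1: linear scan. XOR each stored number with the target;
    # x ^ y == 0 exactly when x == y, so a zero result means a match.
    t = int(target)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and int(line) ^ t == 0:
                return True
    return False

def contains_partition(numbers, target, lo=0, hi=10**10):
    # Solution 2, on a list of DISTINCT integers in [lo, hi) (the 10-digit
    # remainder after the leading 1): repeatedly halve the range containing
    # the target and count the survivors. A completely full half must
    # contain the target; an empty half cannot.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if target < mid:
            numbers, hi = [n for n in numbers if n < mid], mid
        else:
            numbers, lo = [n for n in numbers if n >= mid], mid
        if len(numbers) == hi - lo:   # half is full: target must be present
            return True
        if not numbers:               # half is empty: target is absent
            return False
    return len(numbers) == 1
```

Note that the file-based version of solution 2 still reads O(n) numbers in total; the O(log n) counts only the number of partitioning rounds.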


Fastest solution (also in terms of programmer overhead :)

# Generate some 'phones'
yes 1 | perl -wne 'chomp; ++$a; print $_."$a\n";' > phones.txt

# Split phones.txt into 10MB chunks
split -C 10000000 phones.txt

# Sort each 10MB chunk with 10MB of memory
for i in x??; do sort -S 10M $i > $i.srt; echo -ne "$i.srt\0" >> merge.txt; done

# Merge the sorted chunks with 10MB of memory
sort -S 10M --files0-from=merge.txt -m > sorted.txt

# See if there are any duplicates
test -z "$(uniq -d sorted.txt)"

Check that the memory-usage constraint is met with, for example, pmap $(pidof sort).


"After the hashing, I got 1000 small files, each of which contains 1 million numbers at most, right?"

Not true: in the extreme case it's possible that one file contains all the numbers.

Create the files based on the first or last x digits of the numbers (ignoring the leading 1). When creating those files you can actually chop off those digits, because they are equal within a file. This is a lot better than hashing: although all the numbers can still end up in one file, the range of the numbers in a file is now limited, so you can fit a bitmap for it into 10 MB.

Each number can be represented by a single bit, because the only information you need is whether the number occurred previously. You don't have to store the actual numbers: the position of the bit is the number. In 10 MB you can store about 80M bits, and there are 10^10 possible values for the 10 digits after the leading 1, so you need at least 10^10 / 80M ≈ 120 files. But remember, the chopped digits must be equal within a file, so round up to a power of ten: with x = 2 you get 100 files, whose bitmaps (10^8 bits = 12.5 MB) slightly exceed the limit, so take x = 3: 1000 files with 1.25 MB bitmaps each.

Finally, you don't have to create those files, you can also scan the whole file multiple times. In this case you can have multiple bit-maps in memory as one doesn't occupy 10MB.
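A minimal sketch of the multi-pass variant in Python, assuming one number per line in a text file; the choice x = 3 and the single-bitmap-per-pass layout are illustrative (with 1.25 MB bitmaps you could hold several per pass to cut the number of scans):

```python
def has_duplicate_bitmap(path, x=3):
    # One pass per x-digit prefix (the digits right after the leading 1).
    # Each pass keeps a bitmap over the 10**(10 - x) possible suffixes;
    # with x = 3 that is 10**7 bits = 1.25 MB, well under the 10 MB limit.
    suffix_range = 10 ** (10 - x)
    for prefix in range(10 ** x):
        p = str(prefix).zfill(x)
        bitmap = bytearray(suffix_range // 8 + 1)
        with open(path) as f:
            for line in f:
                num = line.strip()
                if len(num) != 11 or num[1:1 + x] != p:
                    continue  # this number belongs to another pass
                suffix = int(num[1 + x:])
                byte, bit = divmod(suffix, 8)
                if bitmap[byte] & (1 << bit):
                    return True  # same suffix seen before in this bucket
                bitmap[byte] |= 1 << bit
    return False
```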

I strongly suggest reading this book, it starts with an almost identical example: http://www.amazon.co.uk/Programming-Pearls-ACM-Press-Bentley/dp/0201657880


No need for hashing: 10 MB = 83,886,080 bits. Ignoring the first digit, put each number into one of the ranges [0, 83886080), [83886080, 83886080 * 2), ..., [xx, 9999999999), about 9999999999 / 83886080 ≈ 120 files, then build a bit set for each range. This takes O(n) time in total.


You can follow the bitset technique. Refer to this question and answers : Find an integer not among four billion given ones


The interview question imposes only a limit on the memory used, not on the time it takes to produce an answer.

It is thus reasonable to implement it like this:

take the first number
compare it to all numbers following it
take the second number
compare it to all numbers following it
...

This takes an enormous amount of time for the billion numbers (O(n^2)), but does not use more than 10 MB of memory.
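The steps above can be sketched with two independent file cursors (assuming one number per line; the helper name is hypothetical), so only a couple of lines are ever held in memory:

```python
def has_duplicate_bruteforce(path):
    # Compare every number against all numbers that follow it: O(n^2) time,
    # but only two file handles and the current pair of lines in memory.
    with open(path) as outer:
        skip = 0
        for a in outer:
            skip += 1
            a = a.strip()
            with open(path) as inner:
                for _ in range(skip):   # advance past the outer position
                    next(inner)
                for b in inner:
                    if a and a == b.strip():
                        return True
    return False
```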


You can use a Bloom filter, which consists of an m-bit array and k hash functions. Though I am not sure how many hash functions you would need.
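A sketch of the Bloom-filter mechanics (class and parameters are illustrative, using salted SHA-256 as the k hash functions). One caveat: a Bloom filter can return false positives, so it may report a duplicate that isn't there, and with only 10 MB ≈ 8 × 10^7 bits for 10^9 items the false-positive rate would be close to 1, so this only answers the question approximately.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from k salted SHA-256 hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add_and_check(self, item):
        # True if item was POSSIBLY seen before (may be a false positive),
        # False if it is definitely new; set its bits either way.
        positions = list(self._positions(item))
        seen = all(self.bits[p // 8] >> (p % 8) & 1 for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return seen

def has_duplicate_bloom(path, m_bits=8 * 10**7, k=5):
    # Single pass over the file (one number per line).
    bf = BloomFilter(m_bits, k)
    with open(path) as f:
        for line in f:
            num = line.strip()
            if num and bf.add_and_check(num):
                return True   # "possibly a duplicate"
    return False
```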

