Google Interview Question [closed]

Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow.

Closed 11 years ago.

This was one of Google's interview questions.

What is the possible problem if a hash table grows to more than 30 GB (ignore problems like a bad hash function)?

I didn't know the answer. What would a satisfactory answer be?

Thanks


The answer partly depends on whether they are talking about a classic hash table implementation (like Hashtable / HashMap in Java) or something more sophisticated. In the end, 30 GB of memory is still quite big for a single machine/VM by today's standards.

So think about what's going on underneath:

  1. It has to read and write at arbitrary positions in some massive array.
  2. It has to grow if it fills up beyond some measure; see 'load factor' in the Java implementation (a minimal sketch of that growth follows this list).
  3. In a garbage-collected language/implementation, all the objects stored in the hash table need to be inspected by the garbage collector.
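
To make point 2 concrete, here is a minimal sketch of a textbook chained hash table (illustration only, not any particular library's internals): once size/capacity crosses the load factor, `put()` has to allocate a brand-new bucket array and rehash every entry into it.

```java
import java.util.LinkedList;

// Minimal chained hash table (illustration only). The point: crossing the load
// factor forces resize(), which allocates a whole new bucket array and walks
// every existing entry -- at 30 GB that is another huge allocation plus a pass
// over hundreds of millions of entries.
public class TinyHashTable<K, V> {
    private static final double LOAD_FACTOR = 0.75;

    private static class Entry<K, V> {
        final K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];
    private int size;

    public void put(K key, V value) {
        if ((double) (size + 1) / buckets.length > LOAD_FACTOR) {
            resize();                                   // the expensive step
        }
        int i = index(key, buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        buckets[i].add(new Entry<>(key, value));
        size++;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = new LinkedList[old.length * 2];       // allocate a whole new array
        for (LinkedList<Entry<K, V>> bucket : old) {
            if (bucket == null) continue;
            for (Entry<K, V> e : bucket) {              // rehash every single entry
                int i = index(e.key, buckets.length);
                if (buckets[i] == null) buckets[i] = new LinkedList<>();
                buckets[i].add(e);
            }
        }
    }

    private int index(Object key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }
}
```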

Which leads to the following problems:

  1. It is unclear whether even today's operating systems deal well with allocating chunks of memory in the tens of GB.
  2. For simplicity, say that half of the table's footprint is the table itself (not the key and value objects), so there is a 15 GB array inside. Every time the table grows, you need to allocate at least another 15 GB.
  3. Even if a tens-of-GB array were allocated, the OS would page some of this memory out. Since we are assuming a good hash function, we will break page caching if we use most of the data in the array, and there will be a lot of page faults.
  4. Let's say we don't use all the data evenly: some keys are used frequently and others are not. To illustrate, say each key-value pair is tiny -- 128 bytes -- and, for simplicity, that we store everything in the hash table as values. So 30 GB / 128 bytes = ~250M entries. Now say only 25K keys are commonly used (25K / 250M = 0.01%), but with a good hash function they are evenly scattered across the massive array. Even with small pages -- say 4 KB -- the 25K entries * 128 bytes = ~3.2 MB of commonly used data cost us 25K entries * 4 KB = ~100 MB of memory that has to be kept paged in... a whopping ~3% efficiency! (The arithmetic is spelled out in the sketch after this list.)
  5. In the Java world, practitioners don't recommend heap sizes larger than 4-8 GB. Sure, there's stuff like Azul, but that simply proves the point -- a typical garbage collector doesn't scale to these sizes very well.
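
Here is point 4's back-of-the-envelope arithmetic as a tiny runnable program. The inputs (128-byte entries, 25K hot keys, 4 KB pages) are the assumed figures from above, not measurements:

```java
// Paging-cost arithmetic for a sparsely "hot" 30 GB hash table (assumed figures).
public class PagingCost {
    public static void main(String[] args) {
        long tableBytes  = 30L * 1024 * 1024 * 1024;  // 30 GB table
        long entryBytes  = 128;                       // tiny key-value pairs
        long entries     = tableBytes / entryBytes;   // ~250M entries
        long hotKeys     = 25_000;                    // commonly used keys
        long pageBytes   = 4 * 1024;                  // 4 KB OS pages

        long hotDataBytes = hotKeys * entryBytes;     // ~3.2 MB we actually want
        long pagedInBytes = hotKeys * pageBytes;      // ~100 MB kept resident, since a
                                                      // good hash scatters hot keys
                                                      // roughly one per page
        System.out.printf("entries:    %,d%n", entries);
        System.out.printf("hot data:   %.1f MB%n", hotDataBytes / 1e6);
        System.out.printf("paged in:   %.1f MB%n", pagedInBytes / 1e6);
        System.out.printf("efficiency: %.1f%%%n", 100.0 * hotDataBytes / pagedInBytes);
    }
}
```

Roughly 3 MB of useful data ends up pinning about 100 MB of resident memory, i.e. ~3% efficiency.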

I agree with other posters that Google is looking for a distributed solution. But I think that, at heart, a simple hash table stops scaling beyond a point. Given the above:

  1. You would have to distribute if all the entries are accessed relatively evenly
  2. If some keys are accessed a majority of the time, using two maps (one for the most frequently used keys) can buy you a lot; see the sketch after this list.
  3. In the Java world, using specialized maps that store data off the heap can buy you performance too; see Peter Lawrey's work for example.
  4. Even simply striping the underlying array in a hashtable (like Java's ConcurrentHashMap does) can buy you major improvements when you have to grow the hashtable.
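
As an illustration of point 2, here is a hypothetical two-tier map: a small "hot" map with a simple LRU policy plus a big "cold" map for everything else. The class name and the promotion/demotion policy are made up for illustration; they are not from any particular library.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical two-tier map: a small "hot" LRU map for frequently used keys,
// plus a big "cold" map for everything else (which in a real system might live
// off-heap or on a slower tier).
public class TwoTierMap<K, V> {
    private final Map<K, V> cold = new HashMap<>();
    private final Map<K, V> hot;

    public TwoTierMap(int hotCapacity) {
        // An access-order LinkedHashMap gives us LRU eviction for the hot tier.
        this.hot = new LinkedHashMap<K, V>(hotCapacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > hotCapacity) {
                    cold.put(eldest.getKey(), eldest.getValue()); // demote to cold
                    return true;
                }
                return false;
            }
        };
    }

    public V get(K key) {
        V v = hot.get(key);
        if (v != null) return v;
        v = cold.remove(key);
        if (v != null) hot.put(key, v);   // promote on access
        return v;
    }

    public void put(K key, V value) {
        cold.remove(key);
        hot.put(key, value);
    }
}
```

The hot map stays small enough to remain resident (and cheap for the garbage collector to scan), while the cold map can be pushed off-heap or onto a slower tier.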


I think the interviewer was expecting something along the lines of a distributed hash table, since a 30 GB hash table cannot comfortably be stored on a single machine (at least in the current 64-bit world). From my personal experience, quite a few of the Google questions revolve around distributed computing, MapReduce, etc.
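
The core idea behind a distributed hash table can be sketched in a few lines: hash the key to pick which node (shard) owns it, so no single machine has to hold the whole table. The node names below are made up, and real systems typically use consistent hashing so that adding or removing a node only remaps a fraction of the keys, but the modulo version shows the idea.

```java
import java.util.List;

// Naive key partitioning across nodes (illustration only; node names are made up).
public class NaiveSharding {
    private final List<String> nodes;

    public NaiveSharding(List<String> nodes) {
        this.nodes = nodes;
    }

    /** Picks the node that owns this key; each node holds only its slice of the table. */
    public String nodeFor(Object key) {
        int h = key.hashCode() & 0x7fffffff;   // non-negative hash
        return nodes.get(h % nodes.size());
    }

    public static void main(String[] args) {
        NaiveSharding shards =
            new NaiveSharding(List.of("host-0", "host-1", "host-2", "host-3"));
        System.out.println("user:42 lives on " + shards.nodeFor("user:42"));
    }
}
```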


Some problems:

  1. Hash collisions might be one of the major problems.
  2. It will also be inefficient to make frequent disk reads if the hash table is stored on disk.