Hash Table lookup - with perfect hash, in C

I have a C-language app where I need to do table lookups.

The entries are strings, all known at the start of runtime. The table is initialized once, and then looked up many times. The table can change, but it's basically as if the app starts over. I think this means I can use a perfect hash? It's OK to consume some time for the hashtable initialization, as it happens just once.

There will be between 3 and 100,000 entries, each one unique, and I estimate that 80% of cases will have fewer than 100 entries. A simple naive lookup is "fast enough" in those cases (== no one is complaining).

However, in the cases with 10k+ entries, the lookup speed of a naive approach is unacceptable. What's a good approach for fast hashtable-based string lookup in C? Assume I do not have a 3rd-party library like Boost etc. What hash algorithm should I use? How do I decide?


Generating a perfect hash is not a simple problem; there are libraries devoted to the task. The most popular one is probably CMPH. I haven't used it myself, though, so I can't help beyond that. gperf is another tool, but it requires the strings to be known at compile time (you could work around that by compiling a .so and loading it, but that's kind of overkill).
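
For a rough idea of what CMPH usage looks like, here is a sketch based on the vector-adapter example in CMPH's documentation. It is untested against a current build of the library, so treat the exact call names and the CMPH_BDZ algorithm choice as assumptions to verify against the docs:

    #include <cmph.h>
    #include <string.h>

    /* Build a minimal perfect hash over an in-memory string vector.
       Each of the nkeys strings gets a distinct id in [0, nkeys). */
    cmph_t *build_mph(char **keys, unsigned int nkeys)
    {
        cmph_io_adapter_t *source = cmph_io_vector_adapter(keys, nkeys);
        cmph_config_t *config = cmph_config_new(source);
        cmph_config_set_algo(config, CMPH_BDZ);   /* one of several algorithms */
        cmph_t *mph = cmph_new(config);
        cmph_config_destroy(config);
        cmph_io_vector_adapter_destroy(source);
        return mph;
    }

    /* Map a key to its id. */
    unsigned int lookup_id(cmph_t *mph, const char *key)
    {
        return cmph_search(mph, key, (cmph_uint32)strlen(key));
    }

One caveat: a (minimal) perfect hash only guarantees distinct ids for the keys it was built from. A key that was never inserted still maps to some id, so to confirm a hit you must compare the query against the string stored at that slot.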

But frankly, I'd at least try a binary search first. Simply sort the array with qsort, then search it with bsearch (or roll your own). Both are declared in stdlib.h and have been part of the standard library since C89.


If a naive (I assume you mean linear) approach is OK for 100 entries (so about 50 comparisons on average), then a binary search will be more than sufficient for 100,000 entries: it takes at most ceil(log2(100,000)) = 17 comparisons.

So I wouldn't bother with hashes at all; just sort your string table on startup (e.g. using qsort) and later use a binary search (e.g. using bsearch) to look up entries.
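
A minimal sketch of that approach (the table contents are placeholders; the comparator is the part that trips people up, because qsort and bsearch pass pointers to the array elements, i.e. char ** for an array of char *):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* qsort/bsearch hand us pointers to the elements, so each
       argument here is really a pointer to a char pointer. */
    static int cmp_strptr(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        const char *table[] = { "delta", "alpha", "charlie", "bravo" };
        size_t n = sizeof table / sizeof table[0];

        /* One-time setup cost: O(n log n). */
        qsort(table, n, sizeof table[0], cmp_strptr);

        /* Each lookup: O(log n), at most ~17 comparisons for 100,000 entries. */
        const char *key = "charlie";
        const char **hit = bsearch(&key, table, n, sizeof table[0], cmp_strptr);
        printf("%s: %s\n", key, hit ? "found" : "not found");
        return 0;
    }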


If the (maximal) table size is known in advance, a plain hashtable with chaining is very easy to implement. The size overhead is only two ints per item. With a reasonable hash function, only about 1.5 probes per lookup are needed on average, even with the table 100% loaded.
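
A minimal sketch of that scheme, assuming the answer means index-based chains: one head index per bucket plus one next index per item, which is where the two ints per item come from. The names and the choice of FNV-1a as the hash function are illustrative, not the answer's:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        const char **keys;   /* the caller's string table, not copied */
        int *head;           /* first item index per bucket, -1 if empty */
        int *next;           /* next item index in the chain, -1 at end */
        size_t nbuckets;
    } strtable;

    /* FNV-1a: a simple, well-known string hash; any decent hash works. */
    static uint32_t hash_str(const char *s)
    {
        uint32_t h = 2166136261u;
        while (*s) {
            h ^= (unsigned char)*s++;
            h *= 16777619u;
        }
        return h;
    }

    /* Build once at startup: O(n), one bucket per item (100% load). */
    static void strtable_init(strtable *t, const char **keys, size_t n)
    {
        t->keys = keys;
        t->nbuckets = n ? n : 1;
        t->head = malloc(t->nbuckets * sizeof *t->head);
        t->next = malloc(n * sizeof *t->next);
        for (size_t b = 0; b < t->nbuckets; b++)
            t->head[b] = -1;
        for (size_t i = 0; i < n; i++) {          /* prepend to the chain */
            size_t b = hash_str(keys[i]) % t->nbuckets;
            t->next[i] = t->head[b];
            t->head[b] = (int)i;
        }
    }

    /* Returns the key's index in keys[], or -1 if it is not present. */
    static int strtable_find(const strtable *t, const char *key)
    {
        size_t b = hash_str(key) % t->nbuckets;
        for (int i = t->head[b]; i != -1; i = t->next[i])
            if (strcmp(t->keys[i], key) == 0)
                return i;
        return -1;
    }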

Constructing a perfect hash is only feasible if your data does not change. Once it changes, you'll have to recompute and rehash, which is way more expensive than doing a few extra compares.
