Data structure with O(1) insertion time and O(log m) lookup?

https://www.devze.com 2023-04-12 18:48 Source: web
Backstory (skip to second-to-last paragraph for data structure part): I'm working on a compression algorithm (of the LZ77 variety). The algorithm boils down to finding the longest match between a given string and all strings that have already been seen.

To do this quickly, I've used a hash table (with separate chaining) as recommended in the DEFLATE spec: I insert every string seen so far one at a time (one per input byte) with m slots in the chain for each hash code. Insertions are fast (constant-time with no conditional logic), but searches are slow because I have to look at O(m) strings to find the longest match. Because I do hundreds of thousands of insertions and tens of thousands of lookups in a typical example, I need a highly efficient data structure if I want my algorithm to run quickly (currently it's too slow for m > 4; I'd like an m closer to 128).
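For reference, the insert/search pattern described above might be sketched like this (Python is used purely for illustration; the bucket count, hash function, and bounded-chain policy here are assumptions for the sketch, not the actual DEFLATE constants):

```python
class ChainedMatchTable:
    """Sketch of a separate-chaining match finder: each bucket keeps
    the last m start positions seen for a hash code. Insertion is
    O(1); finding the longest match scans all m candidates, O(m)."""

    def __init__(self, nbuckets=4096, m=4):
        self.m = m
        self.buckets = [[] for _ in range(nbuckets)]

    def _hash(self, data, pos):
        # Hash the 3 bytes at pos (an assumed toy hash, not DEFLATE's).
        return (data[pos] * 65521 + data[pos + 1] * 257
                + data[pos + 2]) % len(self.buckets)

    def insert(self, data, pos):
        chain = self.buckets[self._hash(data, pos)]
        chain.append(pos)           # O(1) append
        if len(chain) > self.m:
            chain.pop(0)            # keep only the m most recent positions

    def longest_match(self, data, pos):
        """Scan all m chained candidates; return (position, length)."""
        best_pos, best_len = -1, 0
        for cand in self.buckets[self._hash(data, pos)]:
            length = 0
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = cand, length
        return best_pos, best_len
```

The O(m) loop in longest_match is exactly the cost being asked about: doubling m doubles the work per lookup.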

I've implemented a special case where m is 1, which runs very fast but offers only so-so compression. Now I'm working on an algorithm for those who'd prefer improved compression ratio over speed, where the larger m is, the better the compression gets (to a point, obviously). Unfortunately, my attempts so far are too slow for the modest gains in compression ratio as m increases.

So, I'm looking for a data structure that allows very fast insertion (since I do more insertions than searches), but still fairly fast searches (better than O(m)). Does an O(1) insertion and O(log m) search data structure exist? Failing that, what would be the best data structure to use? I'm willing to sacrifice memory for speed. I should add that on my target platform, jumps (ifs, loops, and function calls) are very slow, as are heap allocations (I have to implement everything myself using a raw byte array in order to get acceptable performance).

So far, I've thought of storing the m strings in sorted order, which would allow O(log m) searches using a binary search, but then each insertion needs O(log m) comparisons to find its slot (plus O(m) shifting if the strings live in an array).

Thanks!


You might be interested in this match-finding structure:

http://encode.ru/threads/1393-A-proposed-new-fast-match-searching-structure

It has O(1) insertion time and O(m) lookup, but its m is many times lower than a standard hash table's for an equivalent match-finding result. As an example, with m = 4, this structure gets results equivalent to those of an 80-probe hash table.


You might want to consider using a trie (aka prefix tree) instead of a hash table.

For your particular application, you might be able to additionally optimize insertion. If you know that after inserting ABC you're likely to insert ABCD, then keep a reference to the entry created for ABC and just extend it with D, with no need to repeat the lookup of the prefix.
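A minimal sketch of that idea (the node layout and the extend helper are assumptions for illustration, not a fixed API): insert walks the trie once, returns the final node, and extending by one byte reuses that node instead of re-walking the shared prefix.

```python
class TrieNode:
    __slots__ = ("children", "position")

    def __init__(self):
        self.children = {}    # byte value -> child TrieNode
        self.position = None  # input position where this prefix was last seen


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, data, start, length):
        """Insert data[start:start+length]; return the final node
        so the caller can extend it later."""
        node = self.root
        for i in range(start, start + length):
            node = node.children.setdefault(data[i], TrieNode())
            node.position = start
        return node

    def extend(self, node, data, pos, byte_index):
        """Extend a previously returned node by one byte in O(1),
        avoiding a repeat walk of the shared prefix."""
        child = node.children.setdefault(data[byte_index], TrieNode())
        child.position = pos
        return child

    def longest_match(self, data, start, max_len):
        """Walk the trie byte by byte; return (position, length) of
        the longest previously inserted prefix of data[start:]."""
        node, best_pos, best_len = self.root, None, 0
        for i in range(start, min(start + max_len, len(data))):
            node = node.children.get(data[i])
            if node is None:
                break
            best_pos, best_len = node.position, i - start + 1
        return best_pos, best_len
```

Note the caveat for the asker's platform: a pointer-based trie involves many heap allocations and branches, so it would need to be flattened into a preallocated array to meet the stated constraints.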


One common optimization in hash tables is to move the item you just found to the head of the list (with the idea that it's likely to be used again soon for that bucket). Perhaps you can use a variation of this idea.

If you do all of your insertions before you do your lookups, you can add a bit to each bucket that says whether the chain for that bucket is sorted. On each lookup, you can check the bit to see if the bucket is sorted. If not, you would sort the bucket and set the bit. Once the bucket is sorted, each lookup is O(log m).

If you interleave insertions and lookups, you could have 2 lists for each bucket: one that's sorted and one that isn't. Inserts would always go to the non-sorted list. A lookup would first check the sorted list, and only if the key isn't there would it look in the non-sorted list. When it's found in the non-sorted list, you would remove it and put it in the sorted list. This way you only pay to sort items that you actually look up.
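The two-list variant can be sketched as follows (a single bucket shown for clarity; the names are illustrative assumptions). Inserts append in O(1); a lookup tries the sorted list first with a binary search and falls back to a linear scan of the unsorted list, promoting any key it finds there.

```python
import bisect


class TwoListBucket:
    """One bucket with a sorted list plus an unsorted overflow list.
    Inserts append to the unsorted list in O(1); a lookup that hits
    the unsorted list promotes the key into the sorted list, so only
    keys that are actually searched for pay the sorting cost."""

    def __init__(self):
        self.sorted_keys = []
        self.unsorted_keys = []

    def insert(self, key):
        self.unsorted_keys.append(key)  # O(1), no comparisons

    def lookup(self, key):
        # Fast path: O(log m) binary search in the sorted list.
        i = bisect.bisect_left(self.sorted_keys, key)
        if i < len(self.sorted_keys) and self.sorted_keys[i] == key:
            return True
        # Slow path: linear scan, then promote the key if found.
        if key in self.unsorted_keys:
            self.unsorted_keys.remove(key)
            bisect.insort(self.sorted_keys, key)
            return True
        return False
```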

