Data structure for fixed length string lookup_问答_开发者

I have a bunch of strings as keys. Something like...

AAAA ABBA ACEA ALFG
...
...
ZURF [AAA _JFS aKDJ

They are all unique combination of any 4 characters and are all the same length. There are hundreds of thousands of these. I want to perform a lookup and retrieve the value associated with each string.

I currently have it implemented as a hash table, but the main concern is collisions (I've implemented all of the strategie开发者_Python百科s on Wiki).

I am thinking of implementing this as a prefix tree. Given the parameters though (unique, fixed length), I'm wondering if there is a out-of-the-box data structure I can't think of that would be best suited for this...

EDIT: Additionally, all possible combinations are populated once by a data file. Afterwards, lookups happen at wire speed.

Since you know all of the strings ahead of time, you can use gperf to generate a perfect hash function, which has no collisions. For example, with the four input strings AAAA ABBA ACEA ALFG, it generated the following hash function (using the command line gperf -L ANSI-C input.txt):

static unsigned int
hash (register const char *str, register unsigned int len)
{
  static unsigned char asso_values[] =
    {
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12,  7,  2,  5, 12, 12,
      12, 12, 12, 12, 12, 12,  0, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
      12, 12, 12, 12, 12, 12
    };
  return len + asso_values[(unsigned char)str[1]];
}

const char *
in_word_set (register const char *str, register unsigned int len)
{
  static const char * wordlist[] =
    {
      "", "", "", "",
      "ALFG",
      "",
      "ABBA",
      "", "",
      "ACEA",
      "",
      "AAAA"
    };

  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
    {
      register int key = hash (str, len);

      if (key <= MAX_HASH_VALUE && key >= 0)
        {
          register const char *s = wordlist[key];

          if (*str == *s && !strcmp (str + 1, s + 1))
            return s;
        }
    }
  return 0;
}

Which requires a single table lookup, a length comparison, and a string comparison. If you know for sure that the word you're hashing is one of your source words, then you can skip the string comparison.

Expanding the input size from 4 to 10000 randomly-generated strings increases the hash function to just 4 table lookups plus a length comparison and string comparison. But, since the string comparison has to store every source string in it, this comes out to a very large table in the compiled object file (1.4 MB). If you don't need to do the string comparison, you can omit that table.

A hash table, even with collisions, will outperform anything else, and you can tune it to reduce collisions.

First, transfer each string into an integer. If your alphabet contains 64 symbols (for example), you can use 4*6=24 bits integers as keys.

Now, if more than half of the possible keys are in use (as you say, there are hundreds of thousands of these), maybe to simplest solution will do: just build an array, an access it by index (the integer deduced from the string).

If possible, implement this with a single memory allocation. It may even save memory (The memory wasted due to 100,000's of small allocations).