Optimising partial dictionary key match

I have a dictionary which uses a 4-tuple as its key. I need to find all the keys in the dictionary which partially match some other tuple. I have some code which does this, but it's slow and needs optimizing.

Here's what I'm after:

Keys:
(1, 2, 3, 4)
(1, 3, 5, 2)
(2, 4, 8, 7)
(1, 4, 3, 4)
Match:
(1, None, 3, None)
Result:
[(1, 2, 3, 4), (1, 4, 3, 4)]

Current code:

def GetTuples(self, keyWords):
    tuples = []
    for k in self.chain.iterkeys():
        match = True
        for i in range(self.order):
            if keyWords[i] is not None and keyWords[i] != k[i]:
                match = False
                break
        if match:
            tuples.append(k)
    return tuples
  • keyWords is a list containing the values I want to match
  • self.chain is the dictionary
  • self.order is the size of the tuple
  • len(keyWords) is always equal to len(k)
  • None is treated as the wildcard
  • The dictionary is pretty huge (this method takes ~800 ms to run, and the structure uses about 300 MB), so space is also a consideration

I'm basically looking for either optimizations to this method, or a better way of storing this data.


You can't optimize this further if you store your data in a plain dictionary, since it provides nothing faster than sequential access to all of its elements in some unpredictable order. This means your solution is no faster than O(n).

Now, databases. A database is not a universal solution to every (complex enough) problem. Can you reliably estimate the speed/complexity of such lookups in a database? If you scroll to the bottom of this reply, you will see that for large data sets database performance can be much worse than that of a smart data structure.

What you need here is a hand-crafted data structure. There are a number of choices, and the right one depends strongly on what else you are doing with this data. For example: you can keep N sorted lists of your keys, each sorted by the n-th tuple element. Then you can quickly select the set of elements matching a single fixed tuple element at each position n, and intersect those sets to get the results. This gives an average performance of O(log n)*O(m), where m is the average number of elements in one subset; a sketch of this idea follows.
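
A minimal sketch of that per-position index idea (build_indexes and lookup_partial are made-up names for this example; each position keeps a value-sorted list of (value, key) pairs and bisect slices out the matching range):

import bisect

def build_indexes(keys, order=4):
    # One index per tuple position: (value, key) pairs sorted by value,
    # plus the bare value list so bisect can search it directly.
    indexes = []
    for n in range(order):
        pairs = sorted((k[n], k) for k in keys)
        values = [v for v, _ in pairs]
        indexes.append((values, pairs))
    return indexes

def lookup_partial(indexes, pattern):
    result = None
    for n, want in enumerate(pattern):
        if want is None:
            continue  # wildcard: this position constrains nothing
        values, pairs = indexes[n]
        lo = bisect.bisect_left(values, want)
        hi = bisect.bisect_right(values, want)
        matched = set(key for _, key in pairs[lo:hi])
        result = matched if result is None else result & matched
    # an all-wildcard pattern would need special-casing (returns [] here)
    return sorted(result) if result else []

keys = [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)]
indexes = build_indexes(keys)
print(lookup_partial(indexes, (1, None, 3, None)))
# [(1, 2, 3, 4), (1, 4, 3, 4)]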

Or you can store your items in a k-d tree; this means you pay an O(log n) insertion price, but you can answer queries like the one above in O(log n) time. Here is an example in Python, using the k-d tree implementation from SciPy:

from scipy.spatial import kdtree
import itertools
import random

random.seed(1)
data = list(itertools.permutations(range(10), 4))
random.shuffle(data)
data = data[:len(data) // 2]

tree = kdtree.KDTree(data)

def match(a, b):
    # None in either tuple acts as a wildcard and matches anything.
    assert len(a) == len(b)
    for i, v in enumerate(a):
        if v != b[i] and (v is not None) and (b[i] is not None):
            return False
    return True

def find_like(kdtree, needle):
    assert len(needle) == kdtree.m
    def do_find(tree, needle):
        # Leaf node: filter its points with the wildcard-aware match().
        if hasattr(tree, 'idx'):
            return list(itertools.ifilter(lambda x: match(needle, x),
                                          kdtree.data[tree.idx]))
        # Wildcard on this node's split dimension: search both subtrees.
        if needle[tree.split_dim] is None:
            return do_find(tree.less, needle) + do_find(tree.greater, needle)
        # Otherwise only one subtree can contain matching points.
        if needle[tree.split_dim] <= tree.split:
            return do_find(tree.less, needle)
        else:
            return do_find(tree.greater, needle)
    return do_find(kdtree.tree, needle)

def find_like_bf(kdtree, needle):
    # Brute-force baseline: linearly scan every point in the tree.
    assert len(needle) == kdtree.m
    return list(itertools.ifilter(lambda x: match(needle, x),
                                  kdtree.data))

import timeit
print "k-d tree:"
print "%.2f sec" % timeit.timeit("find_like(tree, (1, None, 2, None))",
                                "from __main__ import find_like, tree",
                                number=1000)
print "brute force:"
print "%.2f sec" % timeit.timeit("find_like_bf(tree, (1, None, 2, None))",
                                "from __main__ import find_like_bf, tree",
                                number=1000)

And test run results:

$ python lookup.py
k-d tree:
0.89 sec
brute force:
6.92 sec

Just for fun, I also added a benchmark of a database-based solution. The initialization code changes from the above to:

random.seed(1)
data = list(itertools.permutations(range(30), 4))
random.shuffle(data)

Now, the "database" implementation:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE a (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
db.execute("CREATE INDEX x1 ON a(x1)")
db.execute("CREATE INDEX x2 ON a(x2)")
db.execute("CREATE INDEX x3 ON a(x3)")
db.execute("CREATE INDEX x4 ON a(x4)")

db.executemany("INSERT INTO a VALUES (?, ?, ?, ?)",
               [[int(x) for x in value] for value in tree.data])

def db_test():
    cur = db.cursor()
    cur.execute("SELECT * FROM a WHERE x1=? AND x3=?", (1, 2))
    return cur.fetchall()

print "sqlite db:"
print "%.2f sec" % timeit.timeit("db_test()",
                                 "from __main__ import db_test",
                                 number=100)

And the test results, reduced to 100 runs per benchmark (for the resulting 657720-element set of keys):

$ python lookup.py
building tree
done in 6.97 sec
building db
done in 11.59 sec
k-d tree:
1.90 sec
sqlite db:
2.31 sec

It is also worth mentioning that building the tree took about 40% less time than inserting this test data set into the database.

Complete source here: https://gist.github.com/1261449


What about just using a database?

I prefer SQLite + SQLAlchemy even for simple projects, but plain sqlite3 might have a gentler learning curve.

Putting an index on each key column should take care of the speed issues; see the sketch below.
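
A minimal plain-sqlite3 sketch of that idea (the chain table name and get_tuples helper are made up for this example; the WHERE clause is built only from the non-wildcard positions of the pattern):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chain (x1 INTEGER, x2 INTEGER, x3 INTEGER, x4 INTEGER)")
for col in ("x1", "x2", "x3", "x4"):
    db.execute("CREATE INDEX idx_%s ON chain(%s)" % (col, col))

db.executemany("INSERT INTO chain VALUES (?, ?, ?, ?)",
               [(1, 2, 3, 4), (1, 3, 5, 2), (2, 4, 8, 7), (1, 4, 3, 4)])

def get_tuples(pattern):
    # Only the non-None positions become WHERE conditions.
    clauses = ["x%d = ?" % (i + 1) for i, v in enumerate(pattern) if v is not None]
    args = [v for v in pattern if v is not None]
    sql = "SELECT x1, x2, x3, x4 FROM chain"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return [tuple(row) for row in db.execute(sql, args)]

print(get_tuples((1, None, 3, None)))
# [(1, 2, 3, 4), (1, 4, 3, 4)]

Note that SQLite will typically use only one of those single-column indexes per query, picking whichever it estimates to be most selective.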


Perhaps you could speed it up by maintaining indexes for your keys (building them is sketched after the snippet below). Essentially, something like this:

self.indices[2][5]

would contain a set of all the keys which have 5 in the third position.

Then you could simply do set intersection between the relevant index entries to get the set of keys:

matching_keys = None

for i in range(self.order):
    if keyWords[i] is not None:
        if matching_keys is None:
            # Copy, so that &= below does not mutate the stored index set.
            matching_keys = set(self.indices[i][keyWords[i]])
        else:
            matching_keys &= self.indices[i][keyWords[i]]

matching_keys = list(matching_keys) if matching_keys else []
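
For completeness, here is one way those indices could be built and kept up to date as items are added (the Chain class and its add method are hypothetical, just a container for this sketch):

from collections import defaultdict

class Chain(object):
    def __init__(self, order=4):
        self.order = order
        self.chain = {}  # the real 4-tuple -> value mapping
        # indices[i][v] is the set of keys whose i-th element is v.
        self.indices = [defaultdict(set) for _ in range(order)]

    def add(self, key, value):
        self.chain[key] = value
        for i, v in enumerate(key):
            self.indices[i][v].add(key)

Using defaultdict(set) also means that looking up a value which never occurs yields an empty set rather than raising a KeyError.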


Riffing on Amber's answer:

>>> from collections import defaultdict
>>> index = defaultdict(lambda:defaultdict(set))
>>> keys = [(1, 2, 3, 4),
...         (1, 3, 5, 2),
...         (2, 4, 8, 7),
...         (1, 4, 3, 4),
...         ]
>>> for key in keys:
...     for i, val in enumerate(key):
...         index[i][val].add(key)
... 
>>> def match(goal):
...     res = []
...     for i, val in enumerate(goal):
...         if val is not None:
...             res.append(index[i][val])
...     return reduce(set.intersection, res)
... 
>>> match((1, None, 3, None))
set([(1, 4, 3, 4), (1, 2, 3, 4)])
