Subsequence search_问答_开发者_运维开发者技术经验分享

I have a large number of lists (35 MB in total) which I would like to search for subsequences: each term must appear in order but not necessarily consecutively. So 1, 2, 3 matches each of

1, 2, 3, 4, 5, 6
1, 2, 2, 3, 3, 3

but not

6, 5, 4, 3, 2, 1
123, 4, 5, 6, 7

(, is a delimiter, not characters to match.)

Short of running a regex (/1, ([^,]+, )*2, ([^,]+, )*3/ for the example) on tens or hundreds of thousands of sequences, how can I determine which sequences are a match? I can preprocess the sequences, though memory usage needs to stay reasonable (within a constant factor of the existing sequence size, say). The longest sequence is short, less than a kilobyte, so 开发者_如何学Goyou can assume queries are short as well.

This reminds me of sequence alignment from bioinformatics, where you try to match a small snippet of DNA against a large database. The differences are your presumably larger alphabet, and your increased tolerance for arbitrarily long gaps.

You may find some inspiration looking at the existing tools and algorithms, notably Smith-Waterman and BLAST.

If the individual numbers are spread out over the file and not occurring on the majority of lines then a simple indexing of the line number where they occur could give you a speed up. This will however be slower if your data are lines of the same numbers repeated in different orders.

To build the index would only require a single pass of the data along these lines:

Hash<int, List<int>> index

line_number = 1
foreach(line in filereader)
{
    line_number += 1
    foreach(parsed_number in line)
        index[parsed_number].append(line)
}

That index could be stored and reused for the dataset. To search on it would only need something like this. Please excuse the mixed psuedocode, I've tried to make it as clear as possible. It "return"s when it's out of possible matches and "yield"s a line number when all of the elements of the substring occur on that line.

// prefilled hash linking number searched for to a list of line numbers
// the lines should be in ascending order
Hash<int, List<int>> index

// The subsequence we're looking for
List<int> subsequence = {1, 2, 3}
int len = subsequence.length()

// Take all the lists from the index that match the numbers we're looking for
List<List<int>> lines = index[number] for number in subsequence

// holder for our current search row
// has the current lowest line number each element occurs on 
int[] search = new int[len]
for(i = 0; i < len; i++)
    search[i] = lines[i].pop()

while(true)
{
    // minimum line number, substring position and whether they're equal
    min, pos, eq = search[0], 0, true

    // find the lowest line number and whether they all match
    for(i = 0; i < len; i++)
    {
        if(search[i] < min)
            min, p, eq = search[i], i, false
        else if (search[i] > min)
            eq = false
    }

    // if they do all match every one of the numbers occurs on that row
    if(eq)
    {
        yield min; // line has all the elements

        foreach(list in lines)
            if(list.empty())  // one of the numbers isn't in any more lines
                 return

        // update the search to the next lowest line number for every substring element
        for(i = 0; i < len; i++)
            search[i] = lines[i].pop()
    }
    else
    {
        // the lowest line number for each element is not the same, so discard the lowest one
        if(lines[position].empty()) // there are no more lines for the element we'd be updating
            return

        search[position] = lines[position].pop();
    }
}

Notes:

This could trivially be extended to store the position in the line as well as the line number and then only a little extra logic at the "yield" point would be able to determine an actual match instead of just that all the items are present.
I've used "pop" to show how it's moving through the line numbers but you don't actually want to be destroying your index every search.
I've assumed the numbers all fit into ints here. Extend it to longs or even map the string representation of each number to an int if you have really huge numbers.
There are some speedups to be had, especially in skipping lines at the "pop" stages, but I went for the clearer explanation.

Whether using this or another method you could also chop down the computation depending on the data. A single pass to work out whether each line is ascending, descending, all odd, all even, or what the highest and lowest numbers are could be used to cut down the search space for each substring. Whether these would be useful depends entirely on your dataset.

maybe I misunderstood but, isn't this straightforward like this?

search = [1 2 3]
for sequence in sequences:
  sidx = 0
  for item in sequence:
    if item==search[sidx]:
       sidx++
       if sidx>=len(search): break
  if sidx>len(search):
    print sequence + "matches"

it seems to be O(N) for N sequences and O(M) for searching for subsequence length M

not sure if this would be that much faster than a regex though?