Approximate string matching using backtracking_问答_开发者

I would like to use backt开发者_JAVA技巧racking to search for all substrings in a long string allowing for variable length matches - that is matches allowing for a maximum given number of mismatches, insertions, and deletions. I have not been able to locate any useful examples. The closest I have found is this paper here, but that is terribly complex. Anyone?

Cheers,

Martin

Algorithm

The function ff() below uses recursion (i.e. backtracking) to solve your problem. The basic idea is that at the start of any call to f(), we are trying to match a suffix t of the original "needle" string to a suffix s of the "haystack" string, while allowing only a certain number of each type of edit operation.

// ss is the start of the haystack, used only for reporting the match endpoints.
void f(char* ss, char* s, char* t, int mm, int ins, int del) {
    while (*s && *s == *t) ++s, ++t;    // OK to always match longest segment
    if (!*t) printf("%d\n", s - ss);    // Matched; print endpoint of match
    if (mm && *s && *t) f(ss, s + 1, t + 1, mm - 1, ins, del);
    if (ins && *s) f(ss, s + 1, t, mm, ins - 1, del);
    if (del && *t) f(ss, s, t + 1, mm, ins, del - 1);
}

// Find all occurrences of t starting at any position in s, with at most
// mm mismatches, ins insertions and del deletions.
void ff(char* s, char* t, int mm, int ins, int del) {
    for (char* ss = s; *s; ++s) {
//      printf("Starting from offset %d...\n", s - ss);
        f(ss, s, t, mm, ins, del);
    }
}

Example call:

ff("xxabcydef", "abcdefg", 1, 1, 1);

This outputs:

9
9

because there are two ways to find "abcdefg" in "xxabcydef" with at most 1 of each kind of change, and both of these ways end at position 9:

Haystack: xxabcydef-
Needle:     abc-defg

which has 1 insertion (of y) and 1 deletion (of g), and

Haystack: xxabcyde-f
Needle:     abc-defg

which has 1 insertion (of y), 1 deletion (of f), and 1 substitution of g to f.

Dominance Relation

It may not be obvious why it's actually safe to use the while loop on line 3 to greedily match as many characters as possible at the start of the two strings. In fact this may reduce the number of times that a particular end position will be reported as a match, but it will never cause an end position to be forgotten completely -- and since we're usually interested in just whether or not there is a match ending at a given position of the haystack, and without this while loop the algorithm would always take time exponential in the needle size, this seems a win-win.

It is guaranteed to work because of a dominance relation. To see this, suppose the opposite -- that it is in fact unsafe (i.e. misses some matches). Then there would be some match in which an initial segment of equal characters from both strings are not aligned to each other, for example:

Haystack: abbbbc
Needle:   a-b-bc

However, any such match can be transformed into another match having the same number of operations of each type, and ending at the same position, by shunting the leftmost character following a gap to the left of the gap:

Haystack: abbbbc
Needle:   ab--bc

If you do this repeatedly until it's no longer possible to shunt characters without requiring a substitution, you will get a match in which the largest initial segment of equal characters from both strings are aligned to each other:

Haystack: abbbbc
Needle:   abb--c

My algorithm will find all such matches, so it follows that no match position will be overlooked by it.

Exponential Time

Like any backtracking program, this function will exhibit exponential slowdowns on certain inputs. Of course, it may be that on the inputs you happen to use, this doesn't occur, and it works out faster than particular implementations of DP algorithms.

I would start with Levenshtein's distance algorithm, which is the standard approach when checking for string similarities via mismatch, insertion and deletion.

Since the algorithm uses bottom up dynamic programming, you'll probably be able to find all substrings without having to execute the algorithm for each potential substring.

The nicest algorithm I'm aware of for this is A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming by Gene Myers. Given a text to search of length n, a pattern string to search for of length m and a maximum number of mismatches/insertions/deletions k, this algorithm takes time O(mn/w), where w is your computer's word size (32 or 64). If you know much about algorithms on strings, it's actually pretty incredible that an algorithm exists that takes time independent of k -- for a long time, this seemed an impossible goal.

I'm not aware of an existing implementation of the above algorithm. If you want a tool, agrep may be just what you need. It uses an earlier algorithm that takes time O(mnk/w), but it's fast enough for low k -- miles ahead of a backtracking search in the worst case.

agrep is based on the Shift-Or (or "Bitap") algorithm, which is a very clever dynamic programming algorithm that manages to represent its state as bits in an integer and get binary addition to do most of the work of updating the state, which is what speeds up the algorithm by a factor of 32 or 64 over a more typical implementation. :) Myers's algorithm also uses this idea to get its 1/w speed factor.