Tagging similar sentences with lower time complexity than n^2_问答_开发者

开发者 https://www.devze.com 2023-01-14 05:34 出处：网络

This is my first post, have been a lurker for a long time, so will try my best to explain myself here.

I have been using lowest common substring method along with basic word match and substring match(regexp) for clustering similar stories on the net. But the problem is its time complexity is n^2 (I compare each title to all the others). I've done very basic optimizations like storing and skipping all the matched titles.

What I want is some kind of preprocessing of the chunk of text so that for each iteration i reduce number of posts to match to. Any further optimizations are also welcome.

Here are the functions i use for the same. the m开发者_StackOverflow社区ain function which calls them first calls word_match, if more than 70% of the word matches i further go down and call 'substring_match' and LCSubstr_len. The code is in Python, I can use C as well

import re

def substring_match(a,b):
    try:
        c = re.match(a,b) 
        return c if c else True if re.match(b,a) else False
    except:
        return False

def LCSubstr_len(S, T):
    m = len(S); n = len(T)
    L = [[0] * (n+1) for i in xrange(m+1)]
    lcs = 0
    for i in xrange(m):
     for j in xrange(n):
         if S[i] == T[j]:
             L[i+1][j+1] = L[i][j] + 1
             lcs = max(lcs, L[i+1][j+1])
         else:
             L[i+1][j+1] = max(L[i+1][j], L[i][j+1])
    return lcs/((float(m+n)/2))

def word_match(str1,str2):
    matched = 0
    try:
        str1,str2 = str(str1),str(str2)
        assert isinstance(str1,str)
    except:
        return 0.0
    words1 = str1.split(None)
    words2 = str2.split(None)
    for i in words1:
        for j in words2:
            if i.strip() ==j.strip():
                matched +=1
    len1 = len(words1)
    len2 = len(words2)
    perc_match = float(matched)/float((len1+len2)/2)
    return perc_match

Use an inverted index: for each word, store a list of pairs (docId, numOccurences). Then, to find all strings which might be similar to a given string, go through its words and look up strings containing that word in the inverted index. This way you'll get a table "(docId, wordMatchScore)" that automatically contains only entries where wordMatchScore is non-zero.

There are a huge number of possible optimizations; also, your code is extremely non-optimal, but if we're talking about decreasing the number of string pairs for comparison, then that's it.

Speeding up word_match is easy with sets:

def word_match(str1,str2):
    # .split() splits on all whitespace, you dont needs .strip() after
    words1 = set(str1.split())
    words2 = set(str2.split())
    common_words = words1 & words2
    return 2.0*len(common_words)/(len(words1)+len(words2))

It also shows that 'A A A' and 'A' have 100% in common by this measure ...