Algorithm detect repeating/similiar strings in a corpus of data -- say email subjects, in Python [duplicate]_问答_开发者

This question already has answers here: 开发者_StackOverflow A better similarity ranking algorithm for variable length strings (25 answers) Closed 9 years ago.

I'm downloading a long list of my email subject lines , with the intent of finding email lists that I was a member of years ago, and would want to purge them from my Gmail account (which is getting pretty slow.)

I'm specifically thinking of newsletters that often come from the same address, and repeat the product/service/group's name in the subject.

I'm aware that I could search/sort by the common occurrence of items from a particular email address (and I intend to), but I'd like to correlate that data with repeating subject lines....

Now, many subject lines would fail a string match, but "Google Friends : Our latest news" "Google Friends : What we're doing today" are more similar to each other than a random subject line, as is: "Virgin Airlines has a great sale today" "Take a flight with Virgin Airlines"

So -- how can I start to automagically extract trends/examples of strings that may be more similar.

Approaches I've considered and discarded ('because there must be some better way'):

Extracting all the possible substrings and ordering them by how often they show up, and manually selecting relevant ones
Stripping off the first word or two and then count the occurrence of each sub string
Comparing Levenshtein distance between entries
Some sort of string similarity index ...

Most of these were rejected for massive inefficiency or likelyhood of a vast amount of manual intervention required. I guess I need some sort of fuzzy string matching..?

In the end, I can think of kludgy ways of doing this, but I'm looking for something more generic so I've added to my set of tools rather than special casing for this data set.

After this, I'd be matching the occurring of particular subject strings with 'From' addresses - I'm not sure if there's a good way of building a data structure that represents how likely/not two messages are part of the 'same email list' or by filtering all my email subjects/from addresses into pools of likely 'related' emails and not -- but that's a problem to solve after this one.

Any guidance would be appreciated.

I would first turn each string of characters into a set or multiset of words (ignoring punctuation and differences in lower/upper case). (If that's not powerful enough, in a second pass I could try pairs or even triples of adjacent words, known as bigrams and trigrams). The key measure of similarity between strings thus reduced is, what words that are not high frequency overall (not the, and, etc;-) are common to both strings, so a simple set intersection (or multiset intersection, but for your simple use case I think just sets would work fine, esp. sets of bigrams) should suffice to measure the "commonality". A word that's common to the two strings should be worth more the rarer it is, so negative log of frequency of the word across the whole corpus is an excellent starting point for this heuristics.

Smooth BLEU

You might be able to make use of the smooth-BLEU score between the subjects. BLEU is an evaluation metric that is used to score how similar the translations produced by a machine translation system are to translations produced by humans. Smooth BLEU is calculated like the normal BLEU score, except that you add one to the n-gram match counts so that you avoid multiplying anything by zero when evaluating short segments of text.

Smooth-BLEU should be much faster to compute than the Levenshtein distance, while still capturing word order information, since it looks at n-gram matches rather than just matches between single words.

Unfortunately, I don't have a pointer to a Python BLEU implementation, but you'll find a Perl implementation from NIST here.