N-gram split function for string similarity comparison_问答_开发者

As part of excersise to better understand F# which I am currently learning , I wrote function to split given string into n-grams.

1) I would like to receive feedback about my function : can this be written simpler or in more efficient way?

2) My overall goal is to write function that returns string similarity (on 0.0 .. 1.0 scale) based on n-gram similarity; Does this approach works well for short strings comparisons , or can this method reliably be used to compare large strings (like articles for example).

3) I am aware of the fact that n-gram comparisons ignore context of two strings. What method would you suggest to accomplish my goal?

//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
    let ngram_count = s.Length - (s.Length % n)
    let ngram_list = List.init ngram_count (fun i ->
        if( i + n >= s.Length ) then
        s.Substring(i,s.Length - i) + String.init ((i + n) - s.Length)
            (fun i -> "#")
        else
            s.Substring(i,n)
    )
    let ngram_array_unique = ngram_list
                            |> Seq.ofList
                            |> 开发者_如何学JAVASeq.distinct
                            |> Array.ofSeq

//produce tuples of ngrams (ngram string,how much occurrences in original string)

    Seq.init ngram_array_unique.Length (fun i -> (ngram_array_unique.[i],
        ngram_list |> List.filter(fun item -> item = ngram_array_unique.[i])
                                   |> List.length)
                                        )

I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.

Many of the operations that you need to do are already available in some F# library function for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:

open System

let ngramSplit n (s:string) = 
  let ngrams = Seq.windowed n s
  let grouped = Seq.groupBy id ngrams
  Seq.map (fun (ngram, occurrences) -> 
    String(ngram), Seq.length occurrences) grouped

The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.

Alternatively, you can write the entire function as a single processing pipeline like this:

let ngramSplit n (s:string) = 
  s |> Seq.windowed n
    |> Seq.groupBy id 
    |> Seq.map (fun (ngram, occurrences) -> 
         String(ngram), Seq.length occurrences)

Your code looks OK to me. Since ngram extraction and similarity comparison are used very often. You should consider some efficiency issues here.

The MapReduce pattern is very suitable for your frequency counting problem:

get a string, emit (word, 1) pairs out

do a grouping of the words and adds all the counting together.

let wordCntReducer (wseq: seq<int*int>) =

   wseq 
   |> Seq.groupBy (fun (id,cnt) -> id) 
   |> Seq.map (fun (id, idseq) -> 
            (id, idseq |> Seq.sumBy(fun (id,cnt) -> cnt)))
(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2;] *)

You also need to maintain a <word,int> map during your ngram building for a set of strings. As it is much more efficient to handle integers rather than strings during later processing.

(2) to compare the distance between two short strings. A common practice is to use Edit Distance using a simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use TFIDF feature representation. Actuallym the code above is for term frequency counting, extracted from my data mining library.

(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to in-cooperate the context information in.

I think you have some good answers for question (1).

Question (2):

You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:

let cos a b = 
  let dot = Seq.sum (Seq.map2 ( * ) a b)
  let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
  dot / (magnitude a * magnitude b)

For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:

let values map = map |> Map.toSeq |> Seq.map snd
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1
let distance textA textB =
  let a = ngramSplit 3 textA |> Map.ofSeq
  let b = ngramSplit 3 textB |> Map.ofSeq
  let aValues = desparse a b |> values
  let bValues = desparse b a |> values
  cos aValues bValues

With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licencing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.

Question (3):

The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.

The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Maromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.

Question 3:

My reference book is Computing Patterns in Strings by Bill Smyth