
Finding Similar Paragraphs in Different Documents [closed]

Closed. This question is seeking recommendations for books, tools, software libraries, and more. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 6 years ago.


I am trying to figure out what kind of analysis, and which currently available libraries, could take two text paragraphs and determine whether there is some degree of similarity between them (whether with regard to subjects and verbs, or to context). I imagine some NLP-style analysis could be applied, but I'd like to hear what experiences people have had with such solutions.

For example, consider the following two paragraphs:

  1. Governing Law. This Agreement is governed by the laws of the State of Missouri without reference to its conflict of law principles.

  2. Law to Apply. The laws of the State of Missouri shall apply to this Agreement.

Fundamentally, I'd want to have those two clauses picked as identical.

I am looking for a library available under .Net.


Well, that is an interesting question. You could use NLTK to extract the core concepts (noun groups) and compare those directly. In this case you'd get:

  1. Governing Law, Agreement, Laws, State, Missouri, conflict, law principles
  2. Law, laws, State, Missouri, Agreement

Now, this similarity is not bi-directional. Group 2 is fully represented in Group 1, but not the other way around: only 4 of Group 1's 7 concepts (Agreement, Laws, State, Missouri) appear in Group 2. Take the fraction of each group that is represented in the other, so G21 = 1.0 and G12 = 4/7 ≈ 0.57, and combine them with a harmonic mean: H = 2AB/(A+B) = 2(1.0)(0.57)/(1.0 + 0.57) ≈ 0.73.

Now, this isn't an exact match, but in your example you wanted those two paragraphs to count as the same. In this case their harmonic mean, H, is about 0.73. The higher the number, the harder it is to achieve: H > 0.8 is considered good, and H > 0.9 is exceptional for most systems. So what you must decide is where to draw your line in the sand. It has to be arbitrary, because you haven't given a definition of the degree of similarity. Do you set it at 0.6? 0.7? How about 0.12948737? A good way of discovering this threshold is to take test examples, judge their similarity yourself without doing the math, then run the numbers and see what you come up with.
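Here is a minimal sketch of that approach in Python (NLTK's own language). The chunk grammar and the strict set-intersection matching are my simplifications, so its scores will differ a little from the hand-counted numbers above (it would not, for instance, match "Law" against "Governing Law"):

```python
import nltk

# One-time setup (assumed already done):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Chunk grammar: an optional determiner, any adjectives, one or more nouns.
NP_GRAMMAR = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def noun_groups(text):
    """Extract a paragraph's noun groups (its core concepts) as a set."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = NP_GRAMMAR.parse(tagged)
    return {" ".join(word.lower() for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")}

def harmonic_similarity(para_a, para_b):
    """Harmonic mean of the two directional coverage fractions."""
    a, b = noun_groups(para_a), noun_groups(para_b)
    if not a or not b:
        return 0.0
    common = a & b
    cov_ab = len(common) / len(a)  # fraction of A's concepts found in B
    cov_ba = len(common) / len(b)  # fraction of B's concepts found in A
    if cov_ab + cov_ba == 0:
        return 0.0
    return 2 * cov_ab * cov_ba / (cov_ab + cov_ba)
```

You would then compare `harmonic_similarity(clause_1, clause_2)` against whatever threshold you settle on.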


I don't know whether there is a .NET implementation, but you can easily code this yourself.

You can (A) build a reversed n-gram index, (B) look up the n-grams of your search paragraph in it, and (C) divide the number of common n-grams by the total number of n-grams, which gives you a similarity measure you can threshold (and probably refine further).

(A) Create a reversed n-gram index: get all n-grams from the paragraphs you want to search through and store them in a db.

Example:

A small corpus consisting of the following (short) paragraphs: {rain, spain, main, plain}

has the following 3-grams (with # marking a word boundary): {#ra, rai, ain, in#, #sp, spa, pai, #ma, mai, #pl, pla, lai}

You get the following key-value pairs: {(#ra, rain), (rai, rain), (ain, rain), (ain, spain), (ain, main), (ain, plain), (in#, rain), (in#, spain), (in#, main), (in#, plain), (#sp, spain), (spa, spain), (pai, spain), (#ma, main), (mai, main), (#pl, plain), (pla, plain), (lai, plain)}
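A minimal in-memory sketch of step (A), with a plain dict of sets standing in for the db:

```python
from collections import defaultdict

def ngrams(text, n=3):
    """All n-grams of a text, with '#' marking the word boundaries."""
    padded = f"#{text}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_index(corpus, n=3):
    """Reversed index: n-gram -> set of paragraphs that contain it."""
    index = defaultdict(set)
    for paragraph in corpus:
        for gram in ngrams(paragraph, n):
            index[gram].add(paragraph)
    return index

index = build_index(["rain", "spain", "main", "plain"])
# index["ain"] -> {"rain", "spain", "main", "plain"}
```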

(B) When looking up a paragraph to match against the corpus, calculate all its n-grams, look up each n-gram in the reversed index.

Example:

Look up "pain" in the db we just created. We get the following n-grams: {#pa, pai, ain, in#}

Here are the matching entries:

#pa -> no match

pai -> {spain}

ain -> {rain, spain, main, plain}

in# -> {rain, spain, main, plain}
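Continuing the sketch above, step (B) just unions the index entries for the query's n-grams:

```python
def lookup(index, query, n=3):
    """Find every indexed paragraph sharing at least one n-gram with the query."""
    candidates = set()
    for gram in ngrams(query, n):            # ngrams() from the sketch above
        candidates |= index.get(gram, set()) # .get avoids growing the defaultdict
    return candidates

lookup(index, "pain")  # -> {"rain", "spain", "main", "plain"}
```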

(C) Calculate a score for each paragraph you find by dividing the number of common n-grams by the size of the union of all n-grams in the search paragraph and the result paragraph (this is the Jaccard index).

Example

pain vs. spain: common n-grams: {pai, ain, in#}, union of n-grams: {#pa, #sp, spa, pai, ain, in#}, score: 3/6 (0.5)

pain vs. rain: common n-grams: {ain, in#}, union of n-grams: {#pa, pai, #ra, rai, ain, in#}, score: 2/6 (≈0.33)

pain vs. main: common n-grams: {ain, in#}, union of n-grams: {#pa, pai, #ma, mai, ain, in#}, score: 2/6 (≈0.33)

pain vs. plain: common n-grams: {ain, in#}, union of n-grams: {#pa, pai, #pl, pla, lai, ain, in#}, score: 2/7 (≈0.29)
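Step (C) as code, again reusing ngrams() and the index from above:

```python
def score(query, candidate, n=3):
    """Jaccard index over the two paragraphs' n-gram sets."""
    q, c = set(ngrams(query, n)), set(ngrams(candidate, n))
    return len(q & c) / len(q | c)

# Rank the candidates found by lookup():
sorted(((score("pain", p), p) for p in lookup(index, "pain")), reverse=True)
# -> spain (0.5) first, then rain and main (≈0.33), then plain (≈0.29)
```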

Implementation details:

  • Make the n in n-grams configurable both in search and storage, this gives you flexibility to experiment with 3-grams, 4-grams, 5-grams, ...
  • Use a fast (in-memory/in-process) db, such as SQLite or BerkeleyDB
  • Make sure duplicate keys are allowed (BerkeleyDB allows this), but you will need a workaround for duplicate key-value pairs: for example, "maintain" contains the 3-gram ain twice, so technically you cannot store both occurrences in the db (one workaround, keeping a count per pair, is sketched below)
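A minimal sketch of that workaround with SQLite (the schema is just an illustration), reusing ngrams() from the earlier sketch:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ngram_index (
        gram      TEXT,
        paragraph TEXT,
        count     INTEGER,          -- how often the n-gram occurs
        PRIMARY KEY (gram, paragraph)
    )
""")

def add_paragraph(conn, paragraph, n=3):
    counts = Counter(ngrams(paragraph, n))  # ngrams() from the sketch above
    conn.executemany("INSERT INTO ngram_index VALUES (?, ?, ?)",
                     [(g, paragraph, c) for g, c in counts.items()])

add_paragraph(conn, "maintain")
conn.execute("SELECT count FROM ngram_index WHERE gram = 'ain'").fetchone()
# -> (2,)
```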

Happy coding


I suggest using the same algorithm Google uses to find near-duplicate documents on the web: http://www.cs.sunysb.edu/~cse692/papers/henzinger_sigir06.pdf

Hash the phrases using Rabin's fingerprinting algorithm, sort these hashes, and compare the bottom 10. Very fast.
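A toy sketch of that bottom-k idea, using MD5 as a stand-in for Rabin fingerprints (the shingle size and k = 10 are just illustrative choices):

```python
import hashlib

def bottom_k_sketch(text, k=10, shingle_words=4):
    """Hash every overlapping word shingle and keep the k smallest hashes."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + shingle_words])
                for i in range(max(1, len(words) - shingle_words + 1)))
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return set(hashes[:k])

def sketch_similarity(a, b, k=10):
    """Overlap of the bottom-k sketches roughly estimates document similarity."""
    return len(bottom_k_sketch(a, k) & bottom_k_sketch(b, k)) / k
```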

Patrick.
