Getting percentage of similarity of two texts

Developer https://www.devze.com 2023-02-15 04:04 Source: Internet
I need to get the score of the similarity between texts, when one is inside the second.

For example:

Text1: aaa bbb ccc ddd eee
Text2: bbb ccc

I need something that tells me that Text2 is 100% inside Text1. Is there some way to do this?


Depending on what you want, you may try:

  • the length of the longest common subsequence of both texts, divided by the length of text2
  • or the length of the longest common substring (the longest contiguous common subsequence) of both texts, also divided by the length of text2

Both will give you 1 if text2 is completely inside text1 and 0 if the two texts do not share a single character.
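The two ratios above can be sketched in Python as follows. This is only an illustration under some assumptions: the function names (`lcs_length`, `longest_common_substring_length`, `containment_score`) are made up for this example, the comparison is done per character, and an empty text2 is arbitrarily scored as 0.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of characters that is contiguous in both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def containment_score(text1: str, text2: str, contiguous: bool = False) -> float:
    """Share of text2 (0.0 to 1.0) that is found inside text1."""
    if not text2:
        return 0.0  # assumption: an empty text2 scores 0 instead of raising
    measure = longest_common_substring_length if contiguous else lcs_length
    return measure(text1, text2) / len(text2)


text1 = "aaa bbb ccc ddd eee"
text2 = "bbb ccc"
print(containment_score(text1, text2))                   # 1.0
print(containment_score(text1, text2, contiguous=True))  # 1.0
```

The subsequence variant is more forgiving, since it still gives a high score when extra characters are inserted between the parts of text2, while the contiguous variant only rewards one unbroken match. Both run in time proportional to the product of the two lengths, so they may be slow on very long texts.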


You don't need Lucene to obtain the similarity between texts. There are several measures available, depending on the text length, the type of strings, etc., and you will need to experiment to see which gives you the best results.

A pretty good and comprehensive collection of algorithms is available in SimMetrics, an F/OSS library that offers an extensive collection of similarity algorithms and their corresponding cost functions.
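To illustrate why experimenting matters, here is a minimal sketch of one simple word-level measure: the fraction of distinct words in text2 that also occur in text1. It is not taken from SimMetrics or from the answer above. The name `token_containment` and the choices to lowercase and split on whitespace are assumptions made for this example.

```python
def token_containment(text1: str, text2: str) -> float:
    """Fraction of distinct words of text2 that also appear in text1."""
    # Assumption: words are whitespace-separated and compared case-insensitively.
    tokens1 = set(text1.lower().split())
    tokens2 = set(text2.lower().split())
    if not tokens2:
        return 0.0  # assumption: an empty text2 scores 0
    return len(tokens1 & tokens2) / len(tokens2)


print(token_containment("aaa bbb ccc ddd eee", "bbb ccc"))  # 1.0
print(token_containment("aaa bbb ccc ddd eee", "bbb zzz"))  # 0.5
```

Unlike a character-based measure, this one ignores word order and repetition, so it may suit longer texts with shuffled words better than short strings with typos.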


Please see the book Mining of Massive Datasets and Dekang Lin's definition of similarity (PDF). Neither requires Lucene.
