开发者

Ruby: How does one test for similarity between two blocks of text?

开发者 https://www.devze.com 2023-04-11 07:15 出处:网络
So, lets say I have these texts: Text1: absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning

So, lets say I have these texts:

Text1:

absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients.

Text2:

zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate

Text 3

When the zerg first arrived in the Koprulu sector, they were unified by their absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate the advanced protoss race, it found useful but undeveloped material in humanity.

Now, The end of Text1 and the beginning of text2 overlap, so we'd say the text blocks aren't unique. Similarly, with Text3, Text1 can be found inside (as well as Text2) so this is also not unique, due to the overlap.

So, my question:

How do I go about writing something that can look at consecutive letters or words and determine uniqueness? Ideally, I'd want such a method to return some value, representing the amount of similarity--maybe the number of matched words over the average of the two text blocks' size. When it returns 0, both texts tested should be completely unique.

Some problem's I've run into when playing around with Ruby's string methods.

First, I started trying to find the intersection of two strings.

>> a = "nt version, there are no ch"  
>> b = "he current versi"  
>> (a.chars.to_a & b.chars.to_a).join  
=> "nt versihc"  

problem with the above method is that it just appends letters that are in common to the end of the result (we lose the order of characters), which would make it hard to test uniqueness. But I don't think intersection is the best way to start this similarity comparison. Any number of combinations of开发者_StackOverflow words could be present in both texts that are being compared. So maybe if I made an array of consecutive similarities... but that would require us to traverse one of the texts for as many times as we try phrase lengths.

I guess I really just don't know where to start, and in a way that is efficient and not O(n^too_high).


I believe you're looking for is the Longest Common Substring problem, i.e. the problem of finding, given two strings, of the longest substring they both have in common. The link is to the Wikipedia page which will help you understand the domain and contains a pseudocode example of an algorithm that runs in O(nm) time.

Further, Wikibooks' Algorithm Implementation book has an implementation in Ruby. It includes an lcs_size method that may be all you need. In short, if lcs_size(text1, text2) returns 4 that means text1 and text2 have very little consecutive text in common, probably just a single word, but if it returns, say, 40, they might have an entire sentence in common.

Hope that's helpful!


Here's a Ruby implementation of the Levenshtein distance algorithm. After installing the gem, you can use it like that:

require 'rubygems'
require 'Text'

t1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."

t2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

puts Text::Levenshtein.distance(t1,t2)


Your problem is not Ruby. It's the algorithm. You could split each text into words, then run a minimum distance algorithm (http://en.wikipedia.org/wiki/Levenshtein_distance) to get that.

The smaller the number, the more similar the texts are.


This could be improved a lot but it's an idea:

txt1 = "absolute obedience to the zerg collective sentience known as the Overmind. The Overmind directed the actions of every zerg creature in the Swarm, functioning through a hierarchy of lesser sentients."
txt2 = "zerg creature in the Swarm, functioning through a hierarchy of lesser sentients. Although the Overmind was primarily driven by its desire to consume and assimilate"

def txt_to_ary(txt)
    txt.gsub(/\.|,/, ' ').downcase.split(/\s+/)
end

def longest_match(txt1, txt2)
    longest = 0
    txt1.each_with_index do |w1, i|
        txt2.each_with_index do |w2, j|
            next unless w1 == w2
            k = 0
            k += 1 while txt1[i+k] == txt2[j+k]
            longest = k if k > longest          
        end
    end
    longest
end

txt1 = txt_to_ary( txt1 )
txt2 = txt_to_ary( txt2 )

puts longest_match(txt1, txt2) #=>12


The amatch gem is perfect for string comparisons.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号