开发者

Algorithm to find very common occurences of substrings in a set of short strings

开发者 https://www.devze.com 2023-04-13 00:31 出处:网络
I have a list of about 1500 strings from an external database and over time, as a group of business users managed them, they came to have recurring substrings which have semantic value.

I have a list of about 1500 strings from an external database and over time, as a group of business users managed them, they came to have recurring substrings which have semantic value.

I'm building a front-end and would like to present the user with filtering drop down list of those substrings.

For example if I have the input strings:

  • US foo
  • US bar (Inactive)
  • UK bat
  • UK baz (Inactive)
  • AU womp
  • AU rat

I want to get back:

  • US
  • UK
  • AU
  • Inactive

My first thoughts are to have a threshold parameter and a list of delimeters. For the above I might say threshold=.3 and delimiters are space, (, and ).

Then do a string.split on using the delimiters and use a datastructure like a set that that counts repeated items (?)...

I am not trying to have some开发者_如何学编程one do my work for me here - advice on the approach to take from someone who has done this would be great.


This problem is a good candidate for a Linq approach:

var words = from s in listOfStrings
            from word in s.Split(new[] { ' ', '(', ')' }, StringSplitOptions.RemoveEmptyEntries)
            group word by word;
var dic = words.ToDictionary(g => g.Key, g => g.Count());


A simple way would be something like you stated. Have a Dictionary<String, int> set up to contain your data. Then, it's easy:

for each word in string
   if word is in dictionary
      increment dictionary value
   else
      add to dictionary with value of 1

Then, simply filter that dictionary based on a threshold, or return the entries sorted by count. You may also choose to have an "ignore list" with common words you don't want to track.

Also, if you want case-insensitivity, construct the dictionary like this: new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);


var input = new List<string>();
input.Add("Foo"); // I'd go for splitting by delimiters as well
input.Add("Bar");
input.Add("Foo");
var results = input.Distinct(); // -> Foo, Bar

I'm not quite sure what your threshold is.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号