Algorithm for sentence analysis and tokenization

Developer https://www.devze.com 2023-01-01 02:58 Source: Internet
I need to analyze a document and compile statistics on how many times each sequence of words is used (so the analysis is not on single words but on batches of recurring words). I read that compression algorithms do something similar to what I want: they build dictionaries of blocks of text, each with a piece of information recording its frequency. It should be something similar to http://www.codeproject.com/KB/recipes/Patterns.aspx. Do you have anything written in C#?


This is very simple to implement.

  1. Use Split (a member function of the string class) to split the string into words. (You can use the delimiters from the CodeProject URL.)

  2. Use a for loop to enumerate all the n-grams, and use a Dictionary<string, int> to keep the count of each one.
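
The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name `NGramCounter`, the method name `Count`, and the delimiter set are all assumptions made for the example (the CodeProject article's delimiters may differ), and lower-casing the text is an optional choice.

```csharp
using System;
using System.Collections.Generic;

public static class NGramCounter
{
    // Assumed delimiter set for illustration; adjust it to your documents.
    private static readonly char[] Delimiters =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Returns how many times each sequence of n consecutive words occurs.
    public static Dictionary<string, int> Count(string text, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // Step 1: split the text into words, dropping empty entries.
        string[] words = text.ToLowerInvariant()
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Step 2: slide a window of n words over the array and count each n-gram.
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + n <= words.Length; i++)
        {
            string key = string.Join(" ", words, i, n);
            counts.TryGetValue(key, out int current); // current is 0 if the key is missing
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

For example, `NGramCounter.Count("the cat sat on the mat and the cat slept", 2)` gives a count of 2 for "the cat". Note that because punctuation is treated as a plain delimiter here, an n-gram can span a sentence boundary; if that matters, split into sentences first and count within each one.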
