开发者

Calculate each Word Occurrence in large document

开发者 https://www.devze.com 2023-04-08 10:07 出处:网络
I was wondering how can I solve this problem by using which data structure.. Can anyone explain this in detail...!! I was thinking to use tree.

I was wondering how can I solve this problem by using which data structure.. Can anyone explain this in detail...!! I was thinking to use tree.

There is a large document. Which contains millions of words. so how you will calculate a 开发者_运维问答each word occurrence count in an optimal way?

This question was asked in Microsoft... Any suggestions will be appreciated..!!


I'd just use a hash map (or Dictionary, since this is Microsoft ;) ) of strings to integers. For each word of the input, either add it to the dictionary if it's new, or increment its count otherwise. O(n) over the length of the input, assuming the hash map implementation is decent.


Using a dictionary or hash set will result in o(n) on average.

To solve it in o(n) worst case, a trie with a small change should be used: add a counter to each word representation in the trie; Each time a word that is inserted already exists, increment its counter.

If you want to print all the amounts at the end, you can keep the counters on a different list, and reference it from the trie instead storing the counter in the trie.


class IntValue
{
    public IntValue(int value)
    {
        Value = value;
    }
    public int Value;
}

static void Main(string[] args)
{
    //assuming document is a enumerator for the word in the document:

    Dictionary<string, IntValue> dict = new Dictionary<string, IntValue>();
    foreach (string word in document)
    {
        IntValue intValue;
        if(!dict.TryGetValue(word, out intValue))
        {
            intValue = new IntValue(0);
            dict.Add(word, intValue);
        }

        ++intValue.Value;
    }

    //now dict contains the counts
}


Tree would not work here.

Hashtable ht = new Hashtable();
// Read each word in the text in its order, for each of them:
if (ht.contains(oneWord))
{
    Integer I = (Integer) ht.get(oneWord));
    ht.put(oneWord, new Integer(I.intValue()+1));
}
else
{
     ht.put(oneWord, new Integer(1));
}
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号