Splitting a string into words in a culture-neutral way

Developer https://www.devze.com 2023-02-25 19:21 Source: web
I've come up with the method below, which aims to split a text of variable length into an array of words for further full-text index processing (stop word removal, followed by a stemmer). The results seem to be OK, but I would like to hear opinions on how reliable this implementation would be against texts in different languages. Would you recommend using a regex for this instead? Please note that I've opted against using String.Split() because that would require me to pass a list of all known separators, which is exactly what I was trying to avoid when I wrote the function.

P.S.: I can't use a full-blown full-text search engine like Lucene.Net for several reasons (Silverlight, overkill for the project scope, etc.).

public string[] SplitWords(string Text)
{
    var result = new List<string>();

    // Null or empty input yields no words (and Text[0] must not be read).
    if (string.IsNullOrEmpty(Text))
        return result.ToArray();

    var sbWord = new StringBuilder();

    for (int i = 0; i < Text.Length; i++)
    {
        char c = Text[i];

        // Separator or control char: the current word, if any, ends here.
        if (Char.IsSeparator(c) || Char.IsControl(c))
        {
            if (sbWord.Length > 0)
            {
                result.Add(sbWord.ToString());
                sbWord.Clear();
            }
        }
        // Non-separator char: keep it unless it is punctuation or a symbol.
        else if (!Char.IsPunctuation(c) && !Char.IsSymbol(c))
        {
            sbWord.Append(c);
        }
    }

    // Flush the last word when the text does not end with a separator.
    if (sbWord.Length > 0)
        result.Add(sbWord.ToString());

    return result.ToArray();
}


Since you asked for a culture-neutral way, I really doubt that a regular expression (word boundary: \b) will do. I have googled a bit and found this. Hope it is useful.
I am pretty surprised that there is no built-in equivalent of Java's BreakIterator...
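
For comparison, here is a rough sketch of what a regex-based version could look like if it matches Unicode character categories instead of relying on \b. It is only an illustration, not something taken from the question: the class name WordSplitter, the method name SplitWordsRegex, and the choice of categories (letters, combining marks, decimal digits) are all assumptions.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class WordSplitter
{
    // A "word" is assumed to be a run of letters (\p{L}),
    // combining marks (\p{M}) and decimal digits (\p{Nd}).
    private static readonly Regex WordPattern =
        new Regex(@"[\p{L}\p{M}\p{Nd}]+");

    public static string[] SplitWordsRegex(string text)
    {
        var result = new List<string>();

        // Null or empty input yields no words.
        if (string.IsNullOrEmpty(text))
            return result.ToArray();

        // Every match is one maximal run of the categories above.
        foreach (Match m in WordPattern.Matches(text))
            result.Add(m.Value);

        return result.ToArray();
    }
}
```

Calling WordSplitter.SplitWordsRegex("Hello, wörld! 123") gives "Hello", "wörld" and "123". Note one difference from the loop in the question: punctuation inside a word splits it here, so "don't" becomes "don" and "t", whereas the loop drops the apostrophe and yields "dont". Neither approach segments text in scripts that are written without spaces between words, such as Chinese, Japanese or Thai; that is the kind of job a BreakIterator-style segmenter does.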
