How to extend WhitespaceTokenizer?

Developer https://www.devze.com 2023-04-07 16:23 Source: web

I need to use a tokenizer that splits words on whitespace but that doesn't split if the whitespace is within double parentheses. Here is an example:

My input-> term1 term2 term3 ((term4 term5)) term6

should produce this list of tokens:

term1, term2, term3, ((term4 term5)), term6.

I think that I can obtain this behaviour by extending Lucene's WhitespaceTokenizer. How can I write this extension?

Are there any other solutions?

Thanks in advance.

I haven't tried to extend the Tokenizer, but I have a nice (I think) solution that uses a regular expression:

\w+|\(\([\w\s]*\)\)

And a method that collects every match of the regex in a string and returns the matches as an array. Code example:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCommandLine {

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";    // your input
        // A token is either a run of word characters or a ((...)) group.
        String[] parsedInput = splitByMatchedGroups(input, "\\w+|\\(\\([\\w\\s]*\\)\\)");

        for (String arg : parsedInput) {
            System.out.println(arg);
        }
    }

    // Returns every substring of 'string' that matches 'patternString', in order.
    static String[] splitByMatchedGroups(String string, String patternString) {
        List<String> matchList = new ArrayList<>();
        Matcher regexMatcher = Pattern.compile(patternString).matcher(string);

        while (regexMatcher.find()) {
            matchList.add(regexMatcher.group());
        }

        return matchList.toArray(new String[0]);
    }

}

The output:

term1
term2
term3
((term4 term5))
term6

Hope this helps.

Please note that the following code with the usual split():

String[] parsedInput = input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");

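will not return what you want, because split() treats the pattern as a delimiter: the matched tokens are discarded and only the text between them (here, the whitespace) is returned.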
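To make the difference concrete, here is a small sketch that runs the same pattern both ways. The class and method names (SplitVersusFind, findTokens, splitOnTokens) are made up for this illustration; only Pattern, Matcher and String.split() are standard library calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class SplitVersusFind {
    static final String TOKEN_REGEX = "\\w+|\\(\\([\\w\\s]*\\)\\)";

    // Collects every substring that matches the token pattern.
    static List<String> findTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(TOKEN_REGEX).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Uses the same pattern as a delimiter, so the tokens themselves are dropped.
    static List<String> splitOnTokens(String input) {
        return Arrays.asList(input.split(TOKEN_REGEX));
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        System.out.println(findTokens(input));    // the five tokens
        System.out.println(splitOnTokens(input)); // only the leftovers between tokens
    }
}
```

With the sample input, every element returned by splitOnTokens is empty or whitespace, which is why find() is the right tool here.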

You can do this by extending WhitespaceTokenizer, but I expect it will be easier if you write a TokenFilter that reads from a WhitespaceTokenizer and pastes together consecutive tokens based on the number of parentheses.
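The pasting-together logic can be sketched in plain Java, without Lucene, to show the idea. This is only an illustration under some assumptions: the names (ParenAwareMerger, balance, tokenize) are invented here, any open parenthesis (not only a double one) starts a group, and runs of whitespace inside a group are collapsed to a single space. A real TokenFilter would apply the same counting inside incrementToken while reading from the wrapped stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java sketch of the merging idea; not a Lucene TokenFilter.
class ParenAwareMerger {

    // Net parenthesis balance of one token: +1 per '(' and -1 per ')'.
    static int balance(String token) {
        int b = 0;
        for (char c : token.toCharArray()) {
            if (c == '(') {
                b++;
            } else if (c == ')') {
                b--;
            }
        }
        return b;
    }

    // Splits on whitespace, then glues pieces together while parentheses are open.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int depth = 0;
        for (String piece : input.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // happens only for empty input
            }
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(piece);
            depth += balance(piece);
            if (depth <= 0) { // all parentheses closed: emit the token
                out.add(pending.toString());
                pending.setLength(0);
                depth = 0;
            }
        }
        if (pending.length() > 0) {
            out.add(pending.toString()); // unbalanced tail is emitted as one token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("term1 term2 term3 ((term4 term5)) term6"));
    }
}
```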

Overriding incrementToken is the main task when writing a Tokenizer-like class. I once did this myself; the result might serve as an example (though for technical reasons, I couldn't make my class a TokenFilter).
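For a feel of what such a method has to do, here is a rough plain-Java sketch of a scanner that hands out one token per call, which is the same shape incrementToken has (one token per call, with a signal when the input is exhausted). It is not Lucene code and not the class mentioned above: the names ParenAwareScanner and next() are invented for this illustration, and it returns null at the end where incrementToken would return false.

```java
// Hypothetical character-level scanner; mirrors the one-token-per-call shape only.
class ParenAwareScanner {
    private final String text;
    private int pos = 0;

    ParenAwareScanner(String text) {
        this.text = text;
    }

    // Returns the next token, or null when the input is exhausted.
    String next() {
        // Skip whitespace between tokens.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            return null;
        }
        int start = pos;
        boolean inGroup = false; // true while between "((" and "))"
        while (pos < text.length()) {
            if (!inGroup && text.startsWith("((", pos)) {
                inGroup = true;
                pos += 2;
            } else if (inGroup && text.startsWith("))", pos)) {
                inGroup = false;
                pos += 2;
            } else if (!inGroup && Character.isWhitespace(text.charAt(pos))) {
                break; // whitespace outside a group ends the token
            } else {
                pos++;
            }
        }
        // An unclosed "((" makes the token run to the end of the input.
        return text.substring(start, pos);
    }

    public static void main(String[] args) {
        ParenAwareScanner scanner = new ParenAwareScanner("term1 term2 term3 ((term4 term5)) term6");
        for (String token = scanner.next(); token != null; token = scanner.next()) {
            System.out.println(token);
        }
    }
}
```

Unlike the whitespace-then-merge approach, this version keeps the whitespace inside a group exactly as it appears in the input.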