
Lucene.NET: Camel case tokenizer?

Developer https://www.devze.com 2023-01-15 10:22 Source: Internet

I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat the whole camel case source code identifier name as a single token.

I'm looking for a way to split camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer, but I couldn't find one. Before I write my own: does something like this already exist? Or is there a better approach than writing a tokenizer from scratch?
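To make the desired output concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name CamelCaseTokens, the method tokens and the boundary pattern are all hypothetical and chosen for illustration; this only shows the three tokens the question asks for, not how an analyzer would emit them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CamelCaseTokens {

    // Zero-width boundaries (an assumed definition of "camel case"):
    // a lowercase letter or digit followed by an uppercase letter ("maxWidth"),
    // or an uppercase letter followed by uppercase+lowercase ("XMLParser").
    private static final String BOUNDARY =
        "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Returns the lowercased identifier, followed by its lowercased parts if it splits. */
    static List<String> tokens(String identifier) {
        List<String> out = new ArrayList<>();
        out.add(identifier.toLowerCase(Locale.ROOT));
        String[] parts = identifier.split(BOUNDARY);
        if (parts.length > 1) {
            for (String part : parts) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("MaxWidth")); // prints [maxwidth, max, width]
    }
}
```

In a real analyzer the parts would normally be emitted as separate tokens at the same position as the full identifier, so that a search for either max or maxwidth matches.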

UPDATE: in the end I decided to get my hands dirty and I wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and I'll update the question.

Solr has a WordDelimiterFilterFactory, which creates a token filter similar to what you need. Maybe you can translate the source code into C#.

The link below might be helpful for writing a custom tokenizer...


Here is my implementation:

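package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    protected CamelCaseFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    // Declared final because Lucene expects incrementToken() to be final in TokenStream subclasses.
    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        // Rewrite the current term in place, e.g. "MaxWidth" becomes "Max Width".
        String splitString = splitCamelCase(_termAtt.toString());
        _termAtt.setEmpty().append(splitString);
        return true;
    }

    // Inserts a space at each camel case boundary.
    // NOTE: the pattern was cut off in the original post; the one below is an assumed reconstruction.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            "(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])",
            " ");
    }
}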


