
How to configure the indexer so that "word1.word2" is considered as two words

Developer https://www.devze.com 2023-02-18 07:19 Source: Internet

Suppose a file 'test.txt' is being indexed, and the content of the file is:

word1.word2

What should I do to make Lucene treat "word1.word2" as two words, "word1" and "word2", rather than the single word "word1.word2"?


When Lucene indexes text with an analyzer, the analyzer converts the text of each field into a stream of tokens, which are stored in the index as terms. (A document, in turn, is made up of such fields.)

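Basically, you can:

1) Create a StopAnalyzer and pass it a HashSet of stop words containing "." (period). This can have adverse effects on indexing, since you must use the same analyzer when searching and indexing. Note that a stop-word set only removes whole tokens; any split at the dot comes from StopAnalyzer's letter-based tokenizer, not from the stop word itself.

2) Replace the "." with a space before indexing, so that the two words are indexed separately.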
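Option 2 can be sketched in plain Java as a pre-processing step that runs before the text is handed to Lucene. This is only an illustration under assumptions: the class name `DotSplitter` and its methods are hypothetical, no Lucene API is involved, and the whitespace split merely stands in for whatever whitespace-based analyzer you actually use.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: pre-processes text before indexing.
final class DotSplitter {
    // Replace every '.' with a space so a whitespace-based tokenizer sees separate words.
    static String dotsToSpaces(String text) {
        return text.replace('.', ' ');
    }

    // Stand-in for a whitespace tokenizer: split the cleaned text on runs of whitespace.
    static List<String> words(String text) {
        String cleaned = dotsToSpaces(text).trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(words("word1.word2")); // prints [word1, word2]
    }
}
```

Keep in mind that this replaces every dot, so values such as "3.14" or "example.com" would be split as well, and the same replacement has to be applied to query text if searches are expected to match.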


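That depends on which Analyzer you are using. The short generic answer is to use a SimpleAnalyzer, which uses a LetterTokenizer. The LetterTokenizer splits at every non-letter character, which includes the dot. Be aware that digits are non-letters too, so for the literal text "word1.word2" it would produce "word" and "word" rather than "word1" and "word2". If you have more specific tokenization requirements, you must write a custom Analyzer class whose tokenStream method returns a custom TokenStream or Tokenizer object (in Lucene 4 and later, the method to override is createComponents).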
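To see what a character-based splitting rule does to this input, here is a minimal plain-Java emulation. It is not Lucene code: `CharRuleTokenizer` and `tokenize` are hypothetical names, and the sketch only assumes that a letter-based tokenizer keeps maximal runs of characters for which `Character.isLetter` is true. Swapping the predicate for `Character.isLetterOrDigit` shows the kind of rule a custom tokenizer would need in order to keep the digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical emulation of a character-rule tokenizer; no Lucene classes are used.
final class CharRuleTokenizer {
    // Collect maximal runs of code points accepted by isTokenChar; everything else separates tokens.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Letter-only rule: digits act as separators and are dropped.
        System.out.println(tokenize("word1.word2", Character::isLetter));        // prints [word, word]
        // Letter-or-digit rule: only the dot separates tokens.
        System.out.println(tokenize("word1.word2", Character::isLetterOrDigit)); // prints [word1, word2]
    }
}
```

Whichever rule you settle on, the same analyzer should be used at index time and at query time, otherwise the indexed terms and the query terms will not line up.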
