
How to configure the indexer so that "word1.word2" is considered as two words

https://www.devze.com 2023-02-18 07:19 Source: web



Suppose a file 'test.txt' is being indexed, and the content of the file is:

word1.word2

What should I do to make Lucene treat "word1.word2" as two words, "word1" and "word2", and not as the single word "word1.word2"?

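When Lucene indexes a field with an analyzer, the analyzer converts the field's text into tokens, which are stored in the index as terms. (Technically, a document is made up of fields, and the analyzer is applied to the text of each analyzed field.)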

Basically, you can do one of the following:

1) Create a StopAnalyzer and pass it a HashSet containing "." (period) as a stop word. This can have adverse effects, since you must use the same analyzer for both indexing and searching.

2) Replace each "." with a space before indexing, so that the two words are indexed separately.
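The second option can be sketched in plain Java, without calling Lucene at all. The class name DotPreprocessor and the method splitOnDots are made up for this illustration; the idea is only that the text is rewritten before it is handed to whatever analyzer you use.

```java
final class DotPreprocessor {
    // Hypothetical helper: replaces every period with a space so that a
    // whitespace-based tokenizer sees "word1 word2" instead of "word1.word2".
    static String splitOnDots(String text) {
        return text.replace('.', ' ');
    }

    public static void main(String[] args) {
        String raw = "word1.word2";
        String prepared = splitOnDots(raw);
        System.out.println(prepared); // prints "word1 word2"
    }
}
```

Keep in mind that this replaces every period, including those in decimal numbers or abbreviations, and that query text would presumably need the same treatment so that searches match what was indexed.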

That depends on which Analyzer you are using. The short, generic answer is to use a SimpleAnalyzer, which uses a LetterTokenizer. The LetterTokenizer splits at any non-letter character, which includes the dot. Note that digits are non-letters too, so the digits in "word1.word2" would also be dropped from the tokens. If you have more specific tokenization requirements, you must write a custom Analyzer class whose tokenStream method returns a custom TokenStream or Tokenizer object.
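To see what "splitting at any non-letter" means for this input, here is a minimal plain-Java sketch that imitates the rule. It does not use Lucene; TokenizeSketch and tokenize are hypothetical names, and the second call shows one possible custom rule (letters or digits) of the kind a custom tokenizer might apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class TokenizeSketch {
    // Collects maximal runs of characters accepted by isTokenChar;
    // every other character acts as a separator and is dropped.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "word1.word2";
        // Letters only: the digits and the dot are all separators.
        System.out.println(tokenize(text, Character::isLetter));        // prints [word, word]
        // Letters or digits: only the dot separates the tokens.
        System.out.println(tokenize(text, Character::isLetterOrDigit)); // prints [word1, word2]
    }
}
```

With the letters-only rule both tokens come out as "word", so if the trailing digits matter, a rule that also accepts digits is closer to what the question asks for.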