What Solr tokenizer and filters can I use for a strong general site search?

Developer https://www.devze.com 2023-04-12 20:45 Source: Internet
I'd like to ensure that, say, I.B.M. can be found by searching for ibm. I'd also like to make sure that Dismemberment Plan can be found by searching for dismember.

Using Solr, what tokenizer and filters can I use at analysis and query time to permit both kinds of results?


I.B.M. => ibm
You need solr.WordDelimiterFilterFactory, which splits tokens on intra-word delimiters such as periods and can catenate the resulting word and number parts.

catenateWords="1" catenates the word parts, so I.B.M. also produces the token IBM.

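Dismemberment => dismember
Include a stemming filter (e.g. solr.PorterStemFilterFactory or solr.EnglishMinimalStemFilterFactory), which indexes the roots of words so that words sharing a root match each other. Stemmers differ in how aggressively they reduce words (the minimal English stemmer mostly handles plurals), so check the output for your own terms on the Analysis screen of the Solr admin UI.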

In addition, you can use solr.LowerCaseFilterFactory for case-insensitive matches (IBM and ibm) and solr.ASCIIFoldingFilterFactory to fold accented characters to their ASCII equivalents.

You can also use solr.SynonymFilterFactory to map words that you consider synonyms.
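As a rough sketch of how these optional filters could be declared inside an analyzer: the file name synonyms.txt and the position of the filters in the chain are assumptions, not part of the original answer, so adjust them to your schema.

```xml
<!-- Optional filters; add them inside an <analyzer> element. -->
<!-- Folds accented characters to ASCII (e.g. "é" becomes "e"). -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<!-- synonyms.txt is an assumed file name, located in the core's conf directory. -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
```

Each line of the synonyms file lists equivalent terms separated by commas, for example `tv, television`.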

You can apply these filters at both index and query time, so that terms are normalized the same way on both sides and the results are consistent.

An example field type definition:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- Index time: split on delimiters and also index the catenated form (I.B.M. to IBM) -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemmer -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- Query time: same chain, but without catenation -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
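
To take effect, the field type has to be assigned to a field in the schema. A minimal sketch, assuming a hypothetical field named content (the name is illustrative and not from the original answer):

```xml
<!-- "content" is a hypothetical field name; use the field you actually search on. -->
<field name="content" type="text_en_splitting" indexed="true" stored="true"/>
```

Changes to index-time analysis only apply to documents indexed afterwards, so existing documents need to be reindexed.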

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
