Solr, Special Chars, and Latin to Cyrillic char conversion

Developer https://www.devze.com 2023-04-10 16:10 Source: Internet

I am trying to set up a search engine using Solr (or Lucene) which could have text in either Latin with special chars (special chars would include Ö or Ç as an example) or Cyrillic chars (examples include Б or б and Ж ж).

Anyway, I am trying to find a solution to allow me to search for words with these characters in them, but for users who do not have the key on their keyboard...

Example would be (making up words here, hopefully won't offend anyone):

  • "BÖÖK" would be found when searching for "book"
  • "ЖRAY" would be found when searching for "XRAY"
  • "ЖRAY" would also be found if searching for "ZRAY", "ZHRAY", or "žray" (see GOST 16876-71 for info on transliteration of Cyrillic to Latin chars).

So, how should I go about this? Some theories I have are:

  • allow multiple text fields to be stored for each original string: one in original form, one after a first pass of transliteration (which, for example, would convert Ö to just O and Ж to ž, but also X), and then a third form after a second pass (from the ž to z or zh) -> means I will be storing a LOT of data...
  • store in Solr as is, and let Solr do the magic -> don't know how well this will work... can't see anything in Solr to do this
  • Magic bullet I have not found yet...
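
For what it's worth, the first option above does not necessarily mean storing every variant. A rough schema.xml sketch, assuming hypothetical field names (`title`, `title_folded`) and a hypothetical field type `text_folded` that stands for whatever type performs the folding or transliteration, could keep the original stored and only index the converted copy:

```xml
<!-- Hypothetical field names; "text_folded" is a placeholder for a field type
     whose analyzer does the folding/transliteration. -->

<!-- Original text: indexed and stored, returned to the user as-is. -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- Converted copy: indexed for searching only, not stored. -->
<field name="title_folded" type="text_folded" indexed="true" stored="false"/>

<!-- Solr fills the second field from the first at index time. -->
<copyField source="title" dest="title_folded"/>
```

Queries would then target `title_folded` (or both fields), while results still display the original `title`.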

Any ideas? Anyone tried this before?


Take a look at Solr's Analyzers, Tokenizers, and Token Filters which give you a good intro to the type of manipulation you're looking for.


You need to use an accent filter in both your index and query text analysis, which converts accented characters to their plain ASCII (English) equivalents.

You can use ISOLatin1AccentFilterFactory or ASCIIFoldingFilterFactory depending upon the Solr version you are using.

e.g.

 <filter class="solr.ASCIIFoldingFilterFactory" />

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

So - "BÖÖK" would be folded to "BOOK" and, provided the analysis chain also lowercases (for example with LowerCaseFilterFactory), indexed as "book" in Solr.
This would enable users to search for either "book" or "BÖÖK" and still get back the document.
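
A fuller field type might look roughly like the sketch below. The type name `text_folded` is made up for illustration. The Cyrillic step is an assumption on my part rather than something covered above: it uses `solr.ICUTransformFilterFactory` with the ICU transform ID `Cyrillic-Latin`, which needs the ICU analysis module (analysis-extras) on the classpath. ASCII folding on its own leaves Cyrillic letters untouched.

```xml
<!-- Hypothetical field type name. A single <analyzer> applies to both
     index-time and query-time analysis. -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumption: transliterate Cyrillic to Latin (Ж should become ž).
         Requires the ICU analysis module. -->
    <filter class="solr.ICUTransformFilterFactory" id="Cyrillic-Latin"/>
    <!-- Strip diacritics: Ö to O, ž to z. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Case-insensitive matching: BOOK to book. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "ЖRAY" should end up matching "zray", "ZRAY" and "žray". Variants such as "ZHRAY" or "XRAY" follow a different transliteration scheme and would need something extra, for example a synonym list or a custom character mapping.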
