Lucene doesn't search text having '_' [duplicate]

Developer https://www.devze.com 2023-04-06 05:32 Source: web

This question already has answers here: Closed 11 years ago.

Possible Duplicate:
Lucene search and underscores

I am using Lucene full text search for searching in my application.

But for example, if I search for 'Turbo_Boost' it returns 0 results.

For other text it works fine.

Any idea?

Assuming you are using the StandardTokenizer, it will split on the underscore character.

You can get around this by providing your own Tokenizer which will keep the underscore in the Token that's returned (either through a combination of Filter instances or TokenFilter instances).

A general rule of thumb with Lucene is to tokenize your search queries using the same Tokenizer/Analyzer you used to index the data.

see http://wiki.apache.org/lucene-java/LuceneFAQ#Why_is_it_important_to_use_the_same_analyzer_type_during_indexing_and_search.3F
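To make the idea of a tokenizer that keeps the underscore concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene, so it only models the behaviour: the class `UnderscoreTokenizerSketch` and its methods `isTokenChar` and `tokenize` are hypothetical names, and lowercasing the tokens is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a custom tokenizer; this is NOT Lucene's Tokenizer API.
final class UnderscoreTokenizerSketch {

    // Letters, digits and '_' belong to a token; everything else separates tokens.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Split text into lowercased tokens, keeping underscores inside a token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The same function is applied to the indexed text and to the query.
        List<String> indexed = tokenize("Enable Turbo_Boost in the BIOS");
        List<String> query = tokenize("Turbo_Boost");
        System.out.println(indexed);                    // [enable, turbo_boost, in, the, bios]
        System.out.println(indexed.containsAll(query)); // true
    }
}
```

Because both sides go through the same `tokenize` call, the query term and the indexed term are guaranteed to have the same shape, which is the point of the rule of thumb above.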

I can only think of a few reasons why your query would fail:

First, and probably the least likely given that other text searches work fine: you didn't set the document's field to be analyzed. In that case it won't be tokenized, so you can only search against the exact value of the whole field. Again, this one is probably not your issue.

The second (related to the third), and fairly likely, depends on how you're executing the search. If you are not using the QueryParser (which, if constructed properly, analyzes your text the same way it was indexed) and are instead using, say, a TermQuery like:

var tq = new TermQuery(new Term("Field", "Turbo_Boost"));

That could cause your search to fail. A TermQuery is not analyzed, so if the Analyzer you used to index the document split "Turbo_Boost" or changed its case when it was indexed, the string comparison at search time will fail.
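The difference between an unanalyzed term lookup and an analyzed query can be sketched with a toy inverted index in plain Java. Nothing here is Lucene's API: `ExactTermLookupSketch`, `analyze`, `exactTerm`, and `analyzedQuery` are hypothetical names, the analyzer (lowercase, then split on non-letters) is an assumption, and requiring every analyzed term to match is a simplification of what a real query parser produces.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index; a stand-in for the idea, not Lucene's API.
final class ExactTermLookupSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Hypothetical analyzer: lowercase, then split on anything that is not a letter.
    static String[] analyze(String text) {
        return text.toLowerCase().split("[^\\p{L}]+");
    }

    // Index a document under every non-empty analyzed term.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Like a term query: the string is looked up as-is, with no analysis.
    Set<Integer> exactTerm(String term) {
        return postings.getOrDefault(term, Set.of());
    }

    // Like a parsed query: analyze first, then require every resulting term.
    Set<Integer> analyzedQuery(String text) {
        Set<Integer> hits = null;
        for (String term : analyze(text)) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = exactTerm(term);
            if (hits == null) {
                hits = new HashSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? Set.of() : hits;
    }

    public static void main(String[] args) {
        ExactTermLookupSketch index = new ExactTermLookupSketch();
        index.add(1, "Turbo_Boost support");
        // The index holds "turbo" and "boost", so the raw string matches nothing.
        System.out.println(index.exactTerm("Turbo_Boost"));
        // Analyzing the query first yields the same terms that were indexed.
        System.out.println(index.analyzedQuery("Turbo_Boost"));
    }
}
```

In this model the raw lookup returns no documents while the analyzed query finds document 1, which mirrors the failure mode described above.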

The third, and even more likely, has to do with the Analyzer class you're using to index your items, versus the one you're using to search with. Using the same analyzer is important, because each analyzer uses a different Tokenizer that splits the text into searchable terms.

Let me give you some examples using your own Turbo_Boost query on how each analyzer will split the text into terms:

KeywordAnalyzer, WhitespaceAnalyzer -> Field:Turbo_Boost

SimpleAnalyzer, StopAnalyzer -> Field:turbo Field:boost

StandardAnalyzer -> Field:turbo Field:boost

You'll notice that some of the Analyzers split the term on the underscore character, while KeywordAnalyzer and WhitespaceAnalyzer keep it. It is extremely important that you use the same analyzer when you search; otherwise you may not get the results you expect. A mismatch can also cause issues where the query sometimes finds results and other times doesn't, depending on the query used.
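The splits listed above can be approximated with a few lines of plain Java. This is only a rough model of the three splitting strategies, not the real analyzers: the class `AnalyzerComparisonSketch` and its methods are hypothetical names, and details such as stop-word removal are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Rough stdlib approximations of three splitting strategies; not Lucene's analyzers.
final class AnalyzerComparisonSketch {

    // Keyword style: the whole input is a single term.
    static List<String> keyword(String text) {
        return List.of(text);
    }

    // Whitespace style: split on runs of whitespace only, case preserved.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Letter-and-lowercase style: split on any non-letter, then lowercase.
    static List<String> lettersLowercased(String text) {
        return Arrays.stream(text.split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "Turbo_Boost";
        System.out.println("keyword:    " + keyword(input));           // [Turbo_Boost]
        System.out.println("whitespace: " + whitespace(input));        // [Turbo_Boost]
        System.out.println("letters:    " + lettersLowercased(input)); // [turbo, boost]
    }
}
```

If the index was built with the letter-splitting style and the query goes through one of the other two, the query term `Turbo_Boost` has no counterpart among the indexed terms `turbo` and `boost`.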

As a side note, if you are using the StandardAnalyzer, it's also important to use the same Version for both the IndexWriter and the QueryParser, because the parsing differs depending on which version of Lucene you expect it to emulate.

My guess is that your issue is one of the reasons above.