
How can I use Lucene's ShingleAnalyzerWrapper + StandardAnalyzer + IndexReader?

Developer https://www.devze.com 2023-03-04 22:38 Source: Internet

I hope you can help me with this problem. What I intend to do: given a text, I want to count the frequency of every n-gram of stemmed tokens, with the stopwords excluded (in other words, the stopwords are removed before the n-grams are built).

This is the situation: I am indexing some texts with IndexWriter using ShingleAnalyzerWrapper + StandardAnalyzer. I add each document to the IndexWriter like this: indexwriter.addDocument(doc, analyzer); where analyzer is, again, ShingleAnalyzerWrapper + StandardAnalyzer.

The problem is: when I read back the terms and their frequencies, the stopwords seem to have been replaced by underscores.

This is the input:

String text = "to i want to to i want to linked";

String text2 = "super by by hard easy ";

This is the output:

term:|freq:6

term: _|freq:2

term:_ hard|freq:1

term:_ i|freq:2

term:_ link|freq:1

term:easy|freq:1

term:hard|freq:1

term:hard easy|freq:1

term:i|freq:2

term:i want|freq:2

term:link|freq:1

term:super|freq:1

term:super _|freq:1

term:want|freq:2

term:want _|freq:2

If anything is unclear, please ask and I will try to explain it better.

Thanks for the help


Please see http://www.lucidimagination.com/search/document/e5681676403a007b/can_i_omit_shinglefilter_s_filler_tokens for some solutions.

In this case you probably want to disable position increments on your stop filter. You don't want to leave a "hole" where each stopword was; you want the analyzer to behave as if the stopwords never existed.
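
To make the effect of those "holes" concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: ShingleHoleDemo, removeStopwords and countUnigramsAndBigrams are hypothetical names made up for this illustration, and the stopword set is reduced to the two words that matter for the inputs in the question. It only mimics the idea that a removed stopword either leaves an empty position behind (which a shingle step fills with "_") or leaves nothing. In older Lucene releases the real switch was, as far as I recall, StopFilter.setEnablePositionIncrements(false); please check the Javadoc of the version you use before relying on that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names are Lucene classes.
class ShingleHoleDemo {
    // Assumption: a tiny stopword set, just enough for the example inputs.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("to", "by"));
    static final String FILLER = "_";

    // Drops stopwords. With keepHoles=true every removed word leaves a filler
    // slot behind (like a position gap); with keepHoles=false it leaves nothing.
    static List<String> removeStopwords(List<String> words, boolean keepHoles) {
        List<String> slots = new ArrayList<>();
        for (String w : words) {
            if (!STOPWORDS.contains(w)) {
                slots.add(w);
            } else if (keepHoles) {
                slots.add(FILLER);
            }
        }
        return slots;
    }

    // Counts real unigrams plus bigrams of adjacent slots, skipping bigrams
    // that consist of two fillers.
    static Map<String, Integer> countUnigramsAndBigrams(List<String> slots) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < slots.size(); i++) {
            String cur = slots.get(i);
            if (!cur.equals(FILLER)) {
                counts.merge(cur, 1, Integer::sum);
            }
            if (i + 1 < slots.size()) {
                String next = slots.get(i + 1);
                if (!(cur.equals(FILLER) && next.equals(FILLER))) {
                    counts.merge(cur + " " + next, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("super", "by", "by", "hard", "easy");
        System.out.println("holes kept:    "
                + countUnigramsAndBigrams(removeStopwords(words, true)));
        System.out.println("holes removed: "
                + countUnigramsAndBigrams(removeStopwords(words, false)));
    }
}
```

With the holes kept, the second input from the question yields the terms "super _" and "_ hard", just like the output above. With the holes removed, those two terms disappear and "super hard" is counted instead, which is the behaviour the question asks for.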
