The best IR software for my use?

Developer (https://www.devze.com), 2023-04-05 18:17. Source: the web

I want to take what people chat about in a chat room and do the following information retrieval:

1. Get the keywords.
2. Ignore all noise words; keep mainly verbs and nouns.
3. Perform stemming on the keywords so that I don't store the same keyword in many forms.
4. If a synonym of a keyword is already in my storage, use the existing synonym instead of the new keyword.
5. Store the processed keyword in persistent storage with a reference to the chat message it was found in and the user who uttered it.

With this processed information I want to gradually build up a picture of what people are talking about in chat rooms, and then use these keywords to automatically find related chat rooms and so on.

My question to you is as follows: what are the best C/C++ or .NET tools for doing the above?

I partially agree with @larsmans' comment. In practice, your problem may indeed be more complex than the question you posted.

However, simplifying the problem, I guess the answer to your question could be one of Lucene's implementations: Lucene (Java), Lucene.Net (C#) or CLucene (C++).

Following the points in your question:

Lucene would take care of point 1 with its Tokenizers (you can customize one or write your own). For point 2 you could use a TokenFilter such as StopFilter, which lets Lucene read a list of stop words ("the", "a", "an"...) that it should discard. For point 3 you could use PorterStemFilter. Point 4 is a little trickier, but could be done with a custom TokenFilter. Points 1 to 4 are all performed in the analysis/tokenization phase, for which an Analyzer is responsible.
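To make the order of those steps concrete, here is a minimal, library-free sketch of such an analysis chain in plain Python. It does not use Lucene at all: the stop-word list, the synonym table, and every function name are hypothetical, and `crude_stem` is a toy suffix stripper standing in for a real Porter stemmer.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "am", "are", "to", "of", "in", "i", "it"}

# Hypothetical synonym table: maps a stemmed term to the canonical
# term that is assumed to be in storage already.
SYNONYMS = {"automobil": "car", "auto": "car"}


def tokenize(text):
    """Point 1: lowercase the message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Point 2: drop noise words."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Point 3: strip a few common suffixes (a toy, NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def canonicalize(token):
    """Point 4: reuse an already-known synonym when there is one."""
    return SYNONYMS.get(token, token)


def analyze(text):
    """Run the whole chain: tokenize, filter stop words, stem, map synonyms."""
    tokens = remove_stop_words(tokenize(text))
    return [canonicalize(crude_stem(t)) for t in tokens]


print(analyze("I am fixing the cars and the automobiles"))
```

Note that the stop-word filter runs before stemming, so the stop list can hold ordinary, unstemmed words. This sketch does nothing about the "keep mainly verbs and nouns" part of point 2; that would need a part-of-speech tagger, which is outside what a stop-word filter does.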

Regarding point 5, in Lucene you store Documents made up of fields. A document can have an arbitrary number and mix of fields. So you could create a single Document for each chat room, with all of its text concatenated in one field and another field referencing the chat room the text was extracted from. You will end up with a set of Lucene documents that you can compare with each other, so you can compare your current chat room with the others to see which one is most similar to the one you are in.
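As a rough illustration of what "comparing" chat rooms can mean, the sketch below treats each room as a bag of already-processed keywords and ranks the other rooms by cosine similarity. This is a deliberately simple stand-in and not how Lucene scores documents internally; the room names, keyword lists, and function names are all made up for the example.

```python
import math
from collections import Counter


def term_vector(keywords):
    """Count how often each processed keyword occurs in a chat room."""
    return Counter(keywords)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(current_room, rooms):
    """Return the name of the room closest to current_room, or None if there is no other room."""
    current = rooms[current_room]
    scores = {
        name: cosine_similarity(current, vector)
        for name, vector in rooms.items()
        if name != current_room
    }
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical rooms, one "document" per chat room.
rooms = {
    "garage": term_vector(["car", "engin", "fix", "car"]),
    "racing": term_vector(["car", "race", "engin"]),
    "kitchen": term_vector(["recip", "bake", "oven"]),
}

print(most_similar("garage", rooms))
```

Raw counts favor words that are frequent everywhere, so a real system would normally weight terms by how rare they are across rooms, which is the kind of thing a search library's scoring handles for you.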

If all you want is a set of the best keywords to describe a chat room, your needs are closer to an information extraction / automatic summarization / topic spotting task, as @larsmans said. But you can still use Lucene for the parsing/tokenization phase.

*I referenced the Java docs, but CLucene and Lucene.Net have very similar APIs so it won't be much trouble to figure out the differences.