classifier4J with compound words

Developer https://www.devze.com 2023-01-07 23:32 Source: web
I'm using the BayesianClassifier class to classify spam. The problem is that compound words aren't being recognized.

For instance, if I add "led zeppelin" as a match, a sentence containing it won't be recognized as a match, even though it should be.

To add a match I'm using addMatch() of SimpleWordsDataSource.

To test for a match I'm using isMatch() of BayesianClassifier.

Any ideas on how to fix this?


Ok, thanks for the insight. I'm attaching more source code.

SimpleWordsDataSource wds = new SimpleWordsDataSource();
BayesianClassifier classifier = new BayesianClassifier(wds);

wds.addMatch("queen");
wds.addMatch("led zeppelin");
wds.addMatch("the beatles");

classifier.isMatch("i listen to queen");        // recognized as a match
classifier.isMatch("i listen to led zeppelin"); // NOT recognized as a match
classifier.isMatch("i listen to the beatles");  // NOT recognized as a match

Now I'm using the teachMatch method of BayesianClassifier and I get different results. A sentence containing "led zeppelin" is classified as a match, which is correct. But a sentence containing only "led" is also classified as a match, which is wrong.

Here's the relevant code:

BayesianClassifier classifier = new BayesianClassifier();
classifier.teachMatch("led zeppelin");
classifier.isMatch("I listen to led zeppelin"); // true
classifier.isMatch("I listen to led");          // true


(I wrote classifier4j)

You need to train it with more data.

Bayesian classifiers work by creating statistical models of what is considered a match and what isn't.

If you give it enough data, it will learn that "led" together with "zeppelin" is a match, but "led" by itself isn't.
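To make the point about training data concrete, here is a minimal, self-contained sketch of a word-level Bayesian classifier. It is a toy model written for illustration, not Classifier4J's actual implementation: the class name TinyBayesSketch, the 0.9 cutoff, the 0.5 score for unseen words, the 0.01/0.99 clamping, and the "led lamp" / "led display" training sentences are all assumptions. It only borrows the method names teachMatch, teachNonMatch, and isMatch so it reads like the code in the question.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy word-level Bayesian classifier. A sketch, NOT Classifier4J's implementation. */
public class TinyBayesSketch {
    private static final double CUTOFF = 0.9;   // assumed decision threshold
    private static final double NEUTRAL = 0.5;  // assumed probability for unseen words

    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();

    public void teachMatch(String text) { count(text, matchCounts); }

    public void teachNonMatch(String text) { count(text, nonMatchCounts); }

    // Lower-case the text and split it on runs of non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    private static void count(String text, Map<String, Integer> counts) {
        for (String word : tokenize(text)) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Fraction of sightings of this word that were in match text, clamped to [0.01, 0.99].
    double wordProbability(String word) {
        int m = matchCounts.getOrDefault(word, 0);
        int n = nonMatchCounts.getOrDefault(word, 0);
        if (m + n == 0) {
            return NEUTRAL;
        }
        double raw = (double) m / (m + n);
        return Math.max(0.01, Math.min(0.99, raw));
    }

    // Combine per-word probabilities: prod(p) / (prod(p) + prod(1 - p)).
    public double score(String text) {
        double p = 1.0;
        double q = 1.0;
        for (String word : tokenize(text)) {
            if (word.isEmpty()) {
                continue;
            }
            double wp = wordProbability(word);
            p *= wp;
            q *= 1.0 - wp;
        }
        return p / (p + q);
    }

    public boolean isMatch(String text) { return score(text) >= CUTOFF; }

    public static void main(String[] args) {
        TinyBayesSketch c = new TinyBayesSketch();
        c.teachMatch("led zeppelin");
        // Only positive evidence so far, so "led" alone looks like a match.
        System.out.println(c.isMatch("I listen to led"));

        // Hypothetical counter-examples in which "led" is not about the band.
        c.teachNonMatch("led lamp");
        c.teachNonMatch("led display");
        System.out.println(c.isMatch("I listen to led"));
        System.out.println(c.isMatch("I listen to led zeppelin"));
    }
}
```

In this sketch, after teachMatch("led zeppelin") alone, every sighting of "led" was in match text, so the word on its own pushes the score to the 0.99 ceiling. Once "led" has also been seen in non-match text, its probability drops to 1/3 and "I listen to led" falls below the cutoff, while "zeppelin" still carries the full phrase to roughly 0.98.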
