
Why am I getting different results from MALLET topic inference for a single document vs. a batch of documents?

https://www.devze.com 2023-04-10 22:38 (source: web)

I'm trying to perform LDA topic modeling with Mallet 2.0.7. I can train a LDA model and get good results, judging by the output from the training session. Also, I can use the inferencer built in that process and get similar results when re-processing my training file. However, if I take an individual file from the larger training set, and process it with the inferencer I get very different results, which are not good.

My understanding is that the inferencer should use a fixed model plus only features local to the document being processed, so I do not understand why I would get different results when processing 1 file versus the 1k files in my training set. I am not applying frequency cutoffs, which would be a global operation that could have this kind of effect. You can see the other parameters I'm using in the commands below, but they're mostly defaults. Changing the number of iterations to 0 or 100 didn't help.

Import data:

bin/mallet import-dir \
  --input trainingDataDir \
  --output train.data \
  --remove-stopwords TRUE \
  --keep-sequence TRUE \
  --gram-sizes 1,2 \
  --keep-sequence-bigrams TRUE

Train:

time ../bin/mallet train-topics \
  --input ../train.data \
  --inferencer-filename lda-inferencer-model.mallet \
  --num-top-words 50 \
  --num-topics 100 \
  --num-threads 3 \
  --num-iterations 100 \
  --doc-topics-threshold 0.1 \
  --output-topic-keys topic-keys.txt \
  --output-doc-topics doc-topics.txt

Topics assigned during training to one file in particular; topic #14 is about wine, which is correct:

998 file:/.../29708933509685249 14  0.31684981684981683 
> grep "^14\t" topic-keys.txt 
14  0.5 wine spray cooking car climate top wines place live honey sticking ice prevent collection market hole climate_change winery tasting california moldova vegas horses converted paper key weather farmers_market farmers displayed wd freezing winter trouble mexico morning spring earth round mici torrey_pines barbara kinda nonstick grass slide tree exciting lots 
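For reference, each row of the doc-topics output above is a document index, a document name, and then alternating topic/proportion pairs (pairs below --doc-topics-threshold are omitted). A minimal parser sketch in Python, assuming that pair format; the file path in the example is a hypothetical placeholder, and the parser itself is mine, not part of MALLET:

```python
def parse_doc_topics_line(line):
    """Parse one row of MALLET's --output-doc-topics file in the
    thresholded pair format: <index> <name> <topic> <prop> [...]."""
    fields = line.split()
    doc_index = int(fields[0])
    doc_name = fields[1]
    # Pair up the alternating topic-id / proportion columns.
    pairs = fields[2:]
    topics = [(int(t), float(p)) for t, p in zip(pairs[0::2], pairs[1::2])]
    return doc_index, doc_name, topics

# Hypothetical row in the same shape as the output shown above.
idx, name, topics = parse_doc_topics_line(
    "998 file:/docs/wine-article.txt 14 0.31684981684981683")
# topics[0] is the dominant (topic, proportion) pair, here topic 14.
```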

Run inference on entire train batch:

../bin/mallet infer-topics \
  --input ../train.data \
  --inferencer lda-inferencer-model.mallet \
  --output-doc-topics inf-train.1 \
  --num-iterations 100

Inference score on the training set -- very similar:

998 /.../29708933509685249 14 0.37505087505087503 

Run inference on a separate data file built from only that 1 txt file:

../bin/mallet infer-topics \
  --input ../one.data \
  --inferencer lda-inferencer-model.mallet \
  --output-doc-topics inf-one.2 \
  --num-iterations 100

Inference on the single document produces topics 80 and 36, which are very different (topic 14 is given a near-zero score):

0 /.../29708933509685249 80 0.3184778184778185 36 0.19067969067969068
> grep "^80\t" topic-keys.txt 
80  0.5 tips dog care pet safety items read policy safe offer pay avoid stay important privacy services ebay selling terms person meeting warning poster message agree sellers animals public agree_terms follow pets payment fraud made privacy_policy send description puppy emailed clicking safety_tips read_safety safe_read stay_safe services_stay payment_services transaction_payment offer_transaction classifieds_offer 


The problem was an incompatibility between the small.data and one.data files. Even though I had been careful to use all of the same options, two data files will by default use different Alphabets (the mapping between words and integers). To correct this, import the single file with the --use-pipe-from [MALLET TRAINING FILE] option; with it, specifying the other import options appears to be unnecessary. Thanks to David Mimno.

bin/mallet import-dir \
  --input [trainingDataDirWithOneFile] \
  --output one.data \
  --use-pipe-from small.data 
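To illustrate the Alphabet mismatch, here is a toy sketch in Python (not MALLET code, and the corpora are made up): each independently created pipe assigns word IDs in first-seen order, so the same token gets a different integer in each data file, and the inferencer then reads the single document's IDs against the training Alphabet:

```python
def build_alphabet(docs):
    """Assign integer IDs to tokens in first-seen order, analogous to
    how a freshly created MALLET pipe grows its Alphabet on import."""
    alphabet = {}
    for doc in docs:
        for word in doc.split():
            alphabet.setdefault(word, len(alphabet))
    return alphabet

def encode(doc, alphabet):
    """Map a document's tokens to their integer IDs."""
    return [alphabet[w] for w in doc.split()]

# Toy corpora: the training set saw wine words first; the
# single-file import sees only its own tokens.
train_alphabet = build_alphabet(["wine tasting winery", "dog pet care"])
one_alphabet   = build_alphabet(["dog pet care"])

encode("dog pet care", train_alphabet)  # [3, 4, 5]
encode("dog pet care", one_alphabet)    # [0, 1, 2]

# The inferencer decodes [0, 1, 2] with the *training* Alphabet,
# i.e. as "wine tasting winery" -- hence the nonsense topics.
# --use-pipe-from makes the new file reuse train_alphabet instead.
```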
