I'm trying to use Dumbo/Hadoop to calculate TF-IDF for a bunch of small text files using this example http://dumbotics.com/2009/05/17/tf-idf-revisited/
To improve efficiency, I've packaged the text files into a sequence file using Stuart Sierra's tool -- http://stuartsierra.com/2008/04/24/a-million-little-files
The sequence file uses my original filenames (324324.txt [the object_id.txt]) as the key and the file contents as the value.
The problem is that each line of output looks like:
[aftershocks, s3://mybucket/input/test-seq-file] 7.606329176204189E-4
What I want is:
[aftershocks, 324324.txt] 7.606329176204189E-4
What am I doing wrong?
I'm running the job with:
dumbo start tfidf.py -hadoop /home/hadoop -input s3://mybucket/input/test-seq-file -output s3://mybucket/output/test3 -param doccount=11 -outputformat text
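I made the following tweaks to the first mapper and everything started working: I removed the addpath option and yielded the key itself instead of key[0]. As far as I can tell, with addpath enabled Dumbo replaces each key with a (path, original key) pair, so key[0] was the path of the sequence file rather than the filename stored inside it.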
# Original version
@opt("addpath", "yes")
def mapper1(key, value):
    for word in value.split():
        yield (key[0], word), 1

# Edited version
def mapper1(key, value):
    for word in value.split():
        yield (key, word), 1
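To see why the two versions behave differently, here is a small sketch that runs both mappers in plain Python, without Hadoop or Dumbo. It assumes that addpath turns each key into a (path, original key) tuple. The names mapper1_original, mapper1_edited, seq_path, filename and text are only illustrative and are not part of Dumbo.

```python
def mapper1_original(key, value):
    # Assumption: with addpath enabled, key arrives as (input_path, original_key),
    # so key[0] is the path of the input file.
    for word in value.split():
        yield (key[0], word), 1


def mapper1_edited(key, value):
    # Without addpath, key is the sequence file's own key (the original filename).
    for word in value.split():
        yield (key, word), 1


seq_path = "s3://mybucket/input/test-seq-file"
filename = "324324.txt"
text = "aftershocks"

# Simulate the record each mapper would receive for the same document.
with_addpath = list(mapper1_original((seq_path, filename), text))
without_addpath = list(mapper1_edited(filename, text))

print(with_addpath)     # document id is the sequence file path
print(without_addpath)  # document id is the original filename
```

If the assumption about addpath holds, keeping the option and yielding (key[1], word) should give the same result as the edited version, but I haven't tried that.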