Hadoop job taking input files from multiple directories

I have a situation with many files (100+, each 2-3 MB) in compressed gz format spread across multiple directories. For example:

A1/B1/C1/part-0000.gz

A2/B2/C2/part-0000.gz

A1/B1/C1/part-0001.gz

I have to feed all these files into one map job. From what I see, to use MultipleFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly into the job?

If not, is it possible to efficiently move these files into one directory without naming conflicts, or to merge them into a single compressed gz file?

Note: I am using plain Java to implement the Mapper, not Pig or Hadoop Streaming.

Any help regarding the above issue will be deeply appreciated.

Thanks,

Ankit


FileInputFormat.addInputPaths() can take a comma-separated list of multiple files or directories, like

FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz")