Hadoop Sort map and reduce key value

开发者 (Developer) https://www.devze.com 2023-04-11 16:08 Source: Internet



If I had a file with random integers on each line and wanted to sort the file using Hadoop, what would my mapper and reducer's input/output key and value be?

Yahoo has sorted terabytes and petabytes of data. Others (including Google) do it on a regular basis; you can search for the sort benchmarks on the internet. Yahoo has published a paper on how they did it.

The 'org.apache.hadoop.examples.terasort' package has sample code for sorting data.
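To make the key/value question concrete, one common arrangement is for the mapper to emit each integer as the output key with an empty value, let the framework's shuffle sort by key, and have the reducer write each key once per occurrence. The sketch below only simulates that flow in plain Java: it uses no Hadoop classes, and `SortShuffleSketch`, `map`, `reduce`, and `run` are hypothetical names. In a real job with the default text input, the mapper input would be (byte offset, line), and types such as `IntWritable` and `NullWritable` would likely play the roles of the key and the empty value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a sort-by-key job; no Hadoop classes are used.
// All names are hypothetical and chosen for illustration only.
class SortShuffleSketch {

    // "Map" step: input is (byte offset, line). The output key is the parsed
    // integer; the value carries no information (here just a count of 1).
    static void map(long offset, String line, TreeMap<Integer, Integer> shuffle) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // skip blank lines
        }
        int key = Integer.parseInt(trimmed);
        // The TreeMap stands in for the framework's sort-by-key shuffle.
        shuffle.merge(key, 1, Integer::sum);
    }

    // "Reduce" step: keys arrive in sorted order; emit each key once per
    // occurrence so that duplicates in the input are preserved.
    static void reduce(int key, int occurrences, List<Integer> output) {
        for (int i = 0; i < occurrences; i++) {
            output.add(key);
        }
    }

    static List<Integer> run(List<String> lines) {
        TreeMap<Integer, Integer> shuffle = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, shuffle);
            offset += line.length() + 1; // +1 for the newline
        }
        List<Integer> output = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : shuffle.entrySet()) {
            reduce(e.getKey(), e.getValue(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [-3, 7, 7, 19, 42]
        System.out.println(run(Arrays.asList("42", "7", "19", "7", "-3")));
    }
}
```

With a single reducer this yields one globally sorted output. With several reducers and the default hash-based partitioning, each output file is sorted only within itself, which is where the total-order partitioning discussed in the other answers comes in.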

Found some more information at the Cloudera blog here. There are some built-in classes to make sorting easier.

Total order partitions HADOOP-3019. As a spin-off from the TeraSort record, Hadoop now has library classes for efficiently producing a globally sorted output. InputSampler is used to sample a subset of the input data, and then TotalOrderPartitioner is used to partition the map outputs into approximately equal-sized partitions. Very neat stuff — well worth a look, even if you don’t need to use it.

You can also find more information here.
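The sample-then-partition idea in the quoted passage can be sketched without Hadoop. The code below is not the `InputSampler` or `TotalOrderPartitioner` API; it is a plain-Java illustration under simple assumptions (integer keys, everything in memory), and every name in it is hypothetical. It samples the input, picks split points from the sorted sample, routes each key to a range partition, sorts each partition independently, and concatenates the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: mimics the idea of sampling + range partitioning.
class TotalOrderSketch {

    // Pick (numPartitions - 1) split points from a random sample of the input.
    static int[] sampleSplitPoints(List<Integer> input, int sampleSize,
                                   int numPartitions, long seed) {
        Random rnd = new Random(seed);
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = input.get(rnd.nextInt(input.size()));
        }
        Arrays.sort(sample);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sample[i * sampleSize / numPartitions];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0; a key equal to a split point
    // goes to the partition on that split point's right.
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    static List<Integer> totalOrderSort(List<Integer> input, int numPartitions,
                                        int sampleSize, long seed) {
        List<Integer> output = new ArrayList<>();
        if (input.isEmpty()) {
            return output;
        }
        int[] splits = sampleSplitPoints(input, sampleSize, numPartitions, seed);
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : input) {
            partitions.get(partitionFor(key, splits)).add(key);
        }
        for (List<Integer> part : partitions) {
            Collections.sort(part); // each "reducer" sorts only its own range
            output.addAll(part);    // ranges are ordered, so concatenation suffices
        }
        return output;
    }
}
```

Because every key in partition p is no larger than any key in partition p + 1, no final merge is needed; the quality of the sample only affects how evenly the work is spread.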

A more theoretical answer: consider the different sorting algorithms (quicksort, merge sort, bubble sort, etc.).

Because two sorted lists can be merged in linear time, it is quite simple to parallelize any sorting algorithm by putting a "merge" step on top of it. Thus, there is a wide range of options you could use to accomplish this task.
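As a minimal sketch of that idea, the following plain-Java code sorts two halves independently (which could happen on separate workers) and then combines them with a linear-time merge. The class and method names are hypothetical, and `Arrays.sort` stands in for whichever sorting algorithm each worker uses.

```java
import java.util.Arrays;

class MergeSketch {

    // Merge two already-sorted arrays in O(a.length + b.length) time.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++]; // drain the rest of a
        while (j < b.length) out[k++] = b[j++]; // drain the rest of b
        return out;
    }

    // Sort two halves independently, then merge the sorted halves.
    static int[] splitSortMerge(int[] data) {
        int mid = data.length / 2;
        int[] left = Arrays.copyOfRange(data, 0, mid);
        int[] right = Arrays.copyOfRange(data, mid, data.length);
        Arrays.sort(left);
        Arrays.sort(right);
        return merge(left, right);
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 7, 8, 9]
        System.out.println(Arrays.toString(splitSortMerge(new int[]{9, 1, 8, 2, 7, 3})));
    }
}
```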

TeraSort is much smarter than this, however, because simply splitting and merging won't solve all your problems: when you have a lot of splits, the final "merge" step becomes one massive reduce step.
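To see why that final step is a bottleneck, here is a rough plain-Java sketch of merging k sorted runs on a single node with a min-heap (names are hypothetical). It is correct, but all N records pass through one place at a cost of about O(N log k), no matter how many workers produced the runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class KWayMergeSketch {

    // Merge k sorted runs on a single node using a min-heap of run heads.
    static List<Integer> mergeAll(List<int[]> sortedRuns) {
        // Heap entries: {value, run index, position within that run}.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        int total = 0;
        for (int r = 0; r < sortedRuns.size(); r++) {
            int[] run = sortedRuns.get(r);
            total += run.length;
            if (run.length > 0) {
                heap.add(new int[]{run[0], r, 0});
            }
        }
        List<Integer> out = new ArrayList<>(total);
        while (!heap.isEmpty()) {
            int[] head = heap.poll();      // smallest remaining value
            out.add(head[0]);
            int[] run = sortedRuns.get(head[1]);
            int next = head[2] + 1;
            if (next < run.length) {       // refill from the same run
                heap.add(new int[]{run[next], head[1], next});
            }
        }
        return out;
    }
}
```

Range partitioning avoids this single-node merge by making the partitions themselves ordered, so their sorted outputs can simply be concatenated.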