MapReduce - What is the benefit in the word count example

Developer https://www.devze.com 2023-04-05 00:49 Source: Internet
I am trying to understand the benefit of MapReduce; I have just read some introductions to it for the first time.

They all use the canonical example of counting words in a large set of documents, but I am not seeing the benefit. The following is my current understanding; correct me if I'm wrong.

We specify a list of input files (documents). The MapReduce library takes this list and divides it among the processors in the cluster. Each document at a processor is passed to the map function, which in this case returns a list of (word, count) pairs.

Here is where I am a little unsure what exactly happens. The library software then searches through the results on all the different processors and groups together the pairs with the same word (key). These groups are collected at different processors, and reduce is called on each group at that processor.

Combined results are then collected on the master node.
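To make the steps described above concrete, here is a minimal single-process sketch of the word count flow. It only illustrates the map, group-by-key, and reduce stages; the function names (map_fn, shuffle, reduce_fn, word_count) are hypothetical and do not belong to any particular MapReduce library, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_fn(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    """Group values by key, the step the framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Collapse all counts collected for one word into a single total."""
    return word, sum(counts)


def word_count(documents):
    # In a cluster each document would be mapped on a different processor;
    # here the map calls simply run one after another.
    mapped = [pair for doc in documents for pair in map_fn(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(docs))
```

In this sketch the grouping step does no counting itself; it only brings together the values that share a key so that reduce can be applied to each group independently.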

Is this the correct interpretation?

What I don't understand is this: since it's necessary to sort through all the results to group the keys, why not just count the keys at the same time? Why is reduce needed at all? How does this process save time when there seems to be a lot of work involved in finding and combining common keys?


Here is a nice YouTube video on the MapReduce algorithm. If you watch the complete series of 5 videos, it will give you much more clarity on MapReduce and answer most of your questions.

Regarding your question: "What I don't understand is, as it's necessary to sort through all the results to group keys, why not just count the keys it finds at the same time, why is reduce needed at all? How does this process save time when it seems like there is a lot of work to find and combine common keys?"

Key/value pairs for a particular word, such as "sample" in the word count example, may be emitted by different map tasks and will therefore be spread across different nodes. These key/value pairs need to be consolidated and sorted before they are sent to the reduce task. The reduce task for a particular key runs on a single node and is not distributed, so that node sees every value for the key and can produce the final count.
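One common way to make every pair for a given key land on the same reducer is to pick the reducer from a hash of the key. The sketch below assumes that scheme; the names partition, route, and reduce_all and the value of NUM_REDUCERS are hypothetical, and zlib.crc32 is used only because it gives the same result in every process, unlike Python's built-in hash() for strings.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(map_outputs, num_reducers=NUM_REDUCERS):
    """Send each (word, count) pair from every map task to the reducer that owns the word."""
    inboxes = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for word, count in pairs:
            inboxes[partition(word, num_reducers)][word].append(count)
    return inboxes


def reduce_all(inboxes):
    """Each reducer sums only the keys it owns, independently of the other reducers."""
    return [{word: sum(counts) for word, counts in inbox.items()} for inbox in inboxes]


# Three map tasks, possibly on three different nodes, all emit "sample".
map_outputs = [
    [("sample", 1), ("text", 1)],
    [("sample", 1), ("data", 1)],
    [("sample", 1)],
]
results = reduce_all(route(map_outputs))

if __name__ == "__main__":
    for index, result in enumerate(results):
        print(f"reducer {index}: {result}")
```

Because no key is shared between reducers, the reducers can all run in parallel, which is where the time saving comes from when the number of distinct keys is large.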

FYI, the results from a map task can be combined on the same node as the map task using a combiner class (which in the word count example is the same as the reducer class). This decreases the network chatter between the mappers and the reducers.
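A rough illustration of the combiner idea, as a sketch rather than any framework's actual API: the map output of one input split is summed locally before anything is sent over the network. The names map_fn, combine, and reduce_fn are hypothetical. Reusing the reducer's logic as the combiner works here because addition is associative and commutative, so partial sums can be summed again without changing the result.

```python
from collections import Counter


def map_fn(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def combine(pairs):
    """Sum counts per word locally, on the mapper's node, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_fn(all_pairs):
    """Final per-word totals; accepts raw or already combined pairs."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


split = "to be or not to be"
raw = map_fn(split)       # one pair per word occurrence
combined = combine(raw)   # one pair per distinct word

if __name__ == "__main__":
    print(len(raw), "pairs without a combiner")
    print(len(combined), "pairs with a combiner")
```

The reducer produces the same totals from either input; the combiner only reduces how many pairs have to cross the network.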
