What is MapReduce in Hadoop?

July 14, 2016

Explaining MapReduce with an example...

 

MapReduce is a programming model, with an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster. In this article, we will see how MapReduce works in the Hadoop ecosystem with an example. Let’s assume we have a large data set of temperatures recorded in different cities at specific intervals on a particular day; the goal is to find the highest and lowest temperature recorded that day for each city.

 

The following three steps play a vital role in the MapReduce model.

  • Mapper

  • Combiner

  • Reducer

 

Mapper is the first step. In Hadoop, the framework splits the large dataset into chunks of smaller datasets (sub-datasets) and hands each one to a Mapper, so the computation runs on all of the sub-datasets in parallel. Splitting and mapping together make up the first step of a MapReduce job. The output of the split is the smaller datasets, whereas the output of the mapping is key-value pairs. The key-value pairs are formed per sub-dataset, meaning that every sub-dataset has its own associated key-value pairs.

So, in this case, the Mapper emits one key-value pair per reading, with the city as the key and the recorded temperature as the value.
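The map step for the temperature example can be sketched in plain Python. This is only an illustration of the idea, not the Hadoop API: the `city,time,temperature` line format, the sample readings, and the names `map_record` and `run_mapper` are all assumptions made up for this sketch.

```python
from typing import Iterator, List, Tuple

# Hypothetical input split: one "city,time,temperature" reading per line.
SAMPLE_SPLIT = [
    "Chennai,06:00,27.5",
    "Chennai,12:00,34.0",
    "Delhi,06:00,18.0",
    "Delhi,12:00,29.5",
]


def map_record(line: str) -> Iterator[Tuple[str, float]]:
    """Emit one (city, temperature) pair for a single input line."""
    city, _time, temperature = line.split(",")
    yield city.strip(), float(temperature)


def run_mapper(split: List[str]) -> List[Tuple[str, float]]:
    """Apply map_record to every line of one input split."""
    pairs = []
    for line in split:
        pairs.extend(map_record(line))
    return pairs


mapped = run_mapper(SAMPLE_SPLIT)
print(mapped)
```

Each Mapper would run the same function on its own sub-dataset, which is what makes this step easy to parallelize.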

Combiner is an optional step that acts as a mini-reducer on each Mapper's output, pre-aggregating the values locally before they are sent across the network. The framework then shuffles the key-value pairs formed by the Mappers: the pairs from all the sub-datasets are merged and sorted by their keys, and the values that share a key are grouped together under that key. This produces output of the form {Key = List<Values>}, ordered by key.
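A rough plain-Python sketch of these two pieces follows. It is not Hadoop code: the two mapper outputs are invented sample data, and `combine` and `shuffle_and_sort` are hypothetical helper names. The combiner here assumes that, for a min/max job, only the local lowest and highest value per city need to be passed on.

```python
from collections import defaultdict

# Hypothetical outputs of two mappers, each a list of (city, temperature) pairs.
MAPPER_OUTPUTS = [
    [("Chennai", 27.5), ("Chennai", 34.0), ("Chennai", 30.0), ("Delhi", 18.0)],
    [("Delhi", 29.5), ("Chennai", 31.0), ("Delhi", 16.5), ("Delhi", 22.0)],
]


def combine(pairs):
    """Local pre-aggregation: keep only the min and max per city for one mapper."""
    local = {}
    for city, temp in pairs:
        low, high = local.get(city, (temp, temp))
        local[city] = (min(low, temp), max(high, temp))
    # Re-emit as (city, temperature) pairs so the output keeps the mapper's shape.
    out = []
    for city, (low, high) in local.items():
        out.append((city, low))
        if high != low:
            out.append((city, high))
    return out


def shuffle_and_sort(all_pairs):
    """Group values by key and return the groups ordered by key."""
    grouped = defaultdict(list)
    for city, temp in all_pairs:
        grouped[city].append(temp)
    return dict(sorted(grouped.items()))


combined = [pair for output in MAPPER_OUTPUTS for pair in combine(output)]
grouped = shuffle_and_sort(combined)
print(grouped)
```

With this sample data the combiner drops the readings that can never be a minimum or maximum (30.0 for Chennai, 22.0 for Delhi), so fewer pairs reach the shuffle.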

Reducer performs the final step in the MapReduce paradigm: it processes the sorted, grouped key-value pairs and computes the required result.

 

In our case, the Reducer picks the highest and lowest temperature for each city from the values grouped under that city.
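As a minimal sketch of that reduce step in plain Python (again not the Hadoop API), assume the Reducer receives each city together with its list of temperatures. The `GROUPED` sample values and the function name `reduce_city` are hypothetical.

```python
# Hypothetical grouped input, as it might arrive at the reducer: {city: [temperatures]}.
GROUPED = {
    "Chennai": [27.5, 34.0, 31.0],
    "Delhi": [18.0, 16.5, 29.5],
}


def reduce_city(city, temperatures):
    """Return (city, lowest, highest) for one key and its list of values."""
    return city, min(temperatures), max(temperatures)


results = [reduce_city(city, temps) for city, temps in GROUPED.items()]
for city, lowest, highest in results:
    print(f"{city}\tlow={lowest}\thigh={highest}")
```

Because `min` and `max` give the same answer whether they see all readings or only each mapper's local extremes, this reducer works with or without a combiner in front of it.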

The output of the Reducer is not re-sorted.

 

Some of the advantages of the MapReduce framework are its cost-effectiveness, flexibility, and scalability, which come from its inherently parallel processing architecture. This scalability enables businesses to run MapReduce across many nodes and process huge volumes of data.

 

 

 

 

 

 

 

 
