MapReduce Data Flow
Now let’s walk through the complete end-to-end data flow of Hadoop MapReduce: how input is given to the mapper, how mappers process data, where mappers write their output, how data is shuffled from mapper nodes to reducer nodes, where reducers run, and what type of processing is done in the reducers.
As seen in the diagram, each square block is a slave node; there are 3 slaves in the figure. Mappers run on all 3 slaves, and then a reducer runs on any one of them. For simplicity, the figure shows the reducer on a separate machine, but in practice it runs on one of the mapper nodes. A mapper's input is one block at a time (by default, one input split equals one block). The mapper's output is written to the local disk of the machine on which the mapper is running. Once the map task finishes, this intermediate output travels to the reducer node (the node where the reducer will run).
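To make this concrete, here is a minimal mapper sketch using the classic word-count example (the class name WordCountMapper is illustrative, not from this article). It receives one line of its input split at a time as a (byte offset, line text) pair and emits (word, 1) pairs as intermediate output, which the framework writes to the local disk of the mapper node.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal word-count mapper: input key = byte offset of the line in the split,
// input value = the line itself; output = (word, 1) pairs (intermediate output).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate output, stored on local disk, not HDFS
        }
    }
}
```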
The reducer is the second phase of processing, where the user can again write custom business logic. The output of the reducer is the final output, and it is written to HDFS. By default, 2 mappers run at a time on a slave, and this number can be increased as required; it depends on factors such as DataNode hardware, block size, and machine configuration. However, the number of concurrent mappers should not be increased beyond a certain limit, because doing so degrades performance.
The mapper writes its output to the local disk of the machine it is running on. This is temporary data, also called intermediate output. All mappers write their output to local disk. As each mapper finishes, its output travels from the mapper node to the reducer node. This movement of output from mapper nodes to reducer nodes is called the shuffle.
The reducer is also deployed on one of the DataNodes. The output from all the mappers goes to the reducer, and these outputs are merged to form the reducer's input, which is also stored on local disk. The reducer is the second stage of processing, where you can again write custom business logic; typically the reducer performs aggregation, summation, and similar functionality. The reducer produces the final output, which it writes to HDFS.
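A matching reducer sketch for the word-count example is shown below (again, the class name WordCountReducer is illustrative). It receives each word together with all the counts emitted for that word by the mappers, performs the summation, and writes the aggregated total as the final output to HDFS.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal word-count reducer: input = (word, list of counts) merged from all
// mappers; output = (word, total count), written to HDFS as the final result.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();   // aggregation / summation logic
        }
        total.set(sum);
        context.write(key, total);   // final output, written to HDFS
    }
}
```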
Map and reduce are the two stages of processing, and they run one after the other: only after all mappers complete their processing does the reducer start. Although each block is present at 3 different locations by default (because of replication), the framework assigns only 1 mapper per block, so a single mapper processes one particular block out of its 3 replicas. The output of every mapper goes to every reducer in the cluster, i.e. each reducer receives input from all the mappers. The framework signals the reducer once all mapper output has been produced, and only then does the reducer process the data.
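The two stages are wired together in a driver class, sketched below under the same word-count assumptions (class name WordCountDriver and the input/output paths passed as arguments are illustrative). The framework then runs all map tasks first and starts reduce processing only after they have finished.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job, registers the mapper and reducer, and points
// the job at its HDFS input and output paths.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```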
The output of each mapper is divided into partitions by the partitioner, and each partition goes to a particular reducer based on the partitioning condition. Hadoop works on the key-value principle: the mapper and the reducer both receive their input as key-value pairs and write their output in the same form.
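As a rough sketch of how such a condition can be expressed, the custom partitioner below routes words by their first letter (the class name FirstLetterPartitioner and the a–m split are made-up examples; Hadoop's default HashPartitioner instead uses the key's hash code modulo the number of reducers).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner: decides which reducer each intermediate (key, value)
// pair from the mappers is sent to.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        if (numReduceTasks == 0 || word.isEmpty()) {
            return 0;
        }
        // Words starting with a-m go to one partition, the rest to another.
        char first = Character.toLowerCase(word.charAt(0));
        return (first <= 'm' ? 0 : 1) % numReduceTasks;
    }
}
```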