Hadoop Combiner
The Hadoop Combiner, also known as a “Mini-Reducer”, summarizes the Mapper output records that share the same key before they are passed to the Reducer.
When a MapReduce job runs on a large dataset, the Mappers generate large chunks of intermediate data. The framework then passes this intermediate data to the Reducer for further processing, which can cause heavy network congestion. The Hadoop framework provides a function known as the Combiner that plays a key role in reducing this congestion.
The primary job of the Combiner is to process the output data from the Mapper before it is passed to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.
MapReduce program without Combiner
In the above diagram, no combiner is used. The input is split between two mappers, which together generate 9 key/value pairs. These 9 pairs of intermediate data are sent directly to the reducer, and sending them consumes network bandwidth (bandwidth is the rate at which data can be transferred between two machines). The larger the data, the longer the transfer to the reducer takes. If we place a Hadoop combiner between the mapper and the reducer, the combiner aggregates the intermediate data (9 key/value pairs) on the mapper side before it is sent to the reducer, and generates 4 key/value pairs as output.
MapReduce program with Combiner in between Mapper and Reducer
The reducer now needs to process only the 4 key/value pairs generated by the 2 combiners, instead of the 9 pairs emitted by the mappers. With less data to transfer and fewer records to merge, the reducer produces the final output faster, which improves overall performance.
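The effect on the amount of shuffled data can be sketched in a few lines of plain Python. This is a minimal simulation, not Hadoop code: the word lists are made up (they are not the data from the diagram), and `map_words` and `combine` are hypothetical helper names. The input was chosen only so that the counts match the 9 and 4 key/value pairs discussed above.

```python
from collections import defaultdict

def map_words(words):
    """Mapper: emit one (word, 1) pair per input word."""
    return [(word, 1) for word in words]

def combine(pairs):
    """Combiner: sum the values of equal keys within ONE mapper's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical input splits, one per mapper (not taken from the diagram).
split_1 = ["cat", "dog", "cat", "dog", "cat"]   # 5 words
split_2 = ["dog", "fish", "dog", "fish"]        # 4 words

mapper_outputs = [map_words(split_1), map_words(split_2)]

# Without a combiner, every mapper output pair is sent to the reducer.
without_combiner = [pair for output in mapper_outputs for pair in output]

# With a combiner, each mapper's output is aggregated locally first.
with_combiner = [pair for output in mapper_outputs for pair in combine(output)]

print(len(without_combiner))  # 9
print(len(with_combiner))     # 4
print(with_combiner)          # [('cat', 3), ('dog', 2), ('dog', 2), ('fish', 2)]
```

Note that "dog" still appears twice in the combined output, once per mapper. A combiner only merges records within a single mapper's output; the final merge across mappers is still the reducer's job.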
Advantages of Combiner
- Using a combiner reduces the time taken to transfer data between the mapper and the reducer.
- A combiner improves the overall performance of the reducer.
- It decreases the amount of data that the reducer has to process.
Disadvantages of Hadoop combiner
- When Hadoop stores the key-value pairs on the local filesystem and runs the combiner later, this causes expensive disk I/O.
- MapReduce jobs cannot depend on the combiner being executed, because Hadoop does not guarantee that it will run.
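Because the combiner may be skipped, a job is only safe if its final output is the same whether the combiner runs or not. The following is a minimal sketch in plain Python rather than Hadoop code, with made-up data and hypothetical helper names (`reduce_sum`, `run_job`). It checks that a summing combiner leaves the result unchanged when it is applied zero, one, or two times.

```python
from collections import defaultdict

def reduce_sum(pairs):
    """Reducer: sum all values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(mapper_outputs, combiner_runs):
    """Simulate a job in which the combiner is applied `combiner_runs` times
    (0 = never) to each mapper's output before the reduce phase."""
    shuffled = []
    for output in mapper_outputs:
        for _ in range(combiner_runs):
            # The combiner reuses the reducer logic on one mapper's output.
            output = list(reduce_sum(output).items())
        shuffled.extend(output)
    return reduce_sum(shuffled)

# Hypothetical mapper outputs for a word count.
mapper_outputs = [
    [("cat", 1), ("dog", 1), ("cat", 1)],
    [("dog", 1), ("fish", 1), ("dog", 1)],
]

results = [run_job(mapper_outputs, runs) for runs in (0, 1, 2)]
print(results[0])                              # {'cat': 2, 'dog': 3, 'fish': 1}
print(results[0] == results[1] == results[2])  # True
```

Summation passes this check because it is associative and commutative. An operation such as averaging would not pass if it were used unchanged as a combiner, since an average of partial averages is generally not the overall average.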