Hadoop Reducer
In Hadoop, the Reducer takes the output of the Mapper (intermediate key-value pairs) and processes each of them to generate the output. The output of the Reducer is the final output, which is stored in HDFS. Usually, in the Hadoop Reducer, we perform aggregation- or summation-style computations.
What is Hadoop Reducer?
The Hadoop Reducer processes the output of the mapper. After processing the data, it produces a new set of output, which HDFS stores as the final result. The Hadoop Reducer takes the set of intermediate key-value pairs produced by the mapper as input and runs a reducer function on each of them. This (key, value) data can be aggregated, filtered, and combined in a number of ways for a wide range of processing. The Reducer first processes the intermediate values for a particular key generated by the map function and then generates the output (zero or more key-value pairs).
All values for a given key go to the same reducer, though a single reducer typically handles many keys. Reducers run in parallel since they are independent of one another. The user decides the number of reducers; by default, the number of reducers is 1.
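To make this concrete, here is a minimal sketch of a summation-style Reducer using the org.apache.hadoop.mapreduce API. The class name SumReducer and the Text/IntWritable types are illustrative choices, not mandated by the framework:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A minimal sum reducer: for each key, it adds up all the values
// the mappers emitted for that key and writes one (key, total) pair.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();          // aggregate all values for this key
        }
        result.set(sum);
        context.write(key, result);      // a reducer may emit zero or more pairs
    }
}
```

The framework calls reduce() once per key, passing all of that key's values together, which is what makes per-key aggregation like this possible.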
Phases of Reducer in MapReduce
There are 3 phases of the Reducer in Hadoop MapReduce. Let's discuss each of them one by one.
1. Shuffle Phase
In this phase, the sorted output of the mappers is the input to the Reducer. The framework fetches the relevant partition of the output of all the mappers over HTTP.
2. Sort Phase
In this phase, the framework sorts the Reducer's input by key, merging together the values for the same key that arrive from different mappers. The shuffle and sort phases occur concurrently.
3. Reduce Phase
In this phase, after shuffling and sorting, the reduce task aggregates the key-value pairs. In the older org.apache.hadoop.mapred API, the OutputCollector.collect() method writes the output of the reduce task to the filesystem; a short sketch of this appears below. The Reducer output is not re-sorted.
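For reference, here is a minimal sketch of how OutputCollector.collect() is used in the older org.apache.hadoop.mapred API; the class name and types are again illustrative:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reduce phase in the older org.apache.hadoop.mapred API: the framework
// calls reduce() once per key with an iterator over all of its values,
// and OutputCollector.collect() writes each result pair to the filesystem.
public class SumReducerOldApi extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum)); // written to the job's output path
    }
}
```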
How many Reducers in Hadoop?
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. A good rule of thumb for the number of reducers is 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all reducers launch immediately and can start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reducers and then launch a second wave, achieving much better load balancing.
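As an illustration of this rule of thumb, the sketch below assumes a hypothetical 10-node cluster with 8 containers per node and picks the 0.95 factor, so all reducers run in a single wave:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster: 10 nodes, 8 containers per node.
        int nodes = 10;
        int containersPerNode = 8;

        // 0.95 factor -> 76 reducers: all launch in one wave.
        // 1.75 factor -> 140 reducers: faster nodes run a second wave.
        int reducers = (int) (0.95 * nodes * containersPerNode);

        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(reducers);   // overrides the default of 1
        System.out.println("Reducers: " + reducers);
    }
}
```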
Increasing the number of reducers
- Increases framework overhead.
- Improves load balancing.
- Lowers the cost of failures.