Hadoop Map Only Job
A Map-Only job in Hadoop is a job in which the mapper does all the work and no task is performed by the reducer; the mapper's output is the final output. MapReduce is the data processing layer of Hadoop. It processes large volumes of structured and unstructured data stored in HDFS, and it processes that data in parallel by dividing the submitted job into a set of independent sub-tasks. In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce.
Map – Map is the first phase of processing, where we specify all the complex logic. It takes a set of data and converts it into another set of data, breaking each individual element into tuples (key-value pairs).
Reduce – Reduce is the second phase of processing, where we specify lightweight processing such as aggregation or summation. It takes the output of the map as its input and combines those tuples based on the key.
In the classic word-count example, there are two sets of parallel processes, map and reduce. In the map process, the input is first split to distribute the work among all the map nodes, and then each word is identified and mapped to the number 1, producing key-value pairs called tuples.
Suppose the first mapper node receives three words: lion, tiger, and river. Its output will be three key-value pairs with three different keys, each with the value set to 1, and the same process is repeated on every node. These tuples are then passed to the reducer nodes, where the partitioner comes into action: it carries out shuffling so that all tuples with the same key are sent to the same node. In the reduce process, what essentially happens is an aggregation of values, or rather an operation on values, that share the same key.
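To make this flow concrete, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative choices, not a prescribed implementation; the map phase emits (word, 1) tuples and the reduce phase sums the values that share a key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: break each line into words and emit (word, 1) tuples.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // e.g. ("lion", 1), ("tiger", 1), ("river", 1)
      }
    }
  }

  // Reduce phase: after sort and shuffle, all values for one key arrive together;
  // sum them to get the count per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}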
Now consider a scenario where we only need to perform an operation on each record and no aggregation is required. In such a case, we prefer a 'Map-Only job' in Hadoop. In a Hadoop Map-Only job, the mapper does all the work on its InputSplit and no work is done by the reducer; the map output is the final output.
We can achieve this by calling job.setNumReduceTasks(0) in the driver. This sets the number of reducers to 0, so the mapper alone performs the complete task, as shown in the sketch below.
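Here is a minimal Map-Only job sketch, assuming a hypothetical UpperCaseMapper that simply transforms each input line; the class and path names are illustrative. The key line is job.setNumReduceTasks(0), which removes the reduce phase so mapper output is written straight to HDFS.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

  // Transformation-only mapper (hypothetical example): no aggregation, so no reducer is needed.
  public static class UpperCaseMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      outValue.set(value.toString().toUpperCase());
      context.write(NullWritable.get(), outValue);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "map-only uppercase");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(UpperCaseMapper.class);

    // Zero reducers turns this into a Map-only job:
    // mapper output goes directly to HDFS and no sort/shuffle phase runs.
    job.setNumReduceTasks(0);

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}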
Advantages of a Map-Only Job
Between the map and reduce phases there is a sort-and-shuffle phase, which is responsible for sorting the keys in ascending order and then grouping the values that share the same key. This phase is very expensive, so if the reduce phase is not required we should avoid it: skipping the reduce phase eliminates the sort-and-shuffle phase as well. This also reduces network congestion, because during shuffling the mapper output travels to the reducers, and when the data size is huge, a large amount of data has to move across the network.
In a normal job, the mapper's output is written to local disk before being sent to the reducer, but in a Map-Only job this output is written directly to HDFS, which further saves time and reduces cost. Also, a Map-Only job needs no partitioner or combiner, which makes the process faster.