Hadoop RecordReader - Simplified Learning

RecordReader

MapReduce has a simple model of data processing: the inputs and outputs of the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:

map: (K1, V1) →  list (K2, V2)  
reduce: (K2, list(V2)) → list (K3, V3)
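To make the general form concrete, here is a minimal word-count sketch in plain Java, without any Hadoop classes. The class name MapReduceShape and the in-memory run() driver are assumptions made for illustration only; in a real job the framework does the grouping and shuffling between the two functions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceShape {
    // map: (K1, V1) -> list(K2, V2). Here K1 = offset, V1 = line, K2 = word, V2 = 1.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> (K3, V3). Here K3 = word, V3 = total count.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new AbstractMap.SimpleEntry<>(word, sum);
    }

    // Hypothetical stand-in for the framework: map each record, group by key, reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline that ended the line
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

The (offset, line) input pair used here is the same kind of key-value pair that a line-oriented RecordReader hands to the mapper.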

The client running the job calls getSplits() on the InputFormat to compute the splits for each file and sends them to the JobTracker, which uses their storage locations to schedule map tasks to process them on the TaskTrackers. On the TaskTracker, the map task passes its split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.

The Hadoop RecordReader works on the data within the boundaries defined by the InputSplit and creates key-value pairs for the mapper. The “start” is the byte position in the file where the RecordReader should start generating key-value pairs, and the “end” is where it should stop reading records. The RecordReader keeps reading from the InputSplit until the whole split has been consumed.
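As a rough illustration of the “start” and “end” positions, the sketch below cuts a file into consecutive byte ranges. It is plain Java, not Hadoop code: the names SplitSketch, Split and computeSplits are hypothetical, and Hadoop's own file-based input formats apply additional rules (block locations, the size of the last split) that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Hypothetical stand-in for a file split: the byte range [start, end) of one file.
    static final class Split {
        final long start;
        final long end;

        Split(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    // Cut a file of fileLength bytes into consecutive ranges of at most splitSize bytes.
    static List<Split> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(start + splitSize, fileLength)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250-byte file with 100-byte splits
        System.out.println(computeSplits(250, 100)); // [[0, 100), [100, 200), [200, 250)]
    }
}
```

Note that the ranges are purely byte-based, so a boundary can fall in the middle of a record. Dealing with that is the job of the RecordReader, not of the split.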

How does RecordReader work in Hadoop?

A RecordReader is little more than an iterator over records, and the map task uses one to generate the record key-value pairs that it passes to the map function. We can see this in the mapper’s run() function.

public void run(Context context) throws IOException, InterruptedException {
   setup(context);
   // Ask the context (backed by the RecordReader) for records until none are left
   while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
   }
   cleanup(context);
}

After setup() runs, nextKeyValue() is called repeatedly on the context to populate the key and value objects for the mapper. The key and value are retrieved from the RecordReader by way of the context and passed to the map() method to do its work. Each input to the map function, a key-value pair (K, V), is processed according to the logic in the map code. When the reader reaches the end of its input, nextKeyValue() returns false.
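The iterator-like contract can be tried out without a cluster. The following is a minimal sketch in plain Java: SimpleRecordReader and ListReader are hypothetical types that only borrow the three method names used in the loop above, and the "map" step here simply upper-cases each value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RunLoopSketch {
    // Hypothetical, cut-down reader interface with the method names used by run().
    interface SimpleRecordReader<K, V> {
        boolean nextKeyValue(); // advance to the next record; false when none are left
        K getCurrentKey();
        V getCurrentValue();
    }

    // Reader over an in-memory list of lines; key = line index, value = line text.
    static final class ListReader implements SimpleRecordReader<Integer, String> {
        private final Iterator<String> it;
        private int key = -1;
        private String value;

        ListReader(List<String> lines) {
            this.it = lines.iterator();
        }

        public boolean nextKeyValue() {
            if (!it.hasNext()) {
                return false;
            }
            key++;
            value = it.next();
            return true;
        }

        public Integer getCurrentKey() {
            return key;
        }

        public String getCurrentValue() {
            return value;
        }
    }

    // Same shape as run(): set up, pull records until nextKeyValue() is false, clean up.
    static List<String> run(SimpleRecordReader<Integer, String> reader) {
        List<String> output = new ArrayList<>(); // plays the role of setup()
        while (reader.nextKeyValue()) {
            // plays the role of map(): emit "key:VALUE"
            output.add(reader.getCurrentKey() + ":" + reader.getCurrentValue().toUpperCase());
        }
        return output; // cleanup() would run here
    }

    public static void main(String[] args) {
        System.out.println(run(new ListReader(Arrays.asList("foo", "bar")))); // [0:FOO, 1:BAR]
    }
}
```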

A RecordReader usually stays within the boundaries created by the InputSplit when generating key-value pairs, but this is not mandatory. A custom implementation can read data outside of the InputSplit, although this is generally discouraged.

Types of RecordReader

The RecordReader instance is defined by the InputFormat. By default, Hadoop uses TextInputFormat to convert data into key-value pairs. Two commonly used RecordReaders are:

1. LineRecordReader

LineRecordReader is the default RecordReader, provided by TextInputFormat. It treats each line of the input file as a new value, and the associated key is the line's byte offset in the file. If the split is not the first one, LineRecordReader always skips the first line (or partial line) in the split. At the other end, it reads one line past the split boundary, provided more data is available, that is, provided the split is not the last one.
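The skip-first-line and read-one-line-past-the-end rules together ensure that every line is read exactly once, even when a split boundary falls in the middle of a line. The sketch below simulates this over an in-memory byte array. It is not Hadoop's implementation: LineSplitSketch, readSplit and readAll are hypothetical names, splits are taken as [start, end) byte ranges, and lines are assumed to end in a single '\n'.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineSplitSketch {
    // Index of the first byte after the line that contains position 'from'.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    // Read the records belonging to split [start, end); each record is "offset:line".
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: drop the first (possibly partial) line.
            // The reader of the previous split is responsible for it.
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // A line that starts at or before 'end' is ours, even if it runs past 'end'.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
            records.add(pos + ":" + line);
            pos = next;
        }
        return records;
    }

    // Read the whole input split by split and concatenate the records.
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            all.addAll(readSplit(data, start, Math.min(start + splitSize, data.length)));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "one\ntwo\nthree\nfour\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 7));   // [0:one, 4:two]
        System.out.println(readSplit(data, 7, 14));  // [8:three, 14:four]
        System.out.println(readSplit(data, 14, 19)); // []
    }
}
```

In this example the last split produces no records at all, because its only line starts exactly on the boundary and is therefore consumed by the previous split's reader.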

2. SequenceFileRecordReader

SequenceFileRecordReader reads data as specified by the header of a sequence file.

For line-based input read by LineRecordReader, there is a maximum size allowed for a single line to be processed. This value can be set using the parameter below.

conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

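A line longer than this maximum (the default is 2,147,483,647, that is, Integer.MAX_VALUE) is skipped. In newer Hadoop releases the same setting goes by the name mapreduce.input.linerecordreader.line.maxlength.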
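The effect of the limit can be pictured with a small plain-Java sketch. It only mirrors the behaviour described above (over-long lines are dropped rather than truncated); keepShortLines is a hypothetical helper, and the exact boundary comparison used inside Hadoop is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MaxLengthSketch {
    // Keep lines of at most maxLength characters; longer lines are dropped, not truncated.
    static List<String> keepShortLines(List<String> lines, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.length() <= maxLength) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ok", "this line is too long", "fine");
        System.out.println(keepShortLines(lines, 5)); // [ok, fine]
    }
}
```

Lowering the limit from its default can be a useful safeguard against corrupted input in which a missing newline would otherwise produce one enormous record.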
