HDFS Architecture

Apache HDFS (Hadoop Distributed File System) is a block-structured file system in which each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or more machines. The Apache Hadoop HDFS architecture follows a Master/Slave design, where a cluster consists of a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes). Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
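
To see this block structure concretely, you can ask HDFS how a file is split into blocks and where each block lives. A minimal sketch using the standard hdfs fsck tool (the path /user/demo/data.csv is a hypothetical example):

            hdfs fsck /user/demo/data.csv -files -blocks -locations

The output lists each block’s ID, its length, and the DataNodes holding its replicas.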

1. HDFS NameNode

NameNode is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly available server that manages the File System Namespace and controls clients’ access to files. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on the DataNodes only.

  • NameNode is the main central component of the HDFS architecture.
  • NameNode is also known as the Master node.
  • The HDFS NameNode stores metadata, i.e. the number of data blocks, file names, paths, block IDs, block locations, the number of replicas, and slave-related configuration. This metadata is kept in the master’s memory for faster retrieval.
  • Because the NameNode keeps the file system namespace metadata in memory for quicker response times, more memory is needed, and the NameNode should be deployed on reliable hardware.
  • NameNode maintains and manages the slave nodes and assigns tasks to them.
  • NameNode has knowledge of all the DataNodes containing data blocks for a given file.
  • NameNode coordinates with hundreds or thousands of DataNodes and serves the requests coming from client applications.
  • Two files, ‘FsImage’ and the ‘EditLog’, are used to store metadata information; both can be inspected offline, as sketched after this list.
    • FsImage: the snapshot of the file system when the NameNode started. It is an “image file” that contains the entire file system namespace and is stored as a file in the NameNode’s local file system. It also contains a serialized form of all the directories and file inodes in the file system; each inode is an internal representation of a file or directory’s metadata.
    • EditLogs: contains all the recent modifications made to the file system since the most recent FsImage. When the NameNode receives a create/update/delete request from a client, the request is first recorded in the edits file.
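
Both files can be examined with the standard offline image and edits viewers. A minimal sketch, assuming a metadata directory and transaction-ID suffixes like the ones Hadoop typically writes (the exact paths and numbers are hypothetical):

            hdfs oiv -p XML -i /data/dfs/name/current/fsimage_0000000000000000042 -o fsimage.xml
            hdfs oev -p xml -i /data/dfs/name/current/edits_0000000000000000043-0000000000000000050 -o edits.xml
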
Functions of NameNode
  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records the metadata of all the files stored in the cluster, e.g. the locations of stored blocks, the sizes of the files, permissions, the hierarchy, etc. The FsImage and EditLogs files are associated with this metadata.
  • It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live (see the command sketch after this list).
  • It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
  • The NameNode is also responsible for maintaining the replication factor of all the blocks.
  • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
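
The NameNode’s view of the cluster, built from these heartbeats and block reports, can be queried with the standard dfsadmin tool; a minimal sketch:

            hdfs dfsadmin -report

The report lists each DataNode, whether it is live or dead, its configured capacity, and its DFS usage.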
2. HDFS DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that is not of high quality or high availability. The DataNode is a block server that stores the data in a local file system such as ext3 or ext4.

  • DataNode is also known as the Slave node.
  • In the Hadoop HDFS architecture, DataNodes store the actual data.
  • DataNodes are responsible for serving read and write requests from the clients.
  • DataNodes can be deployed on commodity hardware.
  • Each DataNode sends information to the NameNode about the files and blocks stored on that node and responds to the NameNode for all file system operations.
  • When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
  • A DataNode is usually configured with a lot of hard disk space, because the actual data is stored there (see the configuration sketch after the next list).
Functions of DataNode in HDFS
  • These are the slave daemons or processes that run on each slave machine.
  • The actual data is stored on the DataNodes.
  • The DataNodes serve the low-level read and write requests from the file system’s clients.
  • Every DataNode sends a heartbeat message to the NameNode every 3 seconds to convey that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode dead and starts replicating its blocks on some other DataNode.
  • All DataNodes in the Hadoop cluster are synchronized so that they can communicate with one another and take care of
    • balancing the data in the system,
    • moving data to maintain high replication, and
    • copying data when required.
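
As referenced in the list above, a DataNode’s disk capacity comes from the local directories it is given for block storage. A minimal hdfs-site.xml sketch using the standard dfs.datanode.data.dir property (the two mount points are hypothetical):

<property>
   <name>dfs.datanode.data.dir</name>
   <value>/disk1/dfs/data,/disk2/dfs/data</value>
</property>

The DataNode spreads blocks across all listed directories, so adding disks adds capacity.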
3. HDFS Secondary NameNode

The NameNode holds the metadata for HDFS, such as namespace information, block information, etc. When in use, all this information is kept in main memory, but it is also stored on disk for persistent storage.

The NameNode stores this information on disk in two different files:
1. fsimage – the snapshot of the file system when the NameNode started.
2. Edit logs – the sequence of changes made to the file system after the NameNode started.

While the NameNode is in the active state, the edit logs grow continuously in size. The edit logs are applied to the fsimage only when the NameNode restarts, to produce the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long period of time. In this situation we will encounter the following issues.

  • The edit log becomes very large, which makes it challenging to manage.
  • A NameNode restart takes a long time, because a lot of changes have to be merged.
  • In the case of a crash, we will have lost a huge amount of metadata, since the fsimage is very old.

So, to overcome these issues, we need a mechanism that keeps the edit log at a manageable size and keeps the fsimage up to date, so that the load on the NameNode is reduced. This is very similar to a Windows restore point, which allows us to take a snapshot of the OS so that if something goes wrong, we can fall back to the last restore point.

Secondary NameNode

The Secondary NameNode helps to overcome the above issues by taking over the NameNode’s responsibility of merging the edit logs with the fsimage.

Like the NameNode, the Secondary NameNode also holds a namespace image and edit logs. At a regular interval (one hour by default), it copies the namespace image and the edit logs from the NameNode, merges the edit logs into the namespace image, and copies the result back to the NameNode, so that the NameNode always has a fresh copy of the namespace image. Now suppose that at some instant the NameNode goes down and becomes corrupt: we can then restart some other machine with the namespace image and the edit log that we have on the Secondary NameNode, and thus prevent a total failure.
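
How often this checkpoint happens is controlled by standard hdfs-site.xml properties; a minimal sketch stating Hadoop’s usual defaults explicitly (a checkpoint every hour, or after one million uncheckpointed transactions, whichever comes first):

<property>
   <name>dfs.namenode.checkpoint.period</name>
   <value>3600</value>
</property>
<property>
   <name>dfs.namenode.checkpoint.txns</name>
   <value>1000000</value>
</property>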

The Secondary NameNode needs almost the same amount of memory and CPU as the NameNode, so, like the NameNode, it is kept on a separate machine.

Functions of Secondary NameNode
  • The Secondary NameNode constantly reads the file system metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
  • It is responsible for combining the EditLogs with the FsImage from the NameNode.
  • It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.

Hence, the Secondary NameNode performs regular checkpoints in HDFS, and it is therefore also called the CheckpointNode. Things have changed over the years, especially with Hadoop 2.x: the NameNode is now highly available with a failover feature. The Secondary NameNode is now optional, and a Standby NameNode is used for the failover process. The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.

In Hadoop 2.x, there are two types of NameNodes in a cluster:

  1. Active NameNode
  2. Standby NameNode

The Active NameNode is responsible for all client operations in the cluster, while the Standby NameNode simply acts as a slave. On its own, the Active NameNode is a single point of failure for the HDFS cluster: when it goes down, the file system goes offline. The optional Standby NameNode can be hosted on a separate machine, so that when the Active NameNode goes down, the Standby NameNode takes over as the Active NameNode and the file system remains safe. This process is called failover.
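
On an HA-enabled cluster, the standard hdfs haadmin tool shows which NameNode is active and can trigger a manual failover; a minimal sketch (the NameNode IDs nn1 and nn2 are hypothetical names taken from the cluster’s HA configuration):

            hdfs haadmin -getServiceState nn1
            hdfs haadmin -failover nn1 nn2

The first command prints “active” or “standby”; the second hands the active role over from nn1 to nn2.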

Note

1. Heartbeat

The Heartbeat is the signal that the DataNodes send to the NameNode periodically.

Why is the Heartbeat needed?

In Hadoop, the NameNode and the DataNodes run on physically separate machines, so how does the NameNode know that a particular DataNode is down? To make the NameNode aware of its status, each DataNode does the following.

  • The NameNode and the DataNodes communicate using the Heartbeat.
  • DataNodes send a Heartbeat to the NameNode every 3 seconds by default to indicate their presence, i.e. to indicate that they are alive.
  • If the NameNode does not receive a Heartbeat for a particular amount of time (10 minutes by default), the NameNode considers that DataNode/slave to be a dead node.
  • You can set the heartbeat interval in the hdfs-site.xml file by configuring the parameter dfs.heartbeat.interval, as shown below.
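
A minimal hdfs-site.xml sketch that states the default 3-second interval explicitly (the value is in seconds):

<property>
   <name>dfs.heartbeat.interval</name>
   <value>3</value>
</property>
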
2. Balancing

If one DataNode is down, some of the blocks become under-replicated. This is not the desired condition: the replication factor is 3, but some blocks are no longer replicated 3 times due to the DataNode failure. So we can balance the cluster. We do not need to do anything other than run a command; the HDFS balancer does not run in the background and has to be run manually. The balancer command is

            hdfs balancer [-threshold <threshold>] 

As soon as the NameNode receives the balancer command, it checks which DataNode is down. The NameNode then looks up in its metadata which blocks were stored on that node and issues instructions to the DataNodes to replicate those blocks: one more copy of each affected block is made on another DataNode. (For example, if the failed DataNode held only blocks B1 and B4, then only those blocks are re-replicated, from the replicas available on other nodes.) This restores the desired condition, i.e., all blocks are replicated as configured.

If, in some situation, the failed DataNode comes back up and runs fine, the blocks become over-replicated, which again is not the desired condition: in the previous step the NameNode created one more replica of each affected block, so there is now one copy too many. In that case the NameNode issues an instruction to a DataNode to delete one copy of each over-replicated block. It is not necessarily the newly created replica that is deleted; the choice depends on the situation, and replicas that are not being processed/accessed by clients are the ones deleted.
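
Whether the cluster has returned to the desired condition can be checked with the standard file system checker; a minimal sketch:

            hdfs fsck /

The summary it prints includes the counts of under-replicated and over-replicated blocks; both should drop to 0 once replication has settled.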

3. Replication

The replication factor is nothing but the number of copies of each HDFS block that should be kept (replicated) in the Hadoop cluster. Using the replication factor, we can achieve fault tolerance and high availability. Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. Similarly, in Hadoop, when one DataNode is down, clients can still access the data without any interruption or loss, because we have a copy of the same data on another DataNode.

The default replication factor is 3, and it can be configured as per our requirements: it can be decreased to 2 (less than 3) or increased (more than 3). We can configure the replication factor using “dfs.replication” in the hdfs-site.xml file.

Example

<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>
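
This property only sets the default for newly created files. The replication factor of an existing file can be changed with the standard setrep command; a minimal sketch (the path is a hypothetical example; -w waits until the target replication is reached):

            hdfs dfs -setrep -w 2 /user/demo/data.csv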

The ideal replication factor is considered to be 3 for the following reasons. To make HDFS fault tolerant, we have to consider two cases:

  1. DataNode failure
  2. Rack failure

So, if one DataNode goes down, another copy will be available on another DataNode. Suppose the second DataNode also goes down; the data will still be highly available, because we still have one copy of it stored on a different DataNode.

HDFS stores at least one replica on another rack, so in the case of an entire rack failure you can still get the data from a DataNode in another rack. The replication factor also depends on rack awareness, which states the rule “no more than one copy on the same node, and no more than two copies in the same rack”. With this rule, data remains highly available if the replication factor is 3. So, to make HDFS fault tolerant, the ideal replication factor is considered to be 3.
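
Rack awareness works only if Hadoop is told which rack each node belongs to, which is typically done with a topology script configured in core-site.xml. A minimal sketch using the standard net.topology.script.file.name property (the script path is hypothetical; the script maps a host name or IP address to a rack ID such as /rack1):

<property>
   <name>net.topology.script.file.name</name>
   <value>/etc/hadoop/conf/topology.sh</value>
</property>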
