HDFS Rack Awareness
A rack is a collection of machines that are physically located together in a data center and connected through a top-of-rack switch. In Hadoop, a rack is a physical group of slave machines (DataNodes) placed at a single location for data storage. A single location can hold multiple racks.
In a large Hadoop cluster, to reduce network traffic while reading or writing HDFS files, the NameNode chooses DataNodes that are on the same rack as the client making the read/write request, or on a nearby rack. The NameNode obtains this rack information by maintaining the rack ID of each DataNode. Choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
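The idea of "closer" DataNodes can be made concrete with a small sketch. It assumes a tree-shaped topology in which every node has a path such as /dc1/rack1/dn1, and it measures distance as the number of hops up to the closest common ancestor and back down. The host names, the rack table, and the function names are hypothetical and chosen only for illustration; in a real cluster the host-to-rack mapping is typically supplied by an administrator, for example through a script referenced by the net.topology.script.file.name property.

```python
# Hypothetical host-to-rack table; a real cluster would get this
# from an administrator-supplied mapping.
RACK_BY_HOST = {
    "dn1": "/dc1/rack1",
    "dn2": "/dc1/rack1",
    "dn3": "/dc1/rack2",
    "dn4": "/dc2/rack1",
}


def node_path(host):
    """Return the full path of a node as a list, e.g. ['dc1', 'rack1', 'dn1']."""
    return RACK_BY_HOST[host].strip("/").split("/") + [host]


def distance(host_a, host_b):
    """Count the hops from each node up to their closest common ancestor."""
    a, b = node_path(host_a), node_path(host_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def closest_replica(reader, replicas):
    """Pick the replica holder with the smallest distance to the reader."""
    return min(replicas, key=lambda host: distance(reader, host))


if __name__ == "__main__":
    for other in ("dn1", "dn2", "dn3", "dn4"):
        print("dn1 ->", other, "distance", distance("dn1", other))
    print("closest to dn1:", closest_replica("dn1", ["dn4", "dn3", "dn2"]))
```

With this table, a node on the same rack is at distance 2, a node on another rack in the same data center is at distance 4, and a node in another data center is at distance 6, so a reader on dn1 would prefer a replica on dn2.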
Rack awareness means knowing the cluster topology, or more specifically how the DataNodes are distributed across the racks of a Hadoop cluster. A default Hadoop installation assumes that all DataNodes belong to the same rack. Here is a sample representation of replication with rack awareness.
When a client is ready to load a file into the cluster, the file content is divided into blocks (128 MB each by default), and the client then consults the NameNode to get the addresses of the DataNodes that will hold the default 3 replicas of every block. When the replicas are placed on DataNodes, the key rule is: “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is called the “Replica Placement Policy”.
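The splitting and placement rule above can be sketched in a few lines. This is a simplified illustration, not the NameNode's actual algorithm: it only enforces "two copies in one rack, the third in a different rack" and ignores factors a real cluster also weighs, such as where the writer runs and how loaded each node is. The topology dictionary and the function names are assumptions made for the example.

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted above


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size_bytes + block_size - 1) // block_size


def place_replicas(topology, rng=None):
    """Choose 3 replica locations: two nodes on one rack, one on another.

    topology maps a rack ID to the list of DataNode names on that rack.
    Returns a list of (rack, node) tuples.
    """
    rng = rng or random.Random()
    # Racks that can hold two replicas on distinct nodes.
    pair_racks = [rack for rack, nodes in topology.items() if len(nodes) >= 2]
    for pair_rack in rng.sample(pair_racks, len(pair_racks)):
        other_racks = [r for r, nodes in topology.items() if r != pair_rack and nodes]
        if other_racks:
            single_rack = rng.choice(other_racks)
            first, second = rng.sample(topology[pair_rack], 2)
            third = rng.choice(topology[single_rack])
            return [(pair_rack, first), (pair_rack, second), (single_rack, third)]
    raise ValueError("need one rack with two DataNodes and a second rack with one")


if __name__ == "__main__":
    # Hypothetical two-rack cluster.
    topology = {"/rack1": ["dn1", "dn2", "dn3"], "/rack2": ["dn4", "dn5"]}
    file_size = 300 * 1024 * 1024  # a 300 MB file
    for block in range(block_count(file_size)):
        print("block", block, "->", place_replicas(topology))
```

Because the replicas of each block always span two racks, the loss of any single rack still leaves at least one copy of every block.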
Why Rack Awareness?
In Hadoop, rack awareness is required for the following reasons:
- To improve data availability and reliability.
- To improve the performance of the cluster.
- To make better use of network bandwidth.
- To avoid losing data if an entire rack fails, even though a rack failure is far less likely than a node failure.
- To keep bulk data flows within a rack when possible.
- It relies on the assumption that in-rack traffic has higher bandwidth and lower latency.