Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
1. Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
2. Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A data set is typically generated or copied from source, and then various analyses are performed on that data set over time.
Each analysis will involve a large proportion, if not all, of the data set, so the time to read the whole data set is more important than the latency in reading the first record.
3. Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.