HDFS Architecture and working

Hadoop Distributed File System (HDFS) is a highly reliable distributed storage system, best known for its fault tolerance and high availability.

What is Hadoop HDFS?

HDFS stores very large files across a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure, and it provides high throughput by serving data access in parallel.

Apache HDFS, or Hadoop Distributed File System, is a block-structured file system in which each file is divided into blocks of a predetermined size (128 MB by default). These blocks are stored across a cluster of one or more machines.

Apache Hadoop HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.

Note:- Increasing the block size may lead to underutilization of the cluster, while drastically decreasing it produces a huge amount of metadata that overburdens the NameNode.
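
The block size can be set cluster-wide or chosen per file at creation time. The following minimal sketch uses the standard Hadoop Java client (org.apache.hadoop.fs.FileSystem); the path and block sizes are hypothetical, and it assumes fs.defaultFS points at a reachable cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default block size (128 MB, expressed in bytes).
            conf.set("dfs.blocksize", "134217728");
            FileSystem fs = FileSystem.get(conf);

            // The block size can also be chosen per file at creation time:
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(
                new Path("/demo/large-file.bin"),       // hypothetical path
                true, 4096, (short) 3, 256L * 1024 * 1024); // 256 MB blocks
            out.close();
            fs.close();
        }
    }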

NameNode:-

The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly available server that manages the file system namespace and controls client access to files. It holds the metadata about the data on the DataNodes, essentially a table mapping each file to its blocks and each block to the DataNodes that store it.

Note: A client's request always goes first to the NameNode, which responds with the metadata the client needs to access the data on the DataNodes.
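
To make this concrete, the sketch below asks the NameNode for a file's metadata and block locations using the standard Hadoop Java client; no DataNode is contacted by either call. The path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetadataLookup {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/demo/large-file.bin"); // hypothetical path

            // Both calls are answered by the NameNode from its metadata.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }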

DataNode:-

DataNodes are the slave nodes in HDFS; the data actually resides on them. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system that is not of high quality or high availability.

HDFS Assumptions and Goals:-

Hardware failure:

Hardware failure is the norm rather than the exception. An HDFS instance consists of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, each susceptible to failure, some component is always non-functional. A core architectural goal of HDFS is therefore quick, automatic fault detection and recovery.

Streaming data access:

HDFS applications need streaming access to their datasets. Hadoop HDFS is designed mainly for batch processing rather than interactive use. The emphasis is on high throughput of data access rather than low latency. Typical workloads, such as log analysis, scan entire datasets, so HDFS is optimized for sustained sequential reads.

Large datasets:

HDFS works with large datasets. In standard practice, a file in HDFS ranges from gigabytes to terabytes in size. The architecture of HDFS is therefore designed for storing and retrieving huge amounts of data: it should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and handle tens of millions of files in a single instance.

Simple coherency model:

HDFS follows a write-once-read-many access model for files. Once a file is created, written, and closed, it cannot be edited in place. This simplifies data coherency and enables high-throughput data access. MapReduce-based applications and web crawlers fit this model perfectly. Appending to existing files is supported in later Hadoop versions, but arbitrary in-place updates are not.
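
As a small illustration, the sketch below appends a record to an existing, closed file via the standard FileSystem.append API (supported since Hadoop 2.x); the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path log = new Path("/demo/events.log"); // hypothetical path

            // Files cannot be edited in place, but a closed file can be
            // appended to; the existing contents are never rewritten.
            FSDataOutputStream out = fs.append(log);
            out.writeBytes("new event record\n");
            out.close();
            fs.close();
        }
    }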

Moving computation is cheaper than moving data:

If an application performs its computation near the data it operates on, it is much more efficient than when the computation is done far away, and this effect grows with the size of the dataset. Moving computation increases the overall throughput of the system and minimizes network congestion. The assumption, therefore, is that it is better to move computation closer to the data than to move data to the computation.

Portability across heterogeneous hardware and software platforms:

HDFS is designed to be easily portable from one platform to another, which encourages its widespread adoption as a platform for dealing with large sets of data.

Failure management of DataNodes:-

Replication Management:

HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, and it is configurable; with the default, each block is replicated three times and stored on different DataNodes.

Therefore, if we store a 128 MB file in HDFS using the default configuration, we end up occupying 384 MB (3 × 128 MB) of raw storage, as the block is replicated three times and each replica resides on a different DataNode.
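
The replication factor can be set cluster-wide or adjusted per file after creation. A minimal sketch with the standard Hadoop Java client follows; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default; 3 is the out-of-the-box value.
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(conf);

            // Replication can also be changed per file; the NameNode then
            // schedules extra copies (or deletions of surplus replicas).
            fs.setReplication(new Path("/demo/large-file.bin"), (short) 2);
            fs.close();
        }
    }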

Erasure coding:-

It is an alternative to the traditional data replication method mentioned above. Erasure coding is a data protection technique used in distributed storage systems, including Hadoop, to ensure fault tolerance and data durability. It involves dividing data into smaller blocks and generating additional blocks called parity blocks.

In erasure coding, parity blocks are calculated using mathematical algorithms such as XOR or Reed-Solomon codes. These parity blocks contain redundant information that is derived from the original data blocks. The purpose of parity blocks is to provide fault tolerance and enable data recovery in the event of node or block failures.

By distributing the redundant information across different nodes or storage devices, erasure coding allows for efficient reconstruction of lost or damaged data. The number of parity blocks generated depends on the specific erasure coding scheme and the desired level of fault tolerance. The presence of parity blocks enables the system to reconstruct the original data by combining the remaining data blocks and the parity blocks.
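
HDFS 3.x uses Reed-Solomon schemes such as RS(6,3), but the core idea can be seen with single-parity XOR. The following toy sketch (an illustration, not HDFS code) builds one parity block from three data blocks and rebuilds a lost block from the survivors:

    public class XorParityDemo {
        public static void main(String[] args) {
            // Three toy "data blocks" of four bytes each.
            byte[] d1 = {1, 2, 3, 4};
            byte[] d2 = {5, 6, 7, 8};
            byte[] d3 = {9, 10, 11, 12};

            // Parity block: bitwise XOR of all data blocks.
            byte[] parity = new byte[4];
            for (int i = 0; i < 4; i++) {
                parity[i] = (byte) (d1[i] ^ d2[i] ^ d3[i]);
            }

            // Suppose d2 is lost: rebuild it from the survivors plus parity.
            byte[] recovered = new byte[4];
            for (int i = 0; i < 4; i++) {
                recovered[i] = (byte) (d1[i] ^ d3[i] ^ parity[i]);
            }
            System.out.println(java.util.Arrays.equals(recovered, d2)); // true
        }
    }

Note the storage arithmetic: with XOR, three data blocks are protected by one extra block (33% overhead); 3x replication of the same data would cost 200% overhead.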

Compared to traditional replication techniques, erasure coding offers higher storage efficiency because it requires less overhead. The redundancy provided by parity blocks allows for data recovery even when multiple blocks or nodes become unavailable. This makes erasure coding an effective approach for achieving data durability and fault tolerance in distributed storage systems like Hadoop.

Heartbeat:

Each DataNode sends a heartbeat to the NameNode every 3 seconds by default (dfs.heartbeat.interval). If the NameNode stops receiving heartbeats from a DataNode for a configurable timeout (about 10.5 minutes with default settings), it marks that DataNode as dead: no new I/O is directed to it, and the blocks it held are re-replicated elsewhere. A heartbeat carries information such as the DataNode's capacity and free disk space; the full list of blocks a DataNode holds is sent separately in block reports (described below).
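
The dead-node timeout is derived from two configurable properties; the small sketch below simply evaluates the standard formula with the default values:

    public class HeartbeatTimeout {
        public static void main(String[] args) {
            // Defaults of the two properties that govern dead-node
            // detection (both configurable in hdfs-site.xml):
            long heartbeatIntervalMs = 3_000;   // dfs.heartbeat.interval
            long recheckIntervalMs   = 300_000; // dfs.namenode.heartbeat.recheck-interval

            // The NameNode declares a DataNode dead after:
            long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
            System.out.println(timeoutMs / 1000.0 / 60 + " minutes"); // 10.5
        }
    }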

Fault Tolerance:

If a DataNode goes down, the effective replication factor of its blocks drops. To restore the configured replication factor, the NameNode creates additional copies of the data blocks that resided on the dead DataNode.

Failure management of NameNode:-

The NameNode was a single point of failure in Hadoop version 1; that is no longer the case since Hadoop 2.0. If a NameNode fails, we lose all of the block mapping information. Therefore, to avoid downtime, we need the latest block mapping information available at all times, and the following two things help achieve this:

Two important metadata files (fsimage and edit logs) and the Secondary NameNode

FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of HDFS, including the mapping of files to blocks; in other words, it is a snapshot of the in-memory file system metadata at a given moment. (Block-to-DataNode locations are not persisted in the fsimage; the NameNode rebuilds them from block reports at startup.)

EditLog is a transaction log that records every change to the HDFS namespace, such as the addition of a new block, a replication change, or a deletion. It records the changes made since the last FsImage was created.

The fsimage and edit logs are stored in a location accessible to both the NameNode and the Secondary NameNode. Merging the fsimage with the edit logs yields the latest fsimage, but this merge is a compute-heavy process.

Secondary NameNode:

The task of merging (checkpointing) the fsimage and edit logs is performed by the Secondary NameNode. Once the process completes, the new fsimage replaces the old one and the edit log is rolled, so it again contains only the changes made since the last checkpoint. By default, this checkpoint runs every hour (dfs.namenode.checkpoint.period) or after one million uncheckpointed transactions, whichever comes first.
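
The merge itself is conceptually simple: replay the edit log over the snapshot. A highly simplified sketch follows (not Hadoop's implementation; here the namespace is modeled as a plain map and an edit as a made-up op/path/value triple):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CheckpointSketch {
        // Simplified: the namespace as path -> metadata entries,
        // and an edit as an (op, path, value) triple.
        record Edit(String op, String path, String value) {}

        static Map<String, String> checkpoint(Map<String, String> fsimage,
                                              List<Edit> editLog) {
            Map<String, String> merged = new HashMap<>(fsimage);
            // Replaying the edit log over the snapshot yields the latest
            // fsimage; this replay is the work the checkpointer does.
            for (Edit e : editLog) {
                switch (e.op()) {
                    case "ADD"    -> merged.put(e.path(), e.value());
                    case "DELETE" -> merged.remove(e.path());
                }
            }
            return merged;
        }
    }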

Despite its name, the Secondary NameNode is not a hot standby: if the NameNode fails, the SNN does not automatically become the active NameNode, and checkpointing simply stops. It is the responsibility of the Hadoop admin to bring up a new NameNode from the latest checkpointed fsimage (losing any edits made after that checkpoint), or to rely on the high-availability setup described next.

Quorum Journal Manager (QJM):-

It is a component that enables high availability for the Namenode. QJM is a distributed journal that stores edits (changes made to the file system namespace) in a highly available manner. The QJM consists of a set of JournalNodes (JNs) that replicate the edits across a set of nodes to ensure fault tolerance.

The Namenode in HDFS can use the QJM to maintain a consistent namespace even if the primary Namenode fails. When a Namenode writes edits to the QJM, it sends them to a majority of the JournalNodes (i.e., a quorum) in the cluster. Once a majority of the JournalNodes have acknowledged the edits, the Namenode can consider them to be safely stored. This ensures that the edits are durable and available even if some of the JournalNodes fail.

In the event of a primary Namenode failure, a standby Namenode can read the edits from the QJM and apply them to its own namespace, ensuring that it has an up-to-date view of the file system metadata. The QJM acts as a shared log for both the active and standby Namenodes, allowing them to maintain a consistent view of the file system namespace.
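
The essential rule is that an edit counts as durable once a majority of JournalNodes acknowledge it. The sketch below is illustrative only (the JournalNode interface here is a made-up stand-in for the real RPC endpoint) and shows just that quorum logic:

    import java.util.List;
    import java.util.concurrent.*;

    public class QuorumWriteSketch {
        // Hypothetical stand-in for a JournalNode RPC endpoint.
        interface JournalNode {
            boolean writeEdit(String edit); // returns true on ack
        }

        // An edit is durable once a majority of JournalNodes ack it.
        static boolean quorumWrite(List<JournalNode> journals, String edit)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(journals.size());
            CompletionService<Boolean> cs = new ExecutorCompletionService<>(pool);
            for (JournalNode jn : journals) {
                cs.submit(() -> jn.writeEdit(edit)); // write in parallel
            }
            int acks = 0, needed = journals.size() / 2 + 1;
            for (int i = 0; i < journals.size() && acks < needed; i++) {
                try {
                    if (cs.take().get()) acks++;
                } catch (ExecutionException ignored) {
                    // A failed JournalNode just doesn't count toward quorum.
                }
            }
            pool.shutdownNow();
            return acks >= needed;
        }
    }

With 3 JournalNodes, the quorum is 2, so the write succeeds even if one JournalNode is down.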

Namenode scalability:-

Namenode Federation:-

Namenode Federation is a feature in Hadoop Distributed File System (HDFS) that allows multiple independent Namenodes to be deployed in a single cluster, each managing a subset of the namespace. This enables the cluster to scale horizontally and handle a larger number of files and directories, while also improving the reliability and availability of the Namenodes.

In NameNode Federation, each independent NameNode manages a portion of the file system namespace and the block locations for files in that namespace. This distributes the metadata across multiple NameNodes, which reduces the load on any single NameNode and improves overall performance. The NameNodes are independent and do not coordinate with each other; clients typically use a client-side mount table (ViewFs) so that the separate namespaces appear as a single file system.
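
The client-side view can be pictured as a mount table mapping namespace prefixes to NameNodes. A toy sketch follows (the addresses and mount points are made up; real deployments configure this via ViewFs rather than code like this):

    import java.util.Map;

    public class FederationRoutingSketch {
        // Hypothetical mount table: each namespace prefix is owned by
        // one independent NameNode (addresses are invented).
        static final Map<String, String> MOUNT_TABLE = Map.of(
            "/user", "hdfs://nn1.example.com:8020",
            "/data", "hdfs://nn2.example.com:8020",
            "/tmp",  "hdfs://nn3.example.com:8020");

        // Resolve a path to the NameNode responsible for it, much like
        // a ViewFs client-side mount table does.
        static String resolve(String path) {
            return MOUNT_TABLE.entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElseThrow(() ->
                    new IllegalArgumentException("unmounted: " + path));
        }

        public static void main(String[] args) {
            System.out.println(resolve("/data/logs/2024")); // nn2
        }
    }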

Federation also improves fault isolation: because the NameNodes are independent, the failure of one NameNode makes only its portion of the namespace unavailable while the others continue to serve theirs. (Federation by itself does not provide failover between NameNodes; automatic failover requires the HA setup described above.) Additionally, Federation allows rolling upgrades of the cluster, where one NameNode is upgraded at a time while the others continue to operate normally, reducing the impact of upgrades on overall cluster availability.

Overall, Namenode Federation is a key feature in HDFS that allows for improved scalability, reliability, and availability, making it a popular choice for handling large-scale distributed file systems.

Rack awareness mechanism:-

What is rack awareness in Hadoop HDFS?

The process of making Hadoop aware of which machine belongs to which rack, and how those racks are connected within the cluster, is what defines rack awareness. In a Hadoop cluster, the NameNode keeps the rack IDs of all the DataNodes and uses this rack information when choosing DataNodes for data blocks. In simple terms, knowing how the DataNodes are distributed across the racks, i.e., knowing the cluster topology, is called rack awareness in Hadoop. Rack awareness is important because it ensures data reliability and helps recover data in case of a rack failure.

Why rack awareness?

Hadoop keeps multiple copies of all data that is present in HDFS. If Hadoop is aware of the rack topology, each copy of data can be kept in a different rack. By doing this, in case an entire rack suffers a failure for some reason, the data can be retrieved from a different rack.

Replication of data blocks across multiple racks in HDFS via rack awareness follows the default replica placement policy, which states that no more than one replica is placed on any one node, and no more than two replicas are placed on the same rack.
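
A simplified sketch of that default policy follows (illustrative only, not Hadoop's BlockPlacementPolicyDefault): replica 1 goes on the writer's node, replica 2 on a node in a different rack, and replica 3 on a different node in the same rack as replica 2. This satisfies both constraints above while limiting cross-rack traffic to one transfer per block.

    import java.util.List;

    public class ReplicaPlacementSketch {
        // Hypothetical node descriptor: hostname plus its rack ID, which
        // the NameNode learns from the configured topology mapping.
        record Node(String host, String rack) {}

        static List<Node> place(Node writer, List<Node> cluster) {
            Node first = writer; // replica 1: the writer's own node
            Node second = cluster.stream() // replica 2: a different rack
                .filter(n -> !n.rack().equals(first.rack()))
                .findFirst().orElseThrow();
            Node third = cluster.stream() // replica 3: second's rack, other node
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .findFirst().orElseThrow();
            return List.of(first, second, third);
        }
    }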

Block report:-

A block report contains the metadata of the data blocks that a DataNode holds. DataNodes in a Hadoop cluster periodically send this block report to the NameNode. Block reports help make the system fault tolerant: if the NameNode detects a corrupted or missing block, it takes the necessary action to restore its replication.

HDFS Read and Write Operation:-

Write Operation:

When a client wants to write a file to HDFS, it communicates with the NameNode for metadata. The NameNode responds with the number of blocks, the DataNodes on which each block (and its replicas) will be stored, and other details. Based on this information from the NameNode, the client interacts directly with the DataNodes.

The client first sends block A to DataNode 1, along with the IPs of the two other DataNodes where replicas are to be stored. When DataNode 1 receives block A from the client, it copies the block to DataNode 2 in the same rack; since both DataNodes are in the same rack, the block is transferred via the rack switch. DataNode 2 then copies the block to DataNode 4 on a different rack; since these DataNodes are in different racks, the block is transferred via an out-of-rack switch. As each DataNode receives the block, it sends an acknowledgement back along the pipeline, and the DataNodes report the new block to the NameNode. The same process is repeated for each block of the file.
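
From the client's point of view, all of this pipelining is hidden behind an output stream. A minimal sketch with the standard Hadoop Java client (hypothetical path):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // create() contacts the NameNode for block allocations; the
            // bytes written below stream straight to the first DataNode
            // in each block's pipeline, which forwards them onward.
            FSDataOutputStream out = fs.create(new Path("/demo/report.txt"));
            out.writeBytes("hello hdfs\n");
            out.close(); // close() waits for pipeline acknowledgements
            fs.close();
        }
    }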

Read Operation:

To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the locations of the DataNodes containing the blocks. After receiving the DataNode locations, the client interacts directly with the DataNodes.

The client starts reading data in parallel from the DataNodes, based on the information received from the NameNode. The data flows directly from the DataNodes to the client. Once the client or application has received all the blocks of the file, it combines them back into the original file.
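
Again, the block-by-block mechanics are hidden behind an input stream. A minimal sketch (hypothetical path):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class HdfsReadDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // open() asks the NameNode for block locations; the stream
            // then reads each block directly from a nearby DataNode.
            FSDataInputStream in = fs.open(new Path("/demo/report.txt"));
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            fs.close();
        }
    }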