Introduction to Hadoop and its ecosystem technologies

Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop clusters multiple computers so that massive datasets can be analyzed in parallel, and therefore more quickly.

In short, Apache Hadoop is a framework for solving big data problems.

What is big data?

Big data is data that arrives in greater variety, in increasing volumes, and with higher velocity. These characteristics are known as the three V’s of big data.

Traditional systems are not capable of storing and processing such huge amounts of data, and this is where Hadoop comes to the rescue.

Big data system requirements:

Store – Store massive amounts of data.

Process – Process the data promptly.

Scale – Scale easily as data grows.

Horizontal vs Vertical Scaling:-

https://padmanabha.hashnode.dev/vertical-vs-horizontal-scaling

Hadoop Architecture

At its core, Hadoop has two major layers: the processing/computation layer (MapReduce) and the storage layer (the Hadoop Distributed File System, or HDFS).

MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for the efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, which is an Apache open-source framework.
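
To make the model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The mapper emits a (word, 1) pair for every word and the reducer sums the counts per word; the input and output paths are hypothetical command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // partial aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The reducer is also registered as the combiner, so counts are partially aggregated on each mapper before the shuffle phase.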

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences are significant. HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, provides high-throughput access to application data, and is suitable for applications with large datasets.
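
As a small illustration of how applications talk to HDFS, the sketch below uses the HDFS Java client (the FileSystem API) to write a file and read it back. The NameNode address hdfs://localhost:9000 and the path /tmp/hello.txt are assumptions for a single-node setup, not values from this article.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // In a real cluster fs.defaultFS comes from core-site.xml; set here only for illustration.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt");

      // Write a small file; HDFS splits files into blocks and replicates them across DataNodes.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back.
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        System.out.println(reader.readLine());
      }
    }
  }
}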

Apart from the two core components mentioned above, the Hadoop framework also includes the following two modules:

Hadoop Common – Java libraries and utilities required by other Hadoop modules.

Hadoop YARN – A framework for job scheduling and cluster resource management.

Evolution of Hadoop:

2003 – Google released a paper on GFS (the Google File System), which describes how to store large datasets.

2004 – Google released a paper on MapReduce, which describes how to process large datasets.

2006 – Yahoo took these papers and implemented them as HDFS (the Hadoop Distributed File System) and MapReduce.

Hadoop 1.0 = HDFS (for distributed storage) + MapReduce (for distributed processing)

2009 – Hadoop came under the Apache Software Foundation and became open source.

2013 – Apache released Hadoop 2.0 to provide major performance enhancements.

Hadoop 2.0 = HDFS + YARN (for resource management) + MapReduce

YARN – Yet Another Resource Negotiator – is mainly responsible for resource management in the cluster.

Hadoop Ecosystem technologies:-

Hive:- A data warehouse tool built on top of Hadoop for data query and analysis. It gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
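
As an illustration of the SQL-like interface, the sketch below runs a HiveQL query over JDBC against HiveServer2. The connection URL, credentials, and the sales table are assumptions for the example, not part of this article.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver

    // HiveServer2 listens on port 10000 by default; user/password depend on your setup.
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")) {
      while (rs.next()) {
        System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}

Behind the scenes, Hive compiles such queries into distributed jobs that run on the cluster.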

Sqoop:- A command-line application for transferring data between relational databases and Hadoop.

https://padmanabha.hashnode.dev/apache-sqoop

Pig:- A high-level scripting language (Pig Latin) for data manipulation. It is used to clean data and to transform unstructured data into a structured format.

HBase:- A column-oriented NoSQL database that runs on top of HDFS.
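
A brief sketch of the HBase Java client API is shown below. It assumes a table named users with a column family info already exists; it writes one cell and reads it back.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read the same cell back
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}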

Oozie:- A workflow scheduler system to manage Apache Hadoop jobs.

Spark:- A distributed, general-purpose, in-memory compute engine. Spark is written in Scala; however, it officially supports Java, Scala, Python, and R.
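
For comparison with the MapReduce example above, here is a minimal word count using Spark's Java RDD API in local mode. The input and output paths are hypothetical arguments, and the same job would run on a YARN cluster by changing the master setting.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // local[*] runs Spark in-process on all cores; on a cluster the master would be YARN.
    SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]); // input path (local or HDFS)

      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
          .reduceByKey((a, b) -> a + b);                                 // sum counts per word

      counts.saveAsTextFile(args[1]); // output directory (must not exist yet)
    }
  }
}

Because Spark keeps intermediate data in memory, iterative and multi-stage jobs typically run much faster than the equivalent MapReduce pipelines.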