Apache Hadoop is an open-source software framework developed to work with massive amounts of data. It does this by distributing portions of the data across many computers and replicating much of it for redundancy. This software and its computing model make handling massive amounts of data faster than traditional mainframes or supercomputers.
The Apache Hadoop framework achieves this through distributed storage and processing over clusters of commodity hardware: ordinary, widely available computers. Each cluster consists of multiple processing and storage units at one location and can be viewed as a single unit. Different clusters can reside in different locations. For example, your worksite might have a cluster of five computers, a site in a different state might have a cluster of four, and so on. This allows thousands of computers to be involved.
The software assumes high rates of hardware failure and handles potential issues by replicating data across multiple nodes, or computers. By default, HDFS keeps three copies of each data block: typically two on separate nodes in the same rack and one on a node in a different rack, so that no single node or rack failure can destroy the data.
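This rack-aware placement idea can be sketched in plain Python. The function `place_replicas`, the rack names, and the node names below are illustrative assumptions, not part of Hadoop's API; the logic only mirrors the default placement strategy described above:

```python
import random

def place_replicas(nodes_by_rack, writer_rack, replication=3):
    """Toy sketch of HDFS-style rack-aware replica placement.

    nodes_by_rack: dict mapping rack name -> list of node names.
    Returns a list of (rack, node) placements.
    """
    # First replica: a node on the writer's own rack.
    first = (writer_rack, random.choice(nodes_by_rack[writer_rack]))
    # Second replica: a node on a different rack, for rack-failure tolerance.
    other_racks = [r for r in nodes_by_rack if r != writer_rack]
    second_rack = random.choice(other_racks)
    second = (second_rack, random.choice(nodes_by_rack[second_rack]))
    # Third replica: a different node on the same remote rack as the second.
    remaining = [n for n in nodes_by_rack[second_rack] if n != second[1]]
    third = (second_rack, random.choice(remaining)) if remaining else second
    return [first, second, third][:replication]
```

The design choice is a balance: writing two replicas to one remote rack costs less network traffic than using three racks, while still surviving the loss of any single rack.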
Components and Function
Apache Hadoop consists of four key components, with multiple add-on subprograms available. The four core modules are:
- Hadoop Common: The libraries and utilities that the other Hadoop modules depend on.
- Hadoop Distributed File System (HDFS): A distributed file system, written in Java, that stores data across many computers in different locations. Because reads and writes are spread over many machines, the overall bandwidth is much higher than a single machine or server can provide, making for fast processing.
- Hadoop MapReduce: A Hadoop-specific implementation of the MapReduce model for large-scale data processing. It consists of the Map function, which filters and sorts data, and the Reduce function, which summarizes the Map output for the user.
- Hadoop YARN (Yet Another Resource Negotiator): The part of Hadoop that manages the cluster's computing resources and schedules applications to run on them.
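The Map and Reduce steps described above can be sketched with a toy word count in plain Python. This is not Hadoop's Java API, just the processing model it implements; the function names `map_phase` and `reduce_phase` are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word, then sort by key,
    # mimicking the shuffle/sort step between Map and Reduce.
    pairs = [(word, 1) for doc in documents for word in doc.split()]
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    # Reduce: collapse each group of identical keys into a summary count.
    return {key: sum(v for _, v in group)
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data", "big cluster"]))
# counts == {"big": 2, "cluster": 1, "data": 1}
```

In a real cluster, many Map tasks run in parallel on different blocks of the input, and the sorted pairs are partitioned across many Reduce tasks; the sequential sketch shows only the data flow.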
In Hadoop v.2, scheduling and monitoring moved to YARN, with a resource manager handling scheduling and an application manager handling monitoring. In the previous version, the Job Tracker in MapReduce scheduled each job and monitored it for failure, latency, and so on, while the Task Tracker ran the job and reported its status back to the Job Tracker. In later versions of Hadoop, MapReduce performs only the processing once YARN has done the scheduling.
How It Works
Hadoop works through two main systems, HDFS and MapReduce, which together provide the following five services:
- HDFS stores the data used by the Hadoop program. Its Name Node (master node) tracks files and manages the file system; it holds the metadata, such as file names, permissions, and block locations, but not the file data itself.
- The Data Node stores the data in blocks in HDFS and is a slave node to the master.
- A Secondary Name Node handles the metadata checkpoints of the file system.
- The Job Tracker receives requests for MapReduce processing from the user.
- The Task Tracker serves as a slave node to the job tracker. It takes the job and associated code and applies it to the relevant file.
The job tracker and task tracker make up the MapReduce engine. Each MapReduce engine contains one job tracker that receives MapReduce job requests from the user and sends them to the appropriate task tracker. The goal is to keep the task on the node closest to the data in process.
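The "keep the task close to the data" rule can be sketched as follows. The function `assign_task` and the node names are hypothetical, not part of Hadoop's API; the sketch only shows the data-locality preference described above:

```python
def assign_task(block_locations, tracker_nodes):
    """Toy sketch of data-local scheduling: prefer a task tracker
    running on a node that already holds a replica of the block.

    block_locations: set of node names holding the block's replicas.
    tracker_nodes: list of node names with a free task tracker slot.
    Returns the chosen node.
    """
    for node in tracker_nodes:
        if node in block_locations:
            return node          # data-local: no network copy needed
    return tracker_nodes[0]      # fall back to any free node
```

For example, if the block lives on nodes `n2` and `n3` and trackers `n1` and `n2` have free slots, the job goes to `n2`, so the code moves to the data rather than the data to the code.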
Hadoop v.2 implements YARN between the HDFS and MapReduce.
Hadoop v.3 supports multiple name nodes, removing the name node as a potential single point of failure.
Apache Hadoop makes it possible to process massive amounts of data quickly. Some common uses include:
- Analysis of life-threatening risks
- Identification of warning signs of security breaches
- Prevention of hardware failure
- Understanding what people think about your company