How does Hadoop work?

How exactly does Hadoop work?

Hadoop 2.0 has three main layers: HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and the application layer (on which the MapReduce engine sits), all operating on master and slave nodes.

HDFS:

HDFS is a distributed, scalable, and portable file system that stores very large files across the cluster. On the HDFS layer of a cluster, there is a name node service residing on the master node, and one or more data node services on the slave node(s). The name node is in charge of keeping track of where the data is, and the data nodes are in charge of storing and retrieving the data.
Files stored in HDFS are automatically replicated and spread across multiple machines. This redundancy is what makes Hadoop reliable, so if and when a slave node or hard drive goes down, these files have already been replicated to other slaves, and life goes on.
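To make that a little more concrete, here’s a minimal sketch (the paths and file names are just placeholders) of how a client might copy a file into HDFS and ask the name node how many replicas it keeps, using Hadoop’s Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // which tell the client where the name node lives.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; the name node decides which
        // data nodes end up holding its blocks and replicas.
        Path target = new Path("/data/scores.tsv");
        fs.copyFromLocalFile(new Path("scores.tsv"), target);

        // Ask the name node how many copies of each block exist.
        FileStatus status = fs.getFileStatus(target);
        System.out.println("Replication factor: " + status.getReplication());

        fs.close();
    }
}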

YARN:

Before Hadoop 2.0, the data processing layer on top of HDFS, called MapReduce, controlled both resource allocation and the execution of applications (map/reduce jobs). Hadoop 2.0 separated the two tasks (resource management and application execution) in order to achieve better scalability and to allow other types of applications to execute on a Hadoop cluster. A master daemon called the ResourceManager manages resources at the cluster level, and a slave daemon called the NodeManager manages resources at the node level. The application itself is left in user space, and MapReduce is just one application implementation. Each application has an ApplicationMaster that runs on one of the slave nodes and requests resources from the ResourceManager.
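That split even shows up in a cluster’s configuration. A bare-bones yarn-site.xml might look something like this (the hostname is a placeholder); it tells every NodeManager where the ResourceManager lives and enables the shuffle service that MapReduce relies on:

<configuration>
  <!-- Every NodeManager reports to this ResourceManager. -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master.example.com</value>
  </property>
  <!-- Auxiliary service that serves map output to the reducers. -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>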

YARN’s ResourceManager focuses solely on scheduling, which lets it manage large clusters more effectively. Together with the per-node slave daemon (the NodeManager), it forms the data-computation framework: the ResourceManager determines which resources go to which applications in the system, and when they get them.

The ApplicationMaster is a framework-specific library in charge of negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
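As a rough illustration of that division of labor, here’s a sketch that uses YARN’s Java client API to ask the ResourceManager about the nodes it manages and the applications it has scheduled (each of which has its own ApplicationMaster):

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterReport {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // The ResourceManager tracks every NodeManager in the cluster...
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }

        // ...and every running application, each with its own ApplicationMaster.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " : " + app.getName());
        }

        yarn.stop();
    }
}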

Applications:

With Hadoop 2.0 and YARN came a bevy of new applications that allow for more task-specific processing. If you want to perform batch processing, there is MapReduce. For interactive processing, you have Tez. HBase is used for online processing. Storm was developed to tackle the ever-increasing need for stream processing. Facebook and LinkedIn needed graph processing, so Giraph came along. Need in-memory processing? Spark is your guy.

MapReduce:

One of the applications working on top of YARN is the MapReduce engine, which also uses master and slave nodes. It was the core of Hadoop 1.0, and is still a main component of 2.0.

In Hadoop 1.0, the JobTracker on the master node delegated tasks to TaskTrackers on the slave nodes. The JobTracker took jobs, divided them into tasks, and assigned each task to a slave node sitting close to the data that task needed. This is known as data locality, and it reduces network traffic. It was, and still is, part of what makes Hadoop so scalable (and so awesome). The TaskTrackers handled their respective tasks on each node and kept the JobTracker informed of each task’s status.

So what’s different about MapReduce since Hadoop 2.0 came out? As noted above, the MapReduce engine used to perform not only its self-titled duties but also resource management. Hadoop 2.0 takes that burden away, allowing MapReduce to focus on its specialty: executing map and reduce tasks close to the data, and doing so at an incredible scale.

The MapReduce model consists of the Map() and Reduce() functions.
The Map() function takes a chunk of data and transforms each piece of it into a key-value pair.
The Reduce() function then takes all the values that share a key and combines them into a simpler result.
For example, let’s say we have all the score sheets for a baseball season and we’re going to calculate on base percentage using MapReduce. An excerpt from the original data would look like this:

Team       Player          Walks  Hits  Outs
Cardinals  Matt Carpenter  1      0     0
Cardinals  Matt Carpenter  0      1     0
Cardinals  Matt Carpenter  0      1     0
Cardinals  Matt Carpenter  0      1     0
Cardinals  Matt Carpenter  0      0     1

The map function is going to map these rows to player name and number of walks+hits:

<Matt Carpenter, 1>
<Matt Carpenter, 1>
<Matt Carpenter, 1>
<Matt Carpenter, 1>
<Matt Carpenter, 0>

The reduce function will reduce these values into on base percentage:

<Matt Carpenter, .800>
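To tie the example back to real code, here’s a rough sketch of what the mapper and reducer could look like with Hadoop’s Java MapReduce API. It assumes the score sheet rows are stored as tab-separated lines like the excerpt above, and the class names are made up for this example:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): one input row -> <player, 1> if he reached base, <player, 0> otherwise.
class OnBaseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text player = new Text();
    private final IntWritable reachedBase = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Columns: Team, Player, Walks, Hits, Outs
        String[] cols = line.toString().split("\t");
        if (cols.length < 5 || "Team".equals(cols[0])) {
            return; // skip the header or malformed rows
        }
        int walks = Integer.parseInt(cols[2]);
        int hits = Integer.parseInt(cols[3]);
        player.set(cols[1]);
        reachedBase.set(walks + hits > 0 ? 1 : 0);
        context.write(player, reachedBase);
    }
}

// Reduce(): all of one player's 0s and 1s -> <player, on base percentage>.
class OnBaseReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private final DoubleWritable obp = new DoubleWritable();

    @Override
    protected void reduce(Text player, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int timesOnBase = 0;
        int plateAppearances = 0;
        for (IntWritable value : values) {
            timesOnBase += value.get();
            plateAppearances++;
        }
        obp.set((double) timesOnBase / plateAppearances);
        context.write(player, obp);
    }
}

The mapper turns each plate appearance into a 0 or a 1, and the reducer simply averages those values per player.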

MapReduce is split between Map and Reduce because this lets us break the computation into parts that can easily be done in parallel on many nodes in the cluster. In our example, each node may hold score sheets for different players from different games. Map functions run on each of those nodes and emit the walks + hits for each row of the score sheets. All of a player’s values are then handled by one reduce function, which calculates his on base percentage.
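And here’s a minimal driver, sketched under the assumption that the input and output paths come in on the command line. It hands the whole job to YARN, which then schedules the map and reduce tasks across the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OnBaseJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "on base percentage");
        job.setJarByClass(OnBaseJob.class);

        job.setMapperClass(OnBaseMapper.class);
        job.setReducerClass(OnBaseReducer.class);

        // The map output (player, 0/1) differs from the final output (player, OBP).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to YARN and waits for the map and reduce tasks to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

From there, something like hadoop jar obp.jar OnBaseJob /data/scores /data/obp (the jar and path names are placeholders) kicks the whole thing off, and YARN decides where the tasks actually run.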

Whether you’re calculating on base percentage or scaling up to process terabytes of log files, MapReduce spreads the work across the cluster and brings the computation to the data, so the job gets done efficiently no matter what it is.