What is Hadoop?

It used to be that the data we knew how to process was all the same.

Columns, fields, zeros and ones. Data was structured, nice and neat, and easy to read. Nowadays, though, data is flying in from a plethora of sources, at increasing speeds and in exploding volumes. Today's data has gotten a little too big for its britches, and plain old data has become Big Data.

Then, somewhere along the way, people figured out that data could be used for all sorts of things: reporting, predictive analysis, and automation, to name a few. The problem with these great ideas was that they were lapping the systems that supported the data, and new technologies were needed to keep pace.

Around this time, a couple of fellas, Doug Cutting and Mike Cafarella, created a new technology called Hadoop to support distribution for an open source search engine project known as Nutch.

What is Hadoop?

Hadoop is an open source project, derived from Google's white papers on MapReduce and the Google File System (GFS), that stores and processes mass quantities of data on clusters of affordable servers.

Hadoop combines a distributed file system, HDFS, which stores data across the aforementioned cluster nodes, with a computational model known as MapReduce, which spreads work across those nodes to deliver quick processing and redundancy.
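To make that concrete, here is a minimal sketch of the canonical MapReduce example, a word count, written against Hadoop's Java API. The map step runs on whichever nodes hold blocks of the input and emits (word, 1) pairs; the reduce step sums them per word. The class name and paths are illustrative, not from any particular cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs on whichever node holds a block of the input,
  // emitting (word, 1) for every word it sees.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups all counts for a given word
  // together, so the reducer just sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is typically launched with `hadoop jar wordcount.jar WordCount <input> <output>`, and the framework takes care of splitting the input and scheduling the map and reduce tasks across the cluster.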

Why all the hype?

In 2001, analyst Doug Laney (then at META Group, which later became part of Gartner Research) wrote 3-D Data Management: Controlling Data Volume, Velocity and Variety, defining what are now known as the 3 V's of Big Data. Hadoop addresses these V's, and in doing so it has changed the dynamic of large-scale data processing and computing. So what is it about Hadoop that has made it more than just a buzzword?

Well, for starters, Hadoop is incredibly scalable. Cluster nodes can be added or removed as needed, without changing data formats and without affecting the applications built on top.
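Part of why applications are unaffected is that client code addresses data by HDFS path, never by individual node. Here's a minimal sketch, with a hypothetical file path; HDFS works out which nodes actually hold the blocks:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    // Connection details come from the cluster config (core-site.xml);
    // the code never names individual DataNodes.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path; HDFS resolves it to blocks on whatever
    // nodes happen to be in the cluster today.
    Path file = new Path("/data/events/2014/01/events.log");
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(reader.readLine());
    }
  }
}
```

Because the application only ever sees the path, the cluster can grow or shrink underneath it without a single line of this code changing.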

Secondly, Hadoop is flexible. It can handle any type of data, from any number of sources. It can take data from multiple sources and aggregate it, filter it, select from it, you name it, transforming it in ways that enable deeper analyses than any one system could provide.
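As a sketch of what mixing sources can look like, Hadoop's MultipleInputs lets a single job read differently formatted inputs, each with its own mapper that normalizes records to a common key. The file formats and paths below are invented for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSources {

  // Hypothetical comma-separated source: "userId,action,timestamp"
  public static class CsvMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      context.write(new Text(fields[0]), new Text(fields[1]));
    }
  }

  // Hypothetical tab-separated source: "userId<TAB>action"
  public static class TsvMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[0]), new Text(fields[1]));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge sources");
    job.setJarByClass(MergeSources.class);
    // Each input path gets the mapper that understands its format;
    // both feed the same shuffle, keyed by user.
    MultipleInputs.addInputPath(job, new Path("/data/app.csv"),
        TextInputFormat.class, CsvMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/web.tsv"),
        TextInputFormat.class, TsvMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/out/merged"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```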

Third, Hadoop is fault tolerant, and not fault tolerant like your mom is, where you can do no wrong. In this case, if you lose a cluster node, its work is simply handed to another node as if nothing ever happened, and the job finishes without a hitch. This lets applications work with thousands of independent computers and petabytes of data.
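Much of that redundancy comes from block replication: HDFS keeps several copies of every block (three by default) on different nodes, so losing a node just shifts reads and tasks to a surviving replica. Here's a small sketch, with a hypothetical path, of inspecting and raising a file's replication factor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/important/results.dat"); // hypothetical path

    // How many copies of each block does HDFS currently keep?
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication factor: " + status.getReplication());

    // Ask for an extra replica of a file we really don't want to lose.
    fs.setReplication(file, (short) 4);
  }
}
```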

Have a look at our Hadoop-as-a-Service page, which breaks down some of Hadoop's pros and cons.