Fear of a Hadoop Planet

(Planet image by David A. Aguilar (CfA), Some rights reserved)

Despite the Hadoop hype machine crunching away, not everyone is fond of that little yellow elephant. In fact, some fear it. But why should the cute mammal and the innovative data processing technology that it represents raise anxiety levels? Everyone has their reasons.

Hadoop Is Confusing 

According to a recent Gartner webinar, many people just don’t know what they’re supposed to do with Hadoop. Maybe the exaggerated hype or the fact that it’s a highly technical product has scared people off.

Let’s get the story straight. Hadoop is not the be-all and end-all solution to every data problem. What Hadoop does is store large files across a cluster of commodity servers and process that data in parallel with MapReduce. It can, for example, be used as part of an ETL process or as a storage facility. Having said that, various projects built on Hadoop, such as HBase, add further capabilities, and Hadoop’s latest major version opens it up to new applications such as Tez for interactive data and Storm for streaming data. We are likely to see it used in new and exciting ways very soon.
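
To make the storage side concrete, here is a minimal sketch of loading a file into HDFS from the command line; the file and directory names are hypothetical and assume a running cluster.

    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -put weblogs.csv /user/demo/    # the local file is split into blocks and replicated
    hdfs dfs -ls /user/demo                  # confirm the file now lives in the cluster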

Hadoop Is Too New

Relative to other technologies, Apache Hadoop is the new data store on the block. While it was released in 2005, Oracle SQL has been around since 1978, Microsoft SQL Server since 1989, and MySQL since 1995. Some fear that Hadoop is not ready for production yet because it is still under active development and has yet to fully mature the way those other technologies have.

Although Hadoop is young, not everyone is afraid to use it. Companies like Facebook and Yahoo, as well as more than half of the Fortune 50, use Hadoop. Several enterprise-level Hadoop distributions, such as Hortonworks and Cloudera, are available, and so is high-level support. Hadoop itself is starting to mature, with a new major version released just a few months ago. It may be named after a child’s toy, but it certainly isn’t one.

My Data Is Not Big Enough

Just how big is Big Data? Rumor has it that if you do not have petabytes (10^15 bytes) of data, then there is no point in using Hadoop.

‘Tis true that Hadoop is designed for handling huge volumes of data at great velocity and variety. However, you do not need petabytes of data to use Hadoop to your advantage. Processing even just a hundred gigabytes with Hadoop is affordable on the public cloud. The data can be processed only when required, on a pay-as-you-go basis, so there is no need to buy and maintain any machines. Prices on Amazon Elastic MapReduce start from $0.075 an hour.
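
To put that rate in perspective, a hypothetical ten-instance cluster running for two hours at the quoted starting price would come to 10 × 2 × $0.075 = $1.50, a rounding error next to the cost of buying and maintaining your own hardware.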

Actually, companies consider Big Data at sizes much smaller than petabytes. According to a survey conducted by NewVantage Partners, only 7% of the 50+ top executives from government and Fortune 500 firms consider using Big Data because they need to analyze more than 1 PB of data. The other 93% have different needs, like analyzing streaming data or data from diverse sources. Altogether, 28% of the survey’s participants consider using Big Data despite having less than 1 PB of data.

Even if you do not have much data right now, your company will hopefully grow, and so will your data. Soon enough your relational database will not be able to take the batch processing heat, especially considering the explosion of social data and the up-and-coming Internet of Things. Once you have terabytes of data you will need to scale, and scale again when you reach the petabyte range, so you might as well scale to Hadoop now. Because Hadoop scales horizontally, you can start off with a small cluster and add machines or cloud instances as you grow.

Can’t Replace My IT Department

IT departments are trained in the old ways of the SQL Jedi. Getting them to implement and maintain Hadoop requires training, or hiring new personnel who are in high demand and expensive.

This is true if you run Hadoop on-premise. The maintenance hassle can be avoided by using Hadoop-as-a-Service providers such as Amazon’s Elastic MapReduce, though it still takes some technical skill to configure and set up a Hadoop environment. Other solutions, like (shameless self-promotion) our platform, automate setup and configuration and enable cluster creation, monitoring, and proactive and reactive maintenance with just a few clicks.

Hadoop Is Hard to Learn

Hadoop processes data with MapReduce, which is programmed in Java, yet the most common language for querying and processing data (Excel aside) is SQL. To program MapReduce jobs one must learn not only Java, but also an approach to processing data that is quite different from relational database querying. Once again, the choice seems to be re-education or re-hiring, neither of which is appealing.

This fear can be partially resolved. Pig Latin, a high-level querying language, and Hive, an SQL-like querying language, are both available for processing data on Hadoop without a single line of Java code. They still require some training, though, and there is no way to avoid learning how MapReduce works. Certain platforms, ours included, ease the learning curve by providing a user interface for processing data without any code at all.
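
As a rough illustration of how far you can get without Java, here is a minimal Hive sketch run from the shell. The visits table and its columns are hypothetical, and the sketch assumes the table has already been defined over data stored in HDFS.

    hive -e "
      SELECT country, COUNT(*) AS visit_count
      FROM visits
      GROUP BY country
      ORDER BY visit_count DESC
      LIMIT 10;
    "
    # Hive compiles the query into MapReduce jobs behind the scenes; no Java is written.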

(Fear sign image by Sistak, modified by Xplenty, Some rights reserved)


Hadoop Is Not Secure

Companies have sensitive data. They want to make sure it is fully secure and control who can see it, what they can do with it, and how many resources they can use. Some don’t believe Hadoop provides these features.

This is wrong. Hadoop provides user authentication via Kerberos and authorization via file system permissions. Hadoop 2 also introduces HDFS federation, which divides a cluster into several namespaces, thus isolating different sections of the cluster and preventing users from messing with data that does not belong to them. The Apache Accumulo project, a distributed key/value store that runs on top of Hadoop, steps it up a notch by providing access control per cell of data, perfect for security control freaks.
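
For the file system side of that story, here is a minimal sketch of locking down an HDFS directory with standard permissions; the path, user, and group names are hypothetical.

    hdfs dfs -chown analyst:finance /data/finance
    hdfs dfs -chmod 750 /data/finance    # owner: full access, group: read and list, others: none
    hdfs dfs -ls /data                   # verify the new owner and mode
    # With Kerberos enabled, these checks apply to strongly authenticated users.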

Hadoop Has a Single Point of Failure

Hadoop uses a server called the NameNode that keeps track of all the files and where their blocks are stored across the cluster. If the NameNode goes bye-bye, the entire cluster and all its files become unusable.

Indeed, this was a major weakness in the previous Hadoop version. To prepare for NameNode catastrophes, the solution was to continuously back it up to another machine and change the DNS name to point to the backup machine in case of a failure. Fear not, for this is resolved in Hadoop 2 (the YARN release) with new high-availability features, including an up-to-date standby NameNode and automatic failover.
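
For the curious, here is a minimal sketch of what the high-availability tooling looks like from the command line, assuming a cluster whose two NameNodes are registered under the hypothetical service IDs nn1 and nn2.

    hdfs haadmin -getServiceState nn1    # reports "active" or "standby"
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -failover nn1 nn2       # hand the active role over to the standby
    # With automatic failover enabled (coordinated through ZooKeeper), this handover
    # happens on its own when the active NameNode dies.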

Hadoop Is Too Expensive

Some people think that Big Data costs Big Bucks. Whereas relational databases seem affordable, Hadoop seems like a luxury that only enterprises can afford.

Actually, the opposite is true: Hadoop is one of the most cost-effective Big Data solutions around. In the past, only large enterprises could afford the proprietary IBM or EMC racks needed to store and process Big Data, servers that can cost millions of dollars. Hadoop, however, runs on commodity machines that companies of all sizes can afford. Hadoop services on the cloud bring prices down even further by providing clusters on pay-as-you-go models.

Hadoop Sucks for Small Files

The fear is that Hadoop cannot handle small files in the kilobyte or megabyte range. And you know what? It’s true. If you store lots of files that are much smaller than the HDFS block size, 64 MB by default, Hadoop’s performance suffers: the NameNode keeps metadata for every file in memory, and MapReduce typically launches a separate task per file, so millions of tiny files mean both memory pressure and scheduling overhead.

This is totally solvable. The simple fix, at least for small text files, is to concatenate them into a few huge files. If the files cannot be unified (e.g. image files), Hadoop Archives (HAR files) can do the trick. HAR files add another filesystem layer on top of HDFS that packs many small files into a handful of archives via the command line. The archived files can then be accessed directly using har:// URLs.
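
Here is a minimal sketch of the HAR approach; the paths are hypothetical.

    # Pack everything under /user/logs/2013 into a single archive (this runs a MapReduce job):
    hadoop archive -archiveName logs-2013.har -p /user/logs/2013 /user/archives
    # The archived files can then be listed and read through the har:// scheme:
    hdfs dfs -ls har:///user/archives/logs-2013.har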

Open Source Software Is Unreliable

Hadoop is an open source project. Some folks still believe the old myths that open source is not secure, not reliable, and lacks support.

Obviously this is not true. Open source projects like the Apache web server and MySQL have proven themselves many times over in terms of security and reliability. Actually, some of the major contributors to Hadoop are big commercial companies like Yahoo, Facebook, and even Microsoft (see, for example, The Stinger Initiative). As mentioned above, enterprise Hadoop distributions and support are also available.

Summary

Quite a few fears prevent companies from using Hadoop, from lack of knowledge about what Hadoop actually does, through IT concerns, all the way to mistrust of open source. But many of these fears are based on superstition, and those that are not are resolvable. Maybe it’s time for a reality check as to why your organization resists using Hadoop and whether these fears are actually founded.

