Mortal Kombat’s master of ice, Sub-Zero, and the living-dead, fire-breathing Scorpion are archenemies. As the story goes, Sub-Zero and his clan of assassin ninjas slaughtered their rival clan, to which Scorpion and his family belonged. Scorpion’s hatred drove him to rise from the Netherrealm to avenge his family’s death and kill Sub-Zero in the great tournament.
Back in the Big Data realm, (code) ninjas from two rival clans are slugging it out - Hadoop on-premise vs. Hadoop-as-a-Service (HaaS). With more and more companies leaving the former and joining the latter, is HaaS really the ultimate way to process Big Data? Let’s get in the ring and see.
Scalability

Whether on-premise or as a service, Hadoop scales horizontally by adding more machines to the cluster.
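To give a feel for what “adding more machines” involves on-premise, here is roughly what bringing a new worker online looks like on a vanilla Hadoop 2 cluster (a tarball install is assumed; the hostname is a placeholder):

```shell
# On the master: register the new worker (Hadoop 2.x tarball layout)
echo "worker-42.example.com" >> $HADOOP_HOME/etc/hadoop/slaves

# On the new worker: start the HDFS storage and YARN compute daemons
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
```

Of course, that’s only the software side - the machine itself still has to be bought, racked, and networked first, which is exactly where the two clans part ways below.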
It’s a tie
Elasticity

Elasticity means adapting to the workload in real time by adding or removing resources as needed. Unfortunately, on-premise Hadoop is not elastic at all - clusters can’t grow or shrink quickly when demand changes. Even if extra machines are kept on the side for peak hours, they sit idle the rest of the time, which means plenty of money goes down the drain.
On the other hand, Hadoop-as-a-Service is extremely elastic, with the ability to start or stop cloud instances on demand. An entire cluster can even be spun up for a single job and discarded once the job is finished.
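As a sketch of what such a throwaway cluster looks like in practice, the snippet below builds the kind of request a cloud Hadoop API accepts. The parameter names follow boto3’s EMR `run_job_flow` call, but the cluster name and sizing are made up, and no API call is actually made here:

```python
def transient_cluster_request(name, workers):
    """Describe a cluster that lives only as long as its job.

    The dict's shape follows boto3's EMR `run_job_flow` parameters
    (a sketch -- instance types and sizing are made-up examples).
    """
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": "m1.large",   # hypothetical sizing
            "SlaveInstanceType": "m1.large",
            "InstanceCount": workers,
            # False = tear the cluster down as soon as the last step
            # finishes, so it is only paid for while the job runs.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }

request = transient_cluster_request("nightly-log-crunch", 10)
```

The key design point is `KeepJobFlowAliveWhenNoSteps`: flip one flag and the cluster stops existing - and stops costing money - the moment the job ends.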
Another win for HaaS
Cost

According to a Hadoop study by Accenture, the total cost of ownership (TCO) of Hadoop on the cloud is much lower than on-premise. Plenty of money is saved on staff and technical support, and by not having to run a data center. This leaves more cash for resources - out of the same monthly budget of just over $21,000 (USD), more than five times as much can be spent on virtual machines in the cloud as on bare-metal servers on-premise (over $15,000 instead of $3,000). The elasticity of Hadoop-as-a-Service also helps save money, since only the resources actually used are paid for.
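The arithmetic behind that comparison is simple enough to spell out (rounded versions of the figures cited above, in USD per month):

```python
# Back-of-the-envelope version of the Accenture TCO comparison.
MONTHLY_BUDGET = 21_000      # total cost of ownership in both scenarios
ON_PREM_HARDWARE = 3_000     # left for bare-metal servers after overhead
CLOUD_HARDWARE = 15_000      # left for virtual machines after overhead

overhead_on_prem = MONTHLY_BUDGET - ON_PREM_HARDWARE  # staff, support, data center
overhead_cloud = MONTHLY_BUDGET - CLOUD_HARDWARE

ratio = CLOUD_HARDWARE / ON_PREM_HARDWARE  # "more than five times as much"
print(f"on-premise overhead: ${overhead_on_prem:,}")   # $18,000
print(f"cloud overhead:      ${overhead_cloud:,}")     # $6,000
print(f"hardware ratio:      {ratio:.0f}x")            # 5x
```

In other words, the budgets match; what differs is how much of each budget survives the overhead and reaches actual compute.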
HaaS takes the round
Performance

There is a common assumption that Apache Hadoop runs slowly in virtual environments due to intensive I/O operations. However, according to the same Accenture study, cloud-based setups with the same TCO as a bare-metal setup deliver better performance when running Hadoop with real-world applications.
The TCO-based comparison definitely makes HaaS the winner, but a one-on-one review on identical hardware, though more costly, would probably hand the win to on-premise thanks to faster network access and data locality. Also, in Accenture’s previous study, which used Amazon Elastic MapReduce, there were a few cases in which bare-metal Hadoop outperformed Hadoop on the cloud (see pages 10-12). We also ran some tests of our own, though both setups were in the cloud, where local HDFS outperformed Amazon S3 - writing data twice as fast and reading it almost three times as fast. One can expect the numbers to be considerably better on real servers with 6-12 SATA III disks, the best practice for Hadoop.
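Throughput numbers like these can be reproduced with TestDFSIO, the standard Hadoop I/O benchmark shipped in the MapReduce jobclient tests jar (the jar name and file sizes below are illustrative; rerunning with the benchmark directory pointed at an S3 bucket via the `test.build.data` property gives the S3-side numbers):

```shell
# Write 10 files of 1,000 MB each, then read them back;
# TestDFSIO reports the achieved throughput (MB/s) for each run.
hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
    -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
    -read -nrFiles 10 -fileSize 1000
```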
Tie - HaaS based on a TCO comparison, on-premise for a one on one hardware comparison
Ease of Use
Deploying Hadoop on-premise is difficult. It requires hiring Hadoop experts or sending sysadmins to training. Even then, provisioning, configuring, and maintaining Hadoop aren’t easy tasks. Once that’s taken care of, programming and executing Hadoop jobs requires knowledge of MapReduce and higher-level languages like Pig, which means more hiring or training.
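To give a taste of what “knowledge of MapReduce” means, here is the classic word count expressed as map and reduce functions in plain Python - a toy stand-in for what a Hadoop Streaming job would run, with function names of our own choosing:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts per word.

    Hadoop sorts and groups map output by key between the phases;
    sorted() + groupby() emulates that shuffle step here.
    """
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["finish him", "him"])))
print(counts)  # {'finish': 1, 'him': 2}
```

Simple for word count - but real jobs mean joins, chained stages, and custom partitioners, which is exactly why Pig and its kin exist.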
Hadoop-as-a-Service makes Hadoop a lot simpler. It provides easy-to-use tools to provision and configure clusters, while maintenance is taken care of. A Hadoop-as-a-Service offering may even include a user interface that allows running MapReduce jobs without any coding at all.
A clear win for HaaS
Availability & Durability

Availability for on-premise Hadoop really depends on the quality of the hardware and manpower available - there are no guarantees. High availability for the NameNode depends on whether Hadoop YARN is used, where such features are built in, or whether workarounds have been prepared for older versions.
Hadoop-as-a-Service is based on the cloud, so availability depends on which provider is used and their service level agreement. Amazon promises at least 99.95% availability for EC2, while Google promises 99.9% availability for Cloud Storage & BigQuery. High availability depends on the service provider’s implementation, so it may or may not be available. But even if the cluster stops working, for whatever reason, it’s always possible to add more machines or spin up a new cluster in the cloud - something that simply isn’t possible on-premise.
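Those SLA percentages translate into concrete downtime budgets; a quick back-of-the-envelope calculation (assuming a 30-day month) shows what each promise allows:

```python
def max_downtime_minutes(sla_percent, days=30):
    """Worst-case downtime per month permitted by an availability SLA."""
    minutes_in_month = days * 24 * 60
    return minutes_in_month * (1 - sla_percent / 100)

print(round(max_downtime_minutes(99.95), 1))  # EC2: 21.6 minutes/month
print(round(max_downtime_minutes(99.9), 1))   # Cloud Storage/BigQuery: 43.2
```

So the half-nine difference between the two promises roughly doubles the allowable downtime.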
On-premise durability depends on HDFS. According to a statistical model for HDFS data durability, the probability of losing a block of data (64MB by default) on a large 4,000-node cluster (16 PB total storage, 250,736,598 block replicas) is 5.7×10⁻⁷ in the next 24 hours and 2.1×10⁻⁴ in the next 365 days. However, for most clusters, which contain only a few dozen instances, the probability of losing data can be much higher.
Hadoop-as-a-Service durability depends on how the data is stored. If HDFS is used, it’s not much different from on-premise. However, other storage options such as S3 may tip the scale - S3 is designed for 99.999999999% durability of objects over a given year, meaning that a customer storing 10,000 objects can expect to lose a single object once every 10,000,000 years on average (see the S3 FAQ). That’s pretty durable.
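To unpack how eleven nines turn into ten million years:

```python
# S3's durability promise: 99.999999999% of objects survive a given year.
annual_loss_probability = 1 - 0.99999999999   # 1e-11 per object per year

objects_stored = 10_000
expected_losses_per_year = objects_stored * annual_loss_probability  # ~1e-7

years_until_one_loss = 1 / expected_losses_per_year
# ~10,000,000 years until a single object is expected to be lost
```

Compare that with the HDFS model above: even the well-run 4,000-node cluster expects a lost block with probability 2.1×10⁻⁴ per year, orders of magnitude worse than S3’s design target.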
HaaS has the advantage
Security

Deploying Hadoop securely on-premise means making use of Hadoop features such as Kerberos authentication and setting the right file permissions. Either way, it is the sysadmin’s responsibility.
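Concretely, moving a cluster off the default “simple” authentication starts with a couple of properties in `core-site.xml`; the rest of a secure deployment (keytabs, per-service principals, and so on) builds on top of this:

```xml
<!-- core-site.xml: replace the default "simple" (trust-the-client)
     authentication with Kerberos, and turn on service-level authorization -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```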
With Hadoop-as-a-Service, it depends on the service provider’s implementation, so it’s really important to check their security policy. Amazon Web Services, for example, provides security features such as virtual private clouds, encryption, and security groups, so HaaS can definitely be secure if implemented correctly.
It’s a tie - depends on implementation
Flexibility

Any Hadoop distribution or Hadoop-related project can be used on-premise, and any MapReduce code can be executed.
Since Hadoop-as-a-Service is managed, it may come with limitations on which distributions or Hadoop-related projects are available and on whether custom code can be written.
Finally, a round for on-premise
Conclusion

Clearly Hadoop-as-a-Service beats on-premise on elasticity, cost, ease of use, and more. Running Hadoop on-premise may be the right choice only for large organizations that use Hadoop 24/7, need to customize it, and have the budget and manpower to match. Even then, a middle way - running Hadoop in the cloud rather than as a service, as Netflix does - could be the right choice.