Storing Apache Hadoop Data on the Cloud - HDFS vs. S3

Storing Apache Hadoop Data on the Cloud - HDFS vs. S3

(Public domain image)

Ken and Ryu are both the best of friends and the greatest of rivals in the Street Fighter game series. When it comes to Hadoop data storage on the cloud though, the rivalry lies between Hadoop Distributed File System (HDFS) and Amazon’s Simple Storage Service (S3).

Although Apache Hadoop traditionally works with HDFS, it can also use S3 since it meets Hadoop’s file system requirements. Netflix’ Hadoop data warehouse utilizes this feature and stores data on S3 rather than HDFS. Why did Netflix choose this data architecture? To understand their motives let’s see how HDFS and S3 do... in battle!

Round 1 - Scalability

HDFS relies on local storage that scales horizontally. Increasing storage space means either adding larger hard drives to existing nodes or adding more machines to the cluster. This is feasible, but more costly and complicated than S3.

S3 automatically scales according to the stored data, no need to change anything. Even the space available is practically infinite (no limit from Amazon at least).

First round goes to S3

Round 2 - Durability

A statistical model for HDFS data durability suggests that the probability of losing a block of data (64MB by default) on a large 4,000 node cluster (16 PB total storage, 250,736,598 block replicas) is 5.7x10-7 in the next 24 hours and 2.1x10-4 in the next 365 days. However, for most clusters, which contain only a few dozen instances, the probability of losing data can be much higher.

S3 provides a durability of 99.999999999% of objects per year, meaning that a single object could be lost per 10,000 objects once every 10,000,000 years (see the S3 FAQ). It gets even better. About a year and a half ago one of my colleagues at Xplenty took an AWS workshop at Amazon. Their representatives claimed that they haven’t actually lost a single object in the default S3 storage (a cheaper option of Reduced Redundancy Storage (RRS) is available with a durability of only 99.99%).

Large clusters may have excellent durability, but in most cases S3 is more durable than HDFS.

S3 wins again

Round 3 - Persistence

Data doesn’t persist when stopping EC2 or EMR instances. Nonetheless, costly EBS volumes can be used to persist the data on EC2.

On S3 the data always persists.

S3 wins

Storing Hadoop Data on the Cloud - HDFS vs. S3

(Public domain image)

Round 4 - Price

To keep data integrity, HDFS stores by default three copies of each block of data. This means that the storage space needed by HDFS is 3 times than the actual data and costs accordingly. Although the data replication is not mandatory, storing just one copy would eliminate the durability of HDFS and could result in loss of data.

Amazon take care of backing up the data on S3, so 100% of the space is available and paid for. S3 also supports storing compressed files which considerably reduces the space needed as well as the bill.

Winner - S3

Round 5 - Performance

HDFS performance is great. The data is stored and processed on the same machines which improves access and processing speed.

Unfortunately S3 doesn’t perform as well as HDFS. The latency is obviously higher and the data throughput is lower. However, jobs on Hadoop are usually made of chains of map-reduce jobs and intermediate data is stored into HDFS and the local file system so other than reading from/writing to Amazon S3 you get the throughput of the nodes' local disks.

We ran some tests recently with TestDFSIO, a read/write test utility for Hadoop, on a cluster of m1.xlarge instances with four ephemeral disk devices per node. The results confirm that HDFS performs better:

  HDFS on Ephemeral Storage Amazon S3
Read 350 mbps/node 120 mbps/node
Write 200 mbps/node 100 mbps/node

HDFS takes the round

Round 6 - Security

Some people think that HDFS is not secure, but that’s not true. Hadoop provides user authentication via Kerberos and authorization via file system permissions. YARN, Hadoop’s latest version, takes it even further with a new feature called federations - dividing a cluster into several namespaces which prevents users from accessing data that does not belong to them. Data can be uploaded to the amazon instances securely via SSL.

S3 has built-in security. It supports user authentication to control who has access to the data, which at first only the bucket and object owners do. Further permissions can be granted to users and groups via bucket policies and Access Control Lists (ACL). Data can be encrypted and uploaded securely via SSL.

It’s a tie

Round 7 - Limitations

Even though HDFS can store files of any size, it has issues storing really small files (they should be concatenated or unified to Hadoop Archives). Also, data saved on a certain cluster is only available to machines on that cluster and cannot be used by instances outside the cluster.

That’s not the case with S3. The data is independent of Hadoop clusters and can be processed by any number of clusters simultaneously. However, files on S3 have several limitations. They can have a maximum size of only 5GB and additional Hadoop storage formats, such as Parquet or ORC, cannot be used on S3. That’s because Hadoop needs to access particular bytes in these files, an ability that’s not provided by S3.

Another tie

Tonight’s Winner

With better scalability, built-in persistence, and lower prices, S3 is tonight’s winner! Nonetheless, for better performance and no file sizes or storage formats limitations, HDFS is the way to go.

Integrate Your Data Today!

Get a 7-day free trial. No credit card necessary.