While it might not be as epic as Batman vs. Superman or the Rumble in the Jungle, the world of big data has its own share of major rivalries. In November 2012, Amazon announced Redshift, a data warehouse as a service that scales on demand for as little as $1,000 per terabyte per year. Since then, Apache Hadoop, created in 2005, hasn't been the only big data superhero on the block. This leaves us with one big question: how does Hadoop compare with Amazon Redshift?
Let’s get both contenders in the ring and find out in the battle of Hadoop vs. Redshift.
Table of Contents
- Hadoop vs. Redshift: The Basics
- Hadoop vs. Redshift: Scaling
- Hadoop vs. Redshift: Performance
- Hadoop vs. Redshift: Pricing
- Hadoop vs. Redshift: Ease of Use
- Hadoop vs. Redshift: Formats and Types
- Hadoop vs. Redshift: Data Integrations
- Hadoop vs. Redshift: The Winner
- How Xplenty Can Help
Hadoop vs. Redshift: The Basics
Time for the Hadoop vs. Redshift battle! In the left corner, wearing a black cape, we have Apache Hadoop. Hadoop is an open-source framework for distributed processing and storage of big data on commodity machines. It uses HDFS, a dedicated file system that cuts data into small chunks and spreads them optimally over a cluster. The data is processed in parallel on the machines via MapReduce (and since Hadoop 2.0, the YARN resource manager allows other processing frameworks to run on the cluster as well).
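To make the MapReduce model concrete, here is a minimal, pure-Python sketch of the classic word count. This is an illustration of the programming model only, not the actual Hadoop API: mappers emit key-value pairs from their chunk of input, and reducers aggregate all values that share a key.

```python
from collections import defaultdict

def map_phase(chunks):
    # Each mapper emits (word, 1) pairs for its chunk of the input.
    for chunk in chunks:
        for word in chunk.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # The framework groups pairs by key; each reducer sums the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two "chunks" standing in for HDFS blocks processed on different machines.
chunks = ["Hadoop stores data in HDFS", "MapReduce processes data in parallel"]
counts = reduce_phase(map_phase(chunks))
print(counts)
```

In a real cluster, the map and reduce calls run on different machines, and HDFS handles splitting and replicating the input.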
In the right corner, wearing a red cape, we have Redshift. Amazon Redshift's data warehouse as a service is built on technology acquired from the data warehouse vendor ParAccel. Under the hood, Redshift runs a modified, older version of PostgreSQL with three major enhancements:
- Columnar database: A columnar database returns data by columns rather than whole rows. It has better performance for aggregating large sets of data, perfect for analytical queries.
- Sharding: Redshift supports data sharding—that is, partitioning the tables across different servers for better performance.
- Scalability: With everything running on the cloud, Redshift clusters can be easily upsized and downsized as needed.
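The columnar point above is easy to see in miniature. The toy Python sketch below (plain lists, not Redshift's actual storage engine) contrasts the two layouts: an aggregate query only has to touch one column in a column store, while a row store must read every field of every row.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user_id": 1, "country": "US", "revenue": 120.0},
    {"user_id": 2, "country": "DE", "revenue": 80.0},
    {"user_id": 3, "country": "US", "revenue": 45.5},
]

# Columnar layout: each column is stored contiguously.
columns = {
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "revenue": [120.0, 80.0, 45.5],
}

# For a query like SELECT SUM(revenue), a row store scans whole rows...
total_rows = sum(r["revenue"] for r in rows)
# ...while a column store reads just the one column it needs.
total_cols = sum(columns["revenue"])

assert total_rows == total_cols == 245.5
```

Same answer either way, but on disk the columnar version reads a fraction of the bytes, which is exactly what makes it a good fit for analytical queries.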
Traditional data warehousing solutions by companies like Oracle and EMC have been around for a while, though only as million-dollar on-premise racks of dedicated machines. Amazon’s innovation in creating Redshift, therefore, lies in pricing and capacity. Their pay-as-you-go promise, as low as $1,000/terabyte/year, makes a powerful data warehouse within reach for small and medium-sized businesses who couldn’t previously afford it. Because Redshift is in the cloud, it shrinks and grows as needed, instead of having to use big dust-gathering machines with a fixed size that you need to maintain on a regular basis.
Enough words—time to battle. Are you ready? Let’s get ready to have a Hadoop vs. Redshift rumble!
Hadoop vs. Redshift: Scaling
Redshift comes with a built-in maximum storage capacity: RA3.16XL clusters have a managed storage quota of 64 terabytes per node, and can scale up to 128 nodes. This means that Redshift tops out at 64 x 128 = 8192 terabytes or 8.192 petabytes. If you have more than this amount of data, or if you expect to have more in the future, then Redshift won’t work for you. According to a 2016 survey by IDG, the average enterprise manages 348 terabytes of data—but the world of big data has expanded exponentially since then, and many organizations generate multiple terabytes of data on a daily basis. Also, when scaling Amazon’s clusters, the data needs reshuffling among the machines, which could take several days and plenty of CPU power—slowing down your system and blocking other operations.
Fortunately, Hadoop scales to as many petabytes as you want. Twitter, for example, reportedly has a 300-petabyte Hadoop cluster that it hosts in the Google cloud. What’s more, scaling Hadoop doesn’t require reshuffling, since new data gets saved on the new machines. In case you do want to balance the data, there is a Hadoop rebalancer utility available.
The first round goes to Hadoop!
Hadoop vs. Redshift: Performance
According to several performance tests made by the nerds over at Airbnb, a Redshift 16-node dw.hs1.xlarge cluster performed a lot faster than a Hive/Elastic MapReduce 44-node cluster. For example, in a simple range query against 3 billion rows of data, Hadoop took 28 minutes to complete, while Redshift took just 6 minutes. Another Hadoop vs. Amazon Redshift benchmark made by FlyData, a data synchronization solution for Redshift, confirms that Redshift performs faster when working with terabytes of data.
Nonetheless, there are some constraints to Redshift’s super speed. Certain Redshift maintenance tasks have limited resources, so procedures like deleting old data could take a while. Although Redshift shards data, it doesn’t do it optimally. You might end up joining data across different nodes and miss out on the improved performance.
Plus, Hadoop still has some tricks up its utility belt. FlyData’s benchmark concludes that while Redshift performs faster for terabytes, Hadoop performs better for petabytes. Airbnb agrees, and states that Hadoop does a better job of running big joins over billions of rows. Unlike Redshift, Hadoop doesn’t have hard resource limitations for maintenance tasks. As for spreading data across nodes optimally, saving it in a hierarchical document format should do the trick. It may take extra work, but at least Hadoop has a solution.
In this round, we have a tie: Redshift wins for terabytes, Hadoop for petabytes.
Hadoop vs. Redshift: Pricing
The question of Hadoop vs. Redshift pricing is a tricky one. Amazon claims that “Redshift costs less to operate than any other data warehouse.” However, Redshift’s pricing depends on the choice of region, node size, storage type, and whether you work with on-demand or reserved resources. Paying $1,000/terabyte/year might sound like a good deal, but it only applies to a 3-year reserved XL node with 2 terabytes of storage in the U.S. East (North Virginia) region. Working with the same node and the same region on-demand costs $3,723/terabyte/year, more than triple the price, and choosing the Asia Pacific region costs even more.
On-premises Hadoop is definitely more expensive. According to Accenture’s "Hadoop Deployment Comparison Study", the total cost of ownership of a bare-metal Hadoop cluster with 24 nodes and 50 terabytes of HDFS is more than $21,000 per month. That’s about $5,040/terabyte/year, including maintenance. However, it doesn’t make sense to compare pears with pineapples, so let’s stick to comparing Redshift with Hadoop as a service.
Pricing for Hadoop as a service isn’t exactly transparent, since it depends on how much juice you need. FlyData’s benchmark claims that running Hadoop via Amazon’s Elastic MapReduce is 10 times more expensive than Redshift. Using Hadoop on Amazon’s EC2 is a different story, however. Running a relatively low-cost m1.xlarge machine with 1.68 terabytes of storage for 3 years (heavy reserve billing) in the U.S. East region costs about $124 per month, so that’s about $886/terabyte/year. Working on-demand, using SSD drive machines, or a different region will increase prices.
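These per-terabyte figures are just unit conversions, which you can sanity-check yourself. A quick sketch, using the numbers quoted above:

```python
def cost_per_tb_year(monthly_cost, storage_tb):
    """Normalize a monthly machine cost to dollars per terabyte per year."""
    return monthly_cost * 12 / storage_tb

# m1.xlarge, 3-year heavy reserved, U.S. East (figures from the text above)
ec2 = cost_per_tb_year(124, 1.68)
print(f"EC2 Hadoop: ~${ec2:,.0f}/terabyte/year")  # roughly $886
```

Plug in your own region's rates and storage sizes to compare deployments on equal footing.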
There's no clear winner in this round—it all depends on your needs.
Hadoop vs. Redshift: Ease of Use
Redshift automates data warehouse administration tasks and backs your data up automatically to Amazon S3. Transitioning to Redshift should be a piece of cake for PostgreSQL developers, since they can use the same queries and SQL clients that they’re used to.
Handling Hadoop, whether in the cloud or not, is trickier. Your system administrators will need to learn Hadoop-specific architecture and tools, and your developers will need to learn coding in Pig or MapReduce. Heck, you might even need to hire new staff with Hadoop expertise. There are, of course, Hadoop-as-a-service solutions that save you from all that trouble (ahem). However, most data warehouse devs and admins will still find it easier to use Redshift.
Redshift takes the round here.
Hadoop vs. Redshift: Formats and Types
When it comes to file formats, both Redshift and Hadoop are fairly accepting. Redshift accepts flat text files as well as CSV, Avro, JSON, Parquet, ORC, and shapefiles. Hadoop likewise accepts a wide variety of file formats: text files, CSV, SequenceFiles, Avro, Parquet, RCFile, and ORC.
In terms of data types, it gets a little more complicated. If your choice of Hadoop or Redshift doesn’t support the data types you need, you’ll need to spend time converting your data before you can use it. Redshift, for example, supports the following data types:
- SMALLINT (two-byte integers)
- INTEGER (four-byte integers)
- BIGINT (eight-byte integers)
- DECIMAL (exact numeric of selectable precision)
- REAL (single-precision floating point)
- DOUBLE PRECISION (double-precision floating point)
- BOOLEAN (true/false)
- CHAR (fixed-length string)
- VARCHAR (variable-length string)
- DATE (calendar date)
- TIMESTAMP (date and time without time zone)
- TIMESTAMPTZ (date and time with time zone)
- GEOMETRY (geospatial data)
- HLLSKETCH (special data type for HyperLogLog)
- TIME (time of day without time zone)
- TIMETZ (time of day with time zone)
As you can see, both services offer very similar data types, including integers and floating-point numbers, strings, and time-based data. But they’re not completely identical: Redshift has data types for geospatial data and the HyperLogLog algorithm, while Hadoop has complex data types such as arrays, structs, and maps.
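Because Redshift has no native complex types like Hadoop's arrays and structs, nested records typically have to be flattened before loading. Here is a rough Python sketch of that conversion; the record layout and the prefix/delimiter conventions are hypothetical choices, not a fixed standard.

```python
# Hypothetical nested record as it might live in Hadoop (Hive STRUCT/ARRAY).
record = {
    "user_id": 7,
    "address": {"city": "Berlin", "zip": "10115"},  # struct
    "tags": ["trial", "newsletter"],                # array
}

def flatten(rec):
    """Flatten structs into prefixed columns and arrays into a delimited
    string, so the record maps onto Redshift's scalar column types."""
    flat = {}
    for key, value in rec.items():
        if isinstance(value, dict):
            for k, v in value.items():
                flat[f"{key}_{k}"] = v  # address -> address_city, address_zip
        elif isinstance(value, list):
            flat[key] = "|".join(map(str, value))  # tags -> "trial|newsletter"
        else:
            flat[key] = value
    return flat

print(flatten(record))
```

Going the other direction is simpler, since every Redshift scalar type has an obvious Hadoop counterpart.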
It’s another tie—we’ll let you decide which file formats and data types are most important to you.
Hadoop vs. Redshift: Data Integrations
Redshift can load data only from Amazon S3, Amazon DynamoDB, Amazon EMR, or remote hosts over SSH. In other words, unless you load all of your Redshift data from a remote host, you’ll need to store it within the AWS ecosystem. Not only will you have to use more of Amazon’s services, but you’ll need to spend extra time preparing and uploading the data.
Redshift loads data via a single thread by default, so it could take some time to load. Amazon suggests certain best practices to speed up the process, such as splitting the data into multiple files, compressing them, using a manifest file, etc. Moving the data to DynamoDB is, of course, a bigger headache, unless it’s already there.
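The manifest-file practice mentioned above is straightforward to script. The sketch below generates a COPY manifest for a set of pre-split, compressed files; the bucket and file names are made up for illustration, but the manifest structure (`entries` with `url` and `mandatory` keys) is the format Redshift's COPY command expects.

```python
import json

# Hypothetical S3 objects holding pre-split, gzip-compressed data files.
data_files = [
    "s3://my-bucket/events/part-00.csv.gz",
    "s3://my-bucket/events/part-01.csv.gz",
    "s3://my-bucket/events/part-02.csv.gz",
]

# A manifest lists every file explicitly, so Redshift loads them in
# parallel across slices and fails fast if any file is missing.
manifest = {
    "entries": [{"url": url, "mandatory": True} for url in data_files]
}

with open("events.manifest", "w") as f:
    json.dump(manifest, f, indent=2)

# The load itself is then a single SQL statement along the lines of:
# COPY events FROM 's3://my-bucket/events.manifest'
#   IAM_ROLE '...' GZIP CSV MANIFEST;
```

Splitting the data into one file per slice (or a multiple thereof) is what lets Redshift parallelize the load instead of reading one big file on a single thread.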
Life is more flexible with Hadoop. You can store data on local drives, in a relational database, or in the cloud (including in S3), and then import it straight into the Hadoop cluster.
Another round for Hadoop.
Hadoop vs. Redshift: The Winner
We have a tie!
But wait! Didn’t Hadoop win more of the rounds? Yes, it did, but these two superheroes of big data are better off working together as a team instead of at each other's throats.
Turn on the Hadoop-Signal when you need relatively cheap data storage, batch processing of petabytes, or processing data in non-relational formats. Call out to red-caped Redshift for analytics, fast performance for terabytes, and an easier transition for your PostgreSQL team. As Airbnb concluded in their benchmark: "We don’t think Redshift is a replacement of the Hadoop family due to its limitations, but rather it is a very good complement to Hadoop for interactive analytics." We couldn't have said it better ourselves.
How Xplenty Can Help
Looking to take your Amazon Redshift usage to the next level? If you need to push data into Amazon Redshift, Xplenty is here to help. Our cloud-based ETL (extract, transform, load) solution provides simple visual data pipelines for building automated data workflows across a wide range of sources and destinations.
Ready to give Xplenty a try? Contact our team today to schedule a demo and a risk-free trial, and start experiencing the platform for yourself.