Happy birthday to Redshift! Amazon’s data warehouse-as-a-service has just celebrated two years of data querying.
Redshift gained a lot of popularity following its 2013 release, which was hardly surprising. Traditional data warehouses used to require custom hardware and software that took too much maintenance and cost way too much money. Redshift broke this strict paradigm by moving the whole operation onto the fluffy cloud.
Several reviews were written about Redshift at the time, but as far as we know, no one has looked back to check on what’s happening with the red giant since then. So, we went ahead and did a little checkup. Here’s our up-to-date Redshift review.
Ease of Use
Since Redshift is based on PostgreSQL, database administrators have a much easier time adapting to it than they do with other Big Data technologies. Redshift also has a fancy web UI with metrics for compute utilization, storage utilization, and read/write traffic, as well as automated tasks for data warehousing, administration, and backups to Amazon S3.
Redshift is compatible with standard JDBC and ODBC drivers. This means that it can integrate with most BI and ETL tools that support PostgreSQL. Amazon validated that it does—at least for the popular tools.
Redshift’s performance is well known across the galaxy. Back when it was released, the Airbnb Nerds did some performance tests and found out that Redshift had up to a 20x performance improvement over Hive. Berkley AMPLab’s Big Data benchmarks from February 2014 confirm that Redshift still beats Hive by a mile, although it comes in close to Impala and Spark when running in memory.
High Scalability and Elasticity
Since Redshift is on the cloud, you can easily scale to the petabyte range without having to purchase or upgrade any hardware or software. You can even shrink or grow Redshift’s clusters, depending on your current needs.
According to AWS Evangelist Jeff Barr, running a data warehouse costs $19,000-$25,000 per terabyte per year. Amazon Redshift, however, costs $1,000 per TB per year. So, a lot of companies with other data warehouse solutions could save money by switching to Redshift, and organizations who couldn’t previously afford any data warehousing could up their game.
Redshift’s data is encrypted both at rest and when in transit. Each block is encrypted with hardware-accelerated AES-256 as it is written to disk; Redshift also supports SSL when sending data.
In the event of a drive failure, a Redshift cluster will remain available, though with some performance degradation for certain queries. If an individual node fails, Redshift will automatically detect and replace it. All the data is safe too: it’s replicated within the data warehouse cluster with at least three copies, and continuously backed up to Amazon S3.
The Airbnb Nerds noticed that it takes Redshift a very long time to run joins for billions of rows. Also, since Redshift only allows you to specify one distribution key, running joins against multiple columns on a large scale may also cause performance issues. Other non-Redshift solutions are recommended in these cases (wink wink).
Some Redshift maintenance tasks don’t perform that well because they have limited resources—certain procedures, such as deleting old data, could take a while. And although Redshift shards data, it doesn’t do it optimally. You might end up joining data across different nodes, a procedure that decreases Redshift’s performance. Also, when scaling clusters, the data needs to be reshuffled among the machines. This could take several days and plenty of CPU power, slowing your system for regular operations.
Last but not least, loading data to Redshift can take a while, since the default loading mechanism is single threaded. Nonetheless, breaking the data into slices should help to load it in parallel, and you can even ship a physical storage device to Amazon, who will load the data for you.
How many queries can you run simultaneously on Redshift? Up to 50, if you ask Amazon, though they recommend setting up query queues with only a maximum of 15 concurrent queries. If you use tools like Tableau that generate a lot of queries, Redshift’s low concurrency could slow down your data querying efforts.
Limited Data Sources
Redshift doesn’t play with all the databases in the playground. You can only load data into Amazon Redshift from Amazon S3, Amazon DynamoDB, Amazon EMR, any SSH-enabled host on Amazon EC2 or on-premises, or via AWS Data Pipeline. If you’d like to load some data, say, from your MySQL database, MongoDB, or non-Amazon clouds, you’ll need to use something else.
Restricted Data Formats
Redshift is pretty tight about data formats: it only accepts flat text files in a fixed format such as CSV and JSON, though you’ll need to map JSON properties to Redshift columns. However, the serial data type, arrays, and XML are all unsupported at the moment. Redshift also has issues with newline characters, multiple NULLS, and certain UTF-8 characters, so you’ll need to spend time massaging your data before you can load it into Redshift.
The largest Redshift node comes with 16TB of storage, yet a maximum of 100 nodes can be created. Therefore, if your Big Data goes beyond 1.6PB, Redshift will not do.
The Elastic Ninja blog lists several more issues with Redshift. It claims that although AWS will replace a failed node on a cluster, this could take several minutes, which may not be acceptable for certain applications, especially if the end-users are exposed to it. The blog also claims that unlike other columnar databases, Redshift only uses a single encoding/sorting/distribution scheme which is not optimized for all cases and that Redshift doesn’t have a mechanism for defining sort orders or distribution. Finally, Elastic Ninja claims that Redshift does require some maintenance to keep it running optimally after all, for example, updating statistics and running the “vacuum” statement to optimize new data for querying.
All in all, Redshift hasn’t lost its glow since being released. Redshift is an affordable data warehouse solution with ease of use, high performance, PostgreSQL compatibility, scalability and elasticity, tight security, and fault tolerance. Nonetheless, Redshift does have certain performance issues, concurrency limitations, restricted data sources and formats, confined scaling, and several other constrictions. Still, considering the other options out there, Redshift is still an excellent solution for interactive querying of Big Data for the SQL savvy.