Choosing the right data warehouse is a critical component of your general data and analytic business needs. One of the biggest questions that businesses ask when they're choosing their data warehouse providers is this: Should you use Amazon or Google's data warehouse for your business needs?
Both Amazon and Google have impressive data warehouses with Redshift and BigQuery. Each of these solutions can run analytics at-scale rapidly. We're not comparing apples and oranges here. This is apples to apples. But, there are real use cases that each of them excel at, and both solutions can be valuable depending upon your business's situational needs.
At Xplenty, we support both solutions. So, this post will act as a guide for businesses looking to understand which data warehouse is best suited for their particular workflows and projects.
Here's a rundown of this post's structure.
- What is a Data Warehouse?
- Understanding Data Processing Systems
- What is RedShift?
- What is BigQuery?
- Redshift vs. BigQuery
- Final Thoughts
What is a Data Warehouse?
Data warehouses (sometimes called columnar storage solutions) are dumping grounds where you can throw all of your BI data for analytic processing. Both Redshift and BigQuery are data warehouses. You can throw all of your data from your blended tech stack and start to run analytics on it to help you make critical business decisions, forecast trends, budget, etc.
A typical data warehouse use case would be trend analysis. Businesses push all of their tech stack data (e.g., customer service, marketing, sales, HR, etc.) into the warehouse to run analytic workloads.
Example: A business may want to know more about their sales leads. This will help them better understand their customers and personalize sales pitches and content delivery. To do this, that business can connect their Salesforce data with a data warehouse and run a query to discover which leads are the most valuable and which ones are most likely to churn
Beyond columnar storage, data warehouses like Redshift and BigQuery have Massively Parallel Processing (or MPP.) This lets them distribute query requests across multiple servers to accelerate processing. So, multiple processors — each with their own memory and operating system — will handle specific segments of the query.
To really understand why data warehouses are valuable for analytic workloads, you need to understand the differences between Online Transaction Processing (OLTP) and Online Analytic Processing (OLAP) data processing systems.
Understanding Data Processing Systems
Let's quickly address the differences between OLTP and OLAP data processing systems.
OLTP (or Online Transaction Processing) is what most business use for processing transactions during day-to-day operations (think ATMs, retail sales systems, text messaging, etc.) And, we've all been using for OLTP for over 40 years (it's still mind-boggling that SQL was released in the early 70s.) OLTP stores each row in a table as an object.
OLTP's primary goal is data processing. It's great at rapid processing that maintains data integrity over multiple sequences.
Example: Let's say that two people withdraw money from the same online bank account at precisely the same moment. OLTP will take the first authorized user and process that transaction. And, it will ensure that neither user is able to withdraw more money than is present in the bank account — even if they both start the operation simultaneously. To do this, OLTP runs checks against every row in the query. This ability to perform ACID (Atomicity, Consistency, Isolation, Durability) transactions means that OLTP is extremely useful for ensuring data validity in the case of errors or outages.
OLAP (or Online Analytic Processing) is what data warehouses use to run queries. OLAP stores each column as an object. So, it's great at crawling through massive data sets to find trends. And, OLAP can skip over pieces of data to find the exact data you need to aggregate.
Example: Let's say you wanted to run a query (say, finding every revision on Wikipedia) on an OLTP database. That OLTP database would have to access every single field in every row to run that process. With OLAP, it can utilize columns to access only the fields you need — saving you an absolutely enormous amount of computing power and time.
In fact, OLAP is so fast at analytic processing, that the majority of businesses utilizing data warehouses look for sub-10 speeds (i.e. under 10 seconds.) There are certainly use cases where the processing could take an hour or more, but you would be dealing with massive amounts of data at that point and hyper-complex schemas.
What is Redshift?
Redshift is Amazon's data warehouse, and it's part of Amazon's massive overall cloud architecture, AWS.
Amazon acquired the source code for Redshift from ParAccel, who was developing ParAccel Analytic Database — a Postgre-based database that utilized columnar data organization. So, Redshift is an MPP data warehouse that's built on a PostgreSQL fork. While Redshift shares many of commonalities with Postgre (such as its relational qualities,) it also is unique in that it's columnar, doesn't support indexes, and uses distribution styles and keys for data organization. Amazon also has a unique query execution engine for Redshift that differs from Postgre.
The important thing to note about Redshift being built on top of a Postgre fork is that it maintains some of its transactional qualities — making is a hybrid database of sorts. Redshift can still roll-back on transactions, which is a semi-unique feature in the data warehouse market.
What is BigQuery?
BigQuery is Google's data warehouse, and it's part of Google's massive overall cloud architecture, Google Cloud.
BigQuery was one of the first major data warehouses on the market — following C-Store and Monet DB. To function, BigQuery executes Dremel (A query engine developed by Google for read-only nested data that supports an SQL-like syntax) over a REST interface.
Google defines Dremel as:
"Dremel is a query service that allows you to run SQL-like queries against very, very large data sets and get accurate results in mere seconds."
When BigQuery first launched, it had strictly maintained Dremel's strange hybrid SQL language — which was awkward at best. Now, it supports standard SQL language.
Google has some unique technology that powers BigQuery operations. Here's a brief overview of a typical job execution:
- Borg (Google's large-scale cluster management) allocates resources to Dremel jobs (which are typically executed over Google's Web UI or REST.)
- Colossus (Google's planet-scale storage system) provides the data to each Dremel job.
- Capacitor (Google's columnar storage format) organizes and compresses the data being pulled for the Dremel job.
- Juniper (Google's inner data network) translates and helps Dremel jobs read data on the Colossus system.
Redshift vs. BigQuery
To compare the two, we're going to go over three primary categories: pricing, performance, and manageability.
Price: Redshift vs. BigQuery
Redshift's pricing model is extremely simple. For the purposes of this comparison, we're not going to dive into Redshift Spectrum* pricing, but you can check here for those details.
*Redshift Spectrum allows you run Redshift queries directly against Amazon S3 storage — which is useful for tapping into your data lakes if you use Amazon simple storage for your business needs.
With Redshift, you can choose from either Dense Compute or the large Dense Storage. The cheapest node you can spin up will cost you $0.25 per/hour, and it's 160GB with a dc2.large node. Dense Storage runs at $0.425 per TB per hour. This cost covers both storage and processing. Redshift So, the lowest price you can get on Redshift is $306 per TB per month. And, you can pay upfront for massive discounts.
This makes RedShift interesting to work with. If you can calculate your run-times and how often you'll need to spin up each node, you can cut costs dramatically — especially if you pay upfront. Since, most businesses aren't going to be constantly running their Redshift nodes, getting granular is usually in your best interest.
For example, you may only run Redshift during the day when people are interacting with your stack or service. If that's the case, you can adjust your upfront buying habits to reflect that behavior.
BigQuery's pricing is much more complicated. On the surface, BigQuery looks cheaper. Storage costs $20 per TB per month, a good $286 cheaper than Redshift. But, BigQuery charges separately for storage and querying. Queries cost $5/TB. So, while storage is cheaper, query costs can add up quickly.
There are some pros and cons to this method. Really, BigQuery is perfect for a certain type of customer. Let's say your business deals with spikey workloads. You run rapid queries a few times a day. BigQuery would be a far better option since you have to pay by the hour for Redshift. BigQuery may also be the best solution for data scientists running ML or data mining operations — since you're dealing with extremely large, spikey workloads.
BigQuery costs $20 per TB per month for the storage line and $5 per TB processed on that storage line.
RedShift costs $306 per TB per month for storage AND unlimited processing on that storage.
There isn't a winner here. RedShift is more economical for everyday data warehouse operations for most businesses. But, BigQuery is better for businesses looking to do data mining or those who deal with extremely variant workloads.
Let's look at a few examples:
Example 1: Let's say that you only run queries around 5% of your day. In this case, BigQuery is probably going to be more cost-effective since you're paying for query processing on-demand. Since your paying $5/TB of data processed, you may only process three 100GB chunks during a day. This would cost you $1.50 plus the $0.70 for storage. You see this all the time with businesses that are mainly using their data warehouse to perform data mining jobs in chunks. And, this also makes BigQuery valuable for data scientists who are running jobs a few times a day.
Example 2: Let's say your business wants a day-to-day warehouse to help with your sales or marketing stack. In this case, you need constant run-time and the ability to perform queries hundreds or thousands of times per day. Redshift is probably going to be cheaper since you aren't going to be charged for each of those queries. So, let's say each of those hundreds of queries processes 50GB. You would be paying $5/TB with BigQuery, and your costs would add up rapidly. With Redshift, you're simply charged for how long you use your nodes.
Performance is tricky when it comes to Redshift vs. BigQuery. Since BigQuery simply abstracts prices based on how much data you process, you're not locked into a specific resource when you run a query. Redshift, on the other hand, is limited by the node you're running. But, that's not the only factor that goes into query performance.
The size of your data table, schema complexity, and the number of concurrent queries (50 is the max for both) that you're running also make a massive difference. There have been plenty of benchmarks comparing the two over the years. But, none of those benchmarks are particularly helpful in a broad sense.
To get a sense of how absurd some of the benchmark wars are between BigQuery and Amazon, here is a list of the drama.
- Google presented a TPC-H benchmark at CloudAir in San Francisco in 2016 that showed BigQuery outperforming Amazon (they decided to only use one of the performance metrics instead of all 26.)
- Amazon (very sarcastically) rebutted claiming that Google was cherrypicking the query and ran a similar TPC-H benchmark that showed Redshift outperforming BigQuery on almost all tests (they conveniently decided to use an 8-node DC1.8XL which runs at about $20k a month)
- An independent researcher decided to run a similar test on taxi cab data, which showed that BigQuery was 43x faster than Redshift (the research is flawed due to node selection and query type.)
- Then, more back-and-forth between the two commenced.
- At this point, about 500 private companies publish their own benchmarks to cherrypick the results they need to sling their product.
That's the unfortunate state of industry benchmarks. We could spin up a query and run a benchmark, but it would have its own issues with generalizability. And, there are certainly cases where both solutions outperform each other. The complexity of the schema, joins, resources, tables, etc. are too diverse to give you a grounded answer on benchmark performance.
So, let's talk about what each one is really good at.
In our experience with clients, Redshift is great at handling everyday business processes. This means spinning a node during work hours for BI tools and interfaces. It's less expensive, has plenty of power to handle semi-complex schemas, and it's easy-to-use.
BigQuery is great at handling niche business workloads that query big chunks in a small timeframe and for data scientists and ML/data mining.
In many cases, the difference between the two is going to depend upon your Redshift resources. So, if you're paying for a single dc2.large node, BigQuery is most likely going to outperform Redshift. But, if you're spinning up an expensive 8-node DC1.8XL, Redshift is probably going to outperform BigQuery.
The devil is in the details.
When we start to talk about manageability, things, again, get complex. The vast array of features provided by both Redshift and BigQuery make extrapolating ease-of-use incredibly complicated.
We're going to focus on 4 key layers of manageability. But, there are certainly additional variables (like a million of them) to take into account.
Here's what we'll cover.
- Data types/updates and deletes
Data Types/Updates and Deletes
Redshift supports standard SQL data types and BigQuery works with some standard SQL data types and a small range of sub-standard SQL. One of the biggest benefits of BigQuery is that it treats nested data classes as first-class citizens due to its Dremel capabilities. With Redshift, you have to flatten out your data before running a query.
Both of them can handle updates and deletes when something goes wrong in the query. Since BigQuery and Redshift are append-only, many assume they can't do updates and deletes. They can. On BigQuery, the update and delete process exists, but it's relatively expensive, and your options are limited. So, it's not a widely used feature. With Redshift, you can reclaim tables with Postgre Vacuuming (which has its own host of complications) so update and delete support is typically better with Redshift.
Also, Redshift gives users the ability to roll back on transactions, which BigQuery doesn't.
Out-of-the-box, BigQuery is much simpler to use than Redshift. You don't have to perform many tweaks, cluster management is a breeze, and the complexities of database config, etc. are handled by BigQuery. That being said, Xplenty abstracts the complexities of Redshift away users with easy-to-perform workflows and integrations, so Redshift doesn't have to be difficult to use.
When it comes to security, both systems are comparable. Redshift uses Amazon IAM for identity, and BigQuery uses Google Cloud IAM. Both services work perfectly for almost all business scenarios. Google does have great B2B identity management with OAuth, so you can give identity controls to 3rd parties without introducing them into your entire ecosystem.
Both Google and Amazon (unsurprisingly) have a wealth of integrations available. Almost every major BI and data analysis tool runs perfectly well with both warehouses. We won't go in-depth with this section. Sure, Redshift is built on a Postgres fork, so it originally had more native integrations, but the playing field has leveled simply due to the sheer volume of warehouse transactions that Google processes (BI tools don't want to miss out on the revenue stream.)
Both BigQuery and Redshift are incredible data warehouse systems that can help businesses redefine their daily workflows. There are some differences, but there are far more similarities. Probably the biggest consideration for most businesses is price. Redshift is a little easier to forecast pricing with simple due to its on-demand, by-the-hour nature. But, in many business scenarios, BigQuery's $5/TB query cost may make more sense.