There are three critical differences between BigTable and BigQuery:
- BigTable is an OLTP (online transaction processing) system, whereas BigQuery is a business-intelligence/OLAP (online analytical processing) system.
- BigTable is a NoSQL system, whereas BigQuery is somewhat of a hybrid: it uses a SQL dialect and is built on "Dremel", Google's internal column-oriented data processing technology.
- BigTable is mutable and offers fast key-based lookups, whereas BigQuery is immutable and key-based lookups are slow.
The global data sphere is growing exponentially each year; the International Data Corporation (IDC) estimates it will reach 175 zettabytes (175 trillion gigabytes) by 2025. In response to this growth, engineers have built cloud storage systems that are highly scalable, highly reliable, highly available, low cost, self-healing, and decentralized. Cloud storage typically has two main branches: distributed file systems and distributed databases.
A distributed file system spreads data across multiple file servers or locations. It allows users of physically distributed systems to share data and resources through a common file system: clients can access and process data stored on the system as if it were on their own machine.
A distributed database is a group of multiple, logically related databases distributed over a computer network.
Google developed the Google File System (GFS) to meet the growing processing demands it faced during the early 2000s - more specifically, to address the problems of storing and analyzing vast numbers of web pages (indexing web content). GFS essentially provides efficient, reliable access to data using large clusters of commodity hardware. The MapReduce paper followed in 2004, outlining a distributed computing and analysis model for processing massive data sets with a parallel, distributed algorithm on a cluster. To mitigate the challenges of handling large amounts of formatted and semi-formatted data, the large-scale database system Bigtable emerged from the Google forge - built on top of GFS and designed to serve as an efficient data source and sink for MapReduce jobs.
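The MapReduce model mentioned above splits work into a map phase that emits key/value pairs and a reduce phase that aggregates them by key. Here is a minimal, single-machine sketch of that idea (a word count, the canonical example from the paper); the function names are illustrative, not part of any real MapReduce API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big query", "big table"]
result = reduce_phase(map_phase(docs))
# result == {"big": 3, "data": 1, "query": 1, "table": 1}
```

In a real cluster the map and reduce phases each run in parallel across many machines, with the framework handling the shuffle between them.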
Table of Contents
- Overview of BigTable and BigQuery
- OLTP vs OLAP
- NoSQL vs SQL
- Mutable vs Immutable
- How Xplenty Can Fuel Your Data Mining
Overview of BigTable and BigQuery
BigTable is a petabyte-scale, fully managed NoSQL database service ("NoSQL Database as a Service") - supporting weak consistency and capable of indexing, querying, and analyzing massive amounts of data. It is ideal for storing vast quantities of single-keyed data, supports high read and write throughput at low latency, and is a perfect data source for MapReduce operations. Google products that use Bigtable include Analytics, Finance, Orkut, Personalized Search, Writely, and Earth.
BigQuery is a powerful business intelligence tool that falls under the "Big Data as a Service" category, built using BigTable and Google Cloud Platform. It is serverless and fully managed. As a SQL data warehouse, it is capable of rapid SQL queries and interactive analysis of massive datasets (on the order of terabytes or petabytes). The platform uses a columnar storage paradigm that allows for much faster data scanning, plus a tree architecture model that makes querying and aggregating results significantly more manageable and efficient. BigQuery also integrates with the Apache Big Data ecosystem: existing Hadoop, Spark, and Beam workloads can read or write data directly from BigQuery.
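The columnar storage paradigm is easiest to see side by side with a row-oriented layout. The following toy sketch (plain Python lists and dicts, not BigQuery's actual storage format) shows why an aggregate over one column is cheaper when that column is stored contiguously:

```python
# Row-oriented layout: every row carries every column, so an
# aggregate must deserialize whole rows.
rows = [
    {"user": "a", "country": "US", "revenue": 10.0},
    {"user": "b", "country": "DE", "revenue": 25.0},
    {"user": "c", "country": "US", "revenue": 5.0},
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "user": ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "revenue": [10.0, 25.0, 5.0],
}

# An aggregate like SUM(revenue) only needs to touch one column
# in the columnar layout, skipping "user" and "country" entirely.
total = sum(columns["revenue"])
# total == 40.0
```

At BigQuery's scale, scanning one column instead of every column translates directly into less I/O per query, which is why analytical warehouses favor this layout.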
BigQuery is the external implementation of one of the company's core technologies, code-named Dremel (2006); the two share the same foundational architecture. Dremel is essentially a query execution engine, capable of independently scaling compute nodes to handle computationally intensive queries. BigQuery also leverages a number of other Google technologies: Borg (cluster management), Colossus (the successor to the Google File System), Capacitor (the columnar storage format), and Jupiter (the network). If you want to offload data processing workloads using BigQuery, check out Xplenty's tutorial.
Integrate Your Data Today!
Try Xplenty free for 7 days. No credit card required.
OLTP vs OLAP
BigQuery is an OLAP (online analytical processing) system; query latency is slow, so its best use cases are heavy query workloads such as traditional OLAP reporting and archiving jobs. The design does not encourage OLTP (online transaction processing)-style queries - to put this into context, small reads and writes take ~1.8 seconds in BigQuery, while BigTable takes ~9 milliseconds for the same operation. BigQuery supports atomic single-row operations but does not provide cross-row transaction support. On the other hand, the immutable nature of BigQuery tables means that queries can be executed very efficiently in parallel: reporting and OLAP-style queries against enormous datasets run across a large number of nodes at once.
The fast read-by-key and update operations make Bigtable most suitable for OLTP workloads. The data model stores information in tables; each row has a primary key (the row key) that is unique for each record, which is what enables fast reads and updates, and columns can be added to a row at any time - the structure is similar to a persistent sorted map. Reporting/OLAP workloads are also possible, as Bigtable provides efficient support for key-range iteration. However, if interactive querying in an online analytical processing setup is the prime concern, use BigQuery.
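The two access patterns described above - point lookup by primary key and key-range iteration - can be sketched with a toy sorted key/value table. This is an illustrative model only (class and method names are invented, and real Bigtable shards this structure across many tablet servers):

```python
import bisect

class SortedKeyValueTable:
    """Toy model of point lookup plus key-range iteration."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._data = {}   # row key -> row contents

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value          # an update is an in-place overwrite

    def get(self, key):
        return self._data.get(key)       # fast key-based lookup

    def scan(self, start, end):
        # Efficient range iteration: keys are sorted, so a scan
        # touches only the contiguous slice [start, end).
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for k in self._keys[lo:hi]:
            yield k, self._data[k]

t = SortedKeyValueTable()
t.put("server1#2024", {"cpu": 40})
t.put("server1#2025", {"cpu": 55})
t.put("server2#2024", {"cpu": 30})
keys = [k for k, _ in t.scan("server1#", "server1#~")]
# keys == ["server1#2024", "server1#2025"]
```

Note how prefixing the row key with the entity ("server1#...") keeps related rows adjacent, which is exactly what makes key-range scans useful for reporting over one entity's history.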
NoSQL vs SQL
BigTable is essentially a NoSQL database service; it is not a relational database and does not support SQL or multi-row transactions, making it unsuitable for a wide range of applications. It is only a suitable solution for mutable data sets with a minimum data size of one terabyte; with anything less, the overhead is too high. Its main characteristics are horizontal scalability (and the very high read/write throughput that comes with it) and its key-column model: there is one row key under which there can be multiple columns, and those columns can be updated. Performance suffers if individual data elements are larger than 10 megabytes; for unstructured objects larger than this, e.g., video files, Cloud Storage is most likely a better option. BigTable is best suited to scenarios such as time-series data (CPU and memory usage over time for multiple servers), financial data (transaction histories, stock prices, and currency exchange rates), and IoT use cases.
With BigQuery, it is possible to run complex analytical SQL-based queries over large sets of data. However, one can additionally apply NoSQL techniques; for example, it is encouraged to denormalize data when designing schemas and loading data into BigQuery for performance purposes. It is a good choice when queries require a "table scan" or need to look across the entire database (sums, averages, counts, groupings). However, there are limitations: a limited number of table updates per day, restrictions on data size per request, and others. BigQuery is not intended to substitute for traditional relational databases; it focuses on running analytical queries rather than basic CRUD operations.
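The "table scan" style of query described above - aggregating over every row rather than fetching one record - looks like the following. This sketch uses Python's built-in sqlite3 as a stand-in engine (the table name and data are invented); BigQuery would run the same kind of GROUP BY aggregation, just distributed across many nodes:

```python
import sqlite3

# A denormalized events table: dimension values are repeated on
# every row, in the spirit of BigQuery schema design (joins are
# avoided up front rather than performed at query time).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, country TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a", "US", 10.0), ("b", "DE", 25.0), ("c", "US", 5.0)],
)

# A full-table-scan aggregation: the workload BigQuery is built for.
result = conn.execute("""
    SELECT country, COUNT(*) AS n, SUM(revenue) AS total
    FROM events
    GROUP BY country
    ORDER BY total DESC
""").fetchall()
# result == [("DE", 1, 25.0), ("US", 2, 15.0)]
```

The same query shape (sums, counts, groupings over the whole table) is exactly what performs poorly on a row-oriented OLTP store and well on a columnar warehouse.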
Mutable vs Immutable
Data is immutable within BigQuery, meaning an uploaded object cannot change throughout its storage lifetime - once written, the data cannot be deleted or altered for a pre-determined length of time. If an existing record needs to be modified, the whole partition needs to be rewritten, so updates are slow and costly. This makes the system ideal for write-once scenarios such as event sourcing and time-series data. BigQuery is append-only, which is inherently efficient; to limit the volume of stored data, BigQuery will automatically drop partitions older than the preconfigured time to live.
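The append-only-with-TTL pattern can be sketched as follows. This is a toy model, not BigQuery's implementation: records land in daily partitions, individual rows are never edited, and expiry drops whole partitions at once:

```python
from datetime import date, timedelta

class AppendOnlyPartitions:
    """Toy append-only store: records accumulate in daily partitions,
    and partitions older than a TTL are dropped wholesale."""

    def __init__(self, ttl_days):
        self.ttl = timedelta(days=ttl_days)
        self.partitions = {}   # date -> list of records

    def append(self, day, record):
        # Writes only ever add; existing records are never modified.
        self.partitions.setdefault(day, []).append(record)

    def expire(self, today):
        # Entire partitions are dropped; no row-level deletes.
        cutoff = today - self.ttl
        self.partitions = {
            d: recs for d, recs in self.partitions.items() if d >= cutoff
        }

store = AppendOnlyPartitions(ttl_days=30)
store.append(date(2024, 1, 1), {"event": "old"})
store.append(date(2024, 2, 1), {"event": "recent"})
store.expire(today=date(2024, 2, 5))
remaining = sorted(store.partitions)
# remaining == [date(2024, 2, 1)]
```

Dropping a whole partition is cheap precisely because nothing inside it was ever mutated - the same reasoning behind BigQuery's partition expiration.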
Bigtable stores data in scalable tables, each of which is a sorted key/value map indexed by row key, column key, and timestamp - hence the mutability and fast key-based lookup. Each row typically describes a single entity, and its columns contain the individual values for that row. Reads and writes to a row are atomic, regardless of how many different columns are read or written within that row.
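The (row key, column key, timestamp) indexing above is what makes Bigtable's mutability cheap: a "mutation" is just a new cell at a newer timestamp. A minimal sketch of that idea (a plain dict standing in for the sorted map; function names and the example keys are illustrative, loosely echoing the webtable example from the Bigtable paper):

```python
# Toy model: a map from (row key, column key, timestamp) to a value.
table = {}

def write(row_key, column, timestamp, value):
    # A "mutation" is simply a new versioned cell; nothing is rewritten.
    table[(row_key, column, timestamp)] = value

def read_latest(row_key, column):
    # Cells are versioned by timestamp; return the newest one.
    versions = {
        ts: v for (rk, col, ts), v in table.items()
        if rk == row_key and col == column
    }
    return versions[max(versions)] if versions else None

write("com.example/index.html", "contents:html", 1, "<html>v1</html>")
write("com.example/index.html", "contents:html", 2, "<html>v2</html>")
latest = read_latest("com.example/index.html", "contents:html")
# latest == "<html>v2</html>"
```

Contrast this with BigQuery's model above: here an update is a constant-time write of one new cell, rather than a partition rewrite.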
How Xplenty Can Fuel Your Data Mining
If you're suffering from any kind of data integration bottleneck, Xplenty's automated ETL platform offers a cloud-based, visual, no-code interface that makes data integration and transformation less of a hassle. Check out Xplenty's hundreds of out-of-the-box integrations here, and if you have any questions, schedule a call with our team to learn how Xplenty can solve your unique ETL challenges.