Comparing Hive with HBase is like comparing Google with Facebook — although they compete over the same turf (our private information), they don't provide the same functionality. And, like Google and Facebook, plenty of people use both Hive and HBase.
But things can get confusing for Big Data beginners trying to understand the differences between Hive and HBase and their use cases.
Here's the big spoiler: they're both fantastic, and you can use them together! In fact, Facebook has run both Hive and HBase to give you access to all of those profiles at lightning speed. There is some overlap in their functions, but each has unique use cases where it shines.
Today, we're going to demystify Apache Hive and Apache HBase — two incredible Apache big data tools.
What Is Apache Hive?
Apache Hive is a data warehouse system built on top of Hadoop. It provides data summarization, analysis, and querying for large pools of unstructured Hadoop data. You can query data stored in Apache HDFS, or even data stored in Apache HBase, and Hive executes those queries via MapReduce, Spark, or Tez.
Apache Hive uses an SQL-like language called HiveQL (or HQL), and it compiles queries into batch jobs such as MapReduce jobs. Hive also supports ACID transactions through INSERT/DELETE/UPDATE/MERGE statements. As of version 3.0, Hive extended this support by relaxing table schema constraints and giving access to vectorized queries.
In a nutshell, Apache Hive provides SQL features for Spark/Hadoop data (MapReduce's Java API isn't exactly easy to work with), and it acts as both a data warehouse system and an ETL tool with rich integrations and tons of user-friendly features. Like many similar offerings (e.g., Apache Pig), Hive can technically handle many different functions. For example, instead of writing lengthy Java for a MapReduce job, Hive lets you use SQL. Your reason for utilizing Hive in your stack will be unique to your needs.
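To make the "SQL instead of MapReduce code" point concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps that a simple GROUP BY count boils down to. It does not use Hive or Hadoop at all; the `PAGE_VIEWS` data, the `page_views` table name in the comment, and every function name are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical page-view log: (user, page) pairs.
PAGE_VIEWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
    ("bob", "/pricing"),
    ("alice", "/docs"),
]


def map_phase(records):
    """Emit one (page, 1) pair per record, like a MapReduce mapper."""
    for _user, page in records:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, like a MapReduce reducer."""
    return {key: sum(values) for key, values in grouped.items()}


def count_views_per_page(records):
    # Roughly what "SELECT page, COUNT(*) FROM page_views GROUP BY page" asks for.
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_views_per_page(PAGE_VIEWS))
```

With Hive, the three phases above are what you would not have to write by hand; the one-line query in the comment expresses the same intent.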
What is Apache HBase?
Apache HBase is a NoSQL key/value store that runs on top of HDFS or Alluxio. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. So, you have random access capabilities, something that's missing from HDFS. Since HDFS isn't built to handle real-time analytics with random read/write operations, HBase brings a ton of functionality to HDFS. You can set it up as a data store for real-time data being processed via Hadoop, and you can integrate it with MapReduce. Even better, you can integrate it with Hive and MapReduce to gain SQL functions.
HBase is partitioned into tables, and tables are further split into column families. Column families, which must be declared in the schema, group together a certain set of columns (columns don't require schema definition). For example, the "message" column family may include the columns: "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of a row key, column family, column, and timestamp. A row in HBase is a grouping of key/value mappings identified by the row key. HBase builds on Hadoop's infrastructure and scales horizontally.
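The cell layout described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the four-part key, not HBase's storage format or client API; `CellKey`, `CELLS`, the sample row keys, and the helper functions are all made up for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class CellKey(NamedTuple):
    """The four coordinates that identify one cell, as described above."""
    row_key: str
    column_family: str
    column: str
    timestamp: int


# Hypothetical cells for two stored messages; values are plain strings here.
CELLS = {
    CellKey("msg-001", "message", "to", 1): "bob@example.com",
    CellKey("msg-001", "message", "from", 1): "alice@example.com",
    CellKey("msg-001", "message", "subject", 1): "Lunch?",
    CellKey("msg-001", "message", "subject", 2): "Lunch tomorrow?",
    CellKey("msg-002", "message", "to", 1): "carol@example.com",
}


def group_into_rows(cells):
    """Group cells by row key; a row is just the set of cells sharing that key."""
    rows = defaultdict(dict)
    for key, value in cells.items():
        rows[key.row_key][(key.column_family, key.column, key.timestamp)] = value
    return dict(rows)


def latest_value(cells, row_key, column_family, column):
    """Return the value with the highest timestamp for one column, or None."""
    versions = [
        (key.timestamp, value)
        for key, value in cells.items()
        if (key.row_key, key.column_family, key.column) == (row_key, column_family, column)
    ]
    return max(versions)[1] if versions else None
```

Note that the "subject" column appears twice for the same row with different timestamps: both versions are separate cells, and a read normally returns the newest one.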
In a nutshell, HBase can store or process Hadoop data with near real-time read/write needs. This includes both structured and unstructured data, though HBase shines at the latter. HBase is low-latency and can be accessed via shell commands, Java APIs, Thrift, or REST. HBase is often used as a storage layer in Hadoop clusters, and massive brands like Adobe leverage HBase for all of their Hadoop storage needs.
Hive Core Features
Hive can help the SQL savvy query data in various data stores that integrate with Hadoop. Since it's JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries can take a while since they go over all of the data in the table by default. However, the amount of data read can be limited via Hive's partitioning feature. Partitioning lets a filter query run over data that is stored in separate folders and read only the data that matches the query. It could be used, for example, to process only files created between certain dates, if the files include the date as part of their name.
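As a rough illustration of why partitioning cuts down the data a query touches, the sketch below picks files by looking only at folder names. It is plain Python, not Hive itself: the `dt=YYYY-MM-DD` folder naming follows the `column=value` convention Hive uses for partition directories, but the `PARTITIONS` listing, the file names, and both functions are hypothetical.

```python
# Hypothetical partition directories, one per day, in a date-partitioned layout.
PARTITIONS = {
    "logs/dt=2019-05-01": ["part-0000", "part-0001"],
    "logs/dt=2019-05-02": ["part-0000"],
    "logs/dt=2019-05-03": ["part-0000", "part-0001", "part-0002"],
    "logs/dt=2019-05-04": ["part-0000"],
}


def partition_value(path, column="dt"):
    """Extract the partition value from a 'column=value' path segment."""
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == column:
            return value
    return None


def files_to_read(partitions, start, end):
    """Return only the files whose partition date falls in [start, end]."""
    selected = []
    for path in sorted(partitions):
        value = partition_value(path)
        # ISO dates compare correctly as strings, so no parsing is needed.
        if value is not None and start <= value <= end:
            selected.extend(f"{path}/{name}" for name in partitions[path])
    return selected
```

A filter on the date range 2019-05-02 to 2019-05-03 would select four of the seven files here, and the other folders are never opened. That skipping is the whole benefit: the filter is answered from the directory structure before any data is read.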
Here are a few of Hive's features:
- It uses SQL!
- Fantastic Apache Spark and Tez Integration
- You can play with User Defined Functions (UDF)
- It has great ACID tables with Hive 3+
- You can query huge Hadoop datasets
- Plenty of integrations (e.g., BI tools, Pig, Spark, HBase, etc.)
- Other Hive-based projects like Hivemall can provide some additional unique functions.
HBase Core Features
HBase stores data as key/value pairs in a model based on Google's Bigtable. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns, or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be deleted every now and then to clear space via HBase compactions). Although HBase includes tables, a schema is required only for tables and column families, not for columns. It also includes increment/counter functionality.
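The four operations are easier to picture with a toy version. The sketch below is a tiny in-memory table in plain Python, under some loud assumptions: `TinyTable` is a hypothetical class, not the HBase client API; columns are written as "family:column" strings; timestamps are a simple counter; and delete removes data immediately, whereas the text above notes that HBase clears old data during compactions.

```python
class TinyTable:
    """In-memory stand-in for one table: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = {}
        self._clock = 0  # logical timestamp; a real store would use wall-clock time

    def put(self, row_key, column, value):
        """Add or update a cell by writing a new version."""
        self._clock += 1
        versions = self._rows.setdefault(row_key, {}).setdefault(column, {})
        versions[self._clock] = value
        return self._clock

    def get(self, row_key, column=None, max_versions=1):
        """Return the newest version(s) of each requested column in one row."""
        row = self._rows.get(row_key, {})
        columns = [column] if column is not None else sorted(row)
        result = {}
        for name in columns:
            versions = row.get(name, {})
            newest_first = sorted(versions.items(), reverse=True)[:max_versions]
            if newest_first:
                result[name] = [value for _ts, value in newest_first]
        return result

    def scan(self, start_row, stop_row):
        """Yield (row_key, latest cells) for start_row <= row_key < stop_row, in key order."""
        for row_key in sorted(self._rows):
            if start_row <= row_key < stop_row:
                yield row_key, self.get(row_key)

    def delete(self, row_key, column=None):
        """Remove a whole row, or a single column when one is given."""
        if column is None:
            self._rows.pop(row_key, None)
        elif row_key in self._rows:
            self._rows[row_key].pop(column, None)
            if not self._rows[row_key]:
                del self._rows[row_key]
```

Writing the same cell twice with `put` keeps both values, so `get(..., max_versions=2)` returns the newest value first followed by the previous one. Because `scan` walks row keys in sorted order, choosing row keys that sort well matters a great deal for range reads.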
Here are a few of HBase's features:
- It supports key-value!
- It's a NoSQL database that supports random read/write operations.
- Medium Object (MOB) support
- HBase supports coprocessors. This is incredibly useful for computing over massive amounts of data, and it operates similarly to a MapReduce job with some added benefits.
- Lets you leverage Apache Phoenix!
- You can perform scan operations.
Hive Limitations
To start, Hive has very basic ACID functions. They were added in Hive 0.14, but they don't have the maturity of offerings like MySQL. That being said, there is still ACID support, and it gets significantly better with each release.
Hive queries also typically have high latency. Since Hive runs batch processing on Hadoop, it can take minutes or even hours to get back results for queries. Plus, updating data can be complicated and time-consuming.
Hive isn't the best at queries over small data sets (especially in large volume), and most users tend to lean on traditional RDBMSs for those data sets.
HBase Limitations
HBase queries are written in a custom language that needs to be learned. SQL-like functionality can be achieved via Apache Phoenix, though it comes at the price of maintaining a schema. Furthermore, HBase isn't fully ACID compliant, although it does support certain properties. Last but not least, running HBase requires ZooKeeper, a service for distributed coordination tasks such as maintaining configuration information and naming.
HBase can process small data via coprocessors, but it's still not as useful as an RDBMS for that purpose.
Hive Use Cases
Hive should be used for analytical querying of data collected over a period of time, for instance, to calculate trends or analyze website logs.
We typically see two Hive use cases.
- As an SQL query engine for HDFS: Hive can be a significant source of your SQL queries. You can leverage Hive to tackle Hadoop data lakes and connect them to your BI tools (like Tableau) for visibility.
- As a table storage layer with HBase, Pig, Spark, or Tez: Tons of HDFS tools use Hive as a table storage layer. Technically, this is probably its largest global use case.
HBase Use Cases
HBase is perfect for real-time querying of Big Data. Facebook used it for messaging and real-time analytics (it now uses MyRocks, Facebook's open-source project). Hive should not be used for real-time querying since it could take a while before any results are returned.
HBase is primarily used to store and process unstructured Hadoop data as a lake. You can also use HBase as your warehouse for all Hadoop data, but we primarily see it used for write-heavy operations.
Real-Life Examples of Hive Usage
Over 4,330 companies currently leverage Hive. That is fewer than use HBase, but it's still a lot of brands, especially since most companies are still running SQL stacks.
- Scribd uses Hive for typical data science use cases with Hadoop. This includes machine learning, data mining, and ad-hoc querying for BI tools. Really, Scribd uses Hive as part of their overall Hadoop stack, which is where it most comfortably fits. You can put Hive and HBase on the same cluster for storage, processing, and ad-hoc queries.
- MedHelp uses Hive for their Find a Doctor function. They are processing millions of queries a day on their Hadoop stack, and Hive handles it like a pro.
- Last.fm also uses Hive for ad-hoc queries. Again, this is where Hive shines. If you need ad-hoc queries on Hadoop, turn to Hive.
- HubSpot, hi5, eHarmony, and CNET use Hive for... you guessed it, querying. It's safe to say this is its most commonly discussed use case. Again, we would venture to bet that it is mostly used as a table storage layer. But no one really talks about that.
Real-Life Examples of HBase Usage
Almost all of these cases will be using HBase as their storage and processing tool for Hadoop — which is where it naturally fits.
- Adobe has been running HBase since its launch. Their first node fired up back in 2008, and they currently leverage HBase for their 30 HDFS nodes. They use it for both internal structured data and unstructured external data.
- Flurry runs 50 HDFS nodes with HBase, and it uses HBase for tens of billions of rows.
- HubSpot primarily uses HBase for their customer data storage. They also use Hive to run queries on that HBase data as part of their HDFS stack.
- Twitter uses HBase in their Hadoop stack as well. And it's used for internal data from user searches.
- Streamy switched from SQL to a Hadoop stack with HBase. They claim to be able to process faster than ever before.
- Sematext (who created SPM for HBase) uses an HBase and MapReduce stack. Again, these two work well together (a pairing that is often leveraged via Hive) since they complement each other's pros and cons.
- Well over 10,000 businesses leverage HBase. And most of them are large. In the current tech ecosystem, big brands tend to leverage Hadoop more often, so HBase tends to be in some big stacks (e.g., TCS, Marin Software, Taboola, KEYW Corp, etc.)
It's safe to say that Hive and HBase are both incredible tools. Massive companies like Google, Twitter, Facebook, Adobe, and HubSpot lean on both Hive and HBase in their Hadoop stack, and both tools are extremely common in Hadoop clusters. It's important to understand that these tools can perform some of the same functions. But they are vastly different tools that have mostly unique use cases in the real world.
Hive and HBase are two different Hadoop-based technologies: Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But hey, why not use them both? Just like Google can be used for search and Facebook for social networking, Hive can be used for analytical queries and HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.
Are You Looking for An ETL Tool for Your HDFS Stack?
Are you looking for an ETL tool for your Hadoop cluster? Don't worry! We've got your back.
Xplenty is an easy-to-use, cloud-based ETL tool that has strong native HDFS integrations. Want to learn more about our incredibly simple and effective ETL solution? Contact us!
Originally published: May 10th, 2019.