(Robonaut + Astronaut images by NASA, Public Domain)
Relational databases are here to stay - they are widespread, integrate with other information systems, do a good job of querying structured data, and people know how to use them. Some believe that adding Apache Hadoop to an organization spells death for the good old RDBMS, but that is far from true. Hadoop and relational databases can not only run in parallel, they can also work together.
Why Integrate Hadoop with RDBMS
There are several reasons why integration of Apache Hadoop with a relational database may be necessary:
RDBMS Data Sources
Hadoop is great for batch processing Big Data in different formats from various sources. One of these sources could be a relational database. This means there has to be a way to import structured data into Hadoop.
Integrating Hadoop with Data Warehouses
Most data warehouses are based on relational databases, so aggregated data may need to be exported from Hadoop into a relational database.
Storing Archived Data on Hadoop
Another case is using Hadoop’s HDFS as cheap storage for archived data. This data can be pulled from a relational database from time to time and restored to the database when necessary.
Apache Sqoop
One of the tools that integrates Apache Hadoop with RDBMS is Apache Sqoop. It connects to databases like MySQL, PostgreSQL, or Oracle via JDBC, the standard API for connecting to databases with Java. It also works with non-JDBC tools that allow bulk import or export of data, but with certain limitations (e.g. no support for certain file types).
Sqoop works by running a query against the relational database and writing the resulting rows to files in one of these formats: text, binary, Avro, or SequenceFile. The files can then be stored on Hadoop’s HDFS. Sqoop also works in the opposite direction, exporting those files from HDFS back into a relational database.
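A minimal round trip might look like the following sketch. The connection string, credentials, table names, and HDFS paths are all illustrative placeholders:

```shell
# Import the "orders" table from MySQL into delimited text files on HDFS.
# Host, database, credentials, and paths are hypothetical placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username sqoop_user \
  --password-file /user/sqoop/.password \
  --table orders \
  --as-textfile \
  --target-dir /data/archive/orders

# ...and the opposite direction: push the files back into a database table.
sqoop export \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username sqoop_user \
  --password-file /user/sqoop/.password \
  --table orders_restored \
  --export-dir /data/archive/orders
```

For Avro output, swap --as-textfile for --as-avrodatafile.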
Sqoop Challenges
Sqoop can be really helpful for moving data from an RDBMS to Hadoop and back again, but several issues can crop up along the way.
JDBC Drivers
Sqoop works best with relational databases that support JDBC, which means the right JDBC drivers need to be installed on every machine in the Hadoop cluster.
Data Type Conversion
Different databases support different data types, and these are converted to Java types when exporting with Sqoop - a conversion that may not always run smoothly. For example, delimited text files can have trouble representing the float type as a string, UTF encodings might not convert correctly, and NULL values can cause problems. Custom data types supported by your database might not be recognized by Sqoop at all, in which case you must control the type mapping from the command line or write custom Java code to map the data correctly.
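As a sketch of command-line type mapping (table and column names here are hypothetical), --map-column-java overrides Sqoop's default conversion, and the --null-* options define how NULLs are written to the generated text files:

```shell
# Override the default Java type mapping for two columns and make
# NULL values explicit in the generated text files.
# Table and column names are hypothetical.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --table orders \
  --map-column-java price=String,updated_at=String \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --target-dir /data/orders
```

The '\\N' token follows the Hive convention for representing NULL in text files, so the result stays readable downstream.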
Parallel Processing
Sqoop pulls data from the database in parallel using several map tasks. This speeds things up, but it also means the database may be under strain while Sqoop is working its magic. Fortunately, the number of map tasks can be configured to control how many parallel connections are opened. A split column should also be chosen so that Sqoop can divide ranges of rows evenly across its map tasks; if the column’s values are unevenly distributed, some tasks end up doing most of the work. A numeric primary key usually makes a good split column, and by default Sqoop splits on the table’s primary key.
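For instance, assuming an orders table with a numeric order_id primary key (names are hypothetical), both the degree of parallelism and the split column can be set explicitly:

```shell
# Cap the load on the source database at 4 parallel map tasks and
# split row ranges on the primary key column "order_id".
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --table orders \
  --num-mappers 4 \
  --split-by order_id \
  --target-dir /data/orders
```

Lowering --num-mappers trades import speed for a lighter load on the production database.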
Processing on Hadoop
Once the data has been exported out of the relational database into files on HDFS, how can they be processed on Hadoop? The answer is that Sqoop integrates with HCatalog, a table and storage management service for Hadoop that allows querying the files imported via Sqoop with Hive or Pig.
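As a sketch, an import can target an HCatalog table directly so that Hive or Pig can query it right away (database and table names are hypothetical):

```shell
# Import straight into an HCatalog-managed table instead of plain HDFS files.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --table orders \
  --hcatalog-database analytics \
  --hcatalog-table orders \
  --create-hcatalog-table
```

Once the import finishes, the data is immediately queryable from Hive, for example with SELECT COUNT(*) FROM analytics.orders.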
Integrating relational databases with Hadoop may be necessary to get data flowing in your organization. Although it needs some fine tuning and has a few limitations, Sqoop is the right tool for the database job.