Hadoop Data Integration 101



Last year Cloudera published a blog post on Big Data’s new use cases: transformation, active archive, and exploration. These ideas weren’t new: people were already using Big Data and Hadoop this way.

Although archiving with Hadoop is still possible, it is no longer the best option on the cloud. Amazon S3 and other cloud storage services are better suited to this task because they offer superior storage at cheaper rates. Hadoop integrates with them seamlessly, so data can be processed on the cloud as though it were on local HDFS storage.


Transformation and exploration, however, are still valid use cases, and there is one more that isn’t explicitly mentioned in Cloudera’s post: data integration. Data integration means bringing data from multiple sources into one repository. For example, client data may live in both an organization’s CRM and its ERP, stored under different schemas or with no schema at all. Before any analysis can be performed on the company’s clients, that data needs to be in one place with a unified schema. That’s what a data integration tool is for, and Hadoop can be an excellent one for several reasons.
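To make the CRM and ERP example concrete, here is a minimal sketch in plain Python (it is not Hadoop code). The field names, the sample records, and the helpers from_crm, from_erp, and integrate are all hypothetical, chosen only for illustration. It also assumes that a lowercased email address is a good enough key for matching clients, which real data may not guarantee.

```python
# Hypothetical records: the same client as two systems might store it.
crm_rows = [
    {"ClientName": "Acme Corp", "Email": "OPS@ACME.EXAMPLE", "Phone": "555-0100"},
]
erp_rows = [
    {"customer": "Acme Corp", "contact_email": "ops@acme.example", "balance": "1200.50"},
    {"customer": "Globex", "contact_email": "billing@globex.example", "balance": "0"},
]


def from_crm(row):
    """Map a CRM record onto the unified schema (assumed field names)."""
    return {"name": row["ClientName"], "email": row["Email"].lower(),
            "phone": row.get("Phone"), "balance": None}


def from_erp(row):
    """Map an ERP record onto the same unified schema."""
    return {"name": row["customer"], "email": row["contact_email"].lower(),
            "phone": None, "balance": float(row["balance"])}


def integrate(crm, erp):
    """Merge both sources into one record per client, keyed by email."""
    unified = {}
    for record in [from_crm(r) for r in crm] + [from_erp(r) for r in erp]:
        merged = unified.setdefault(record["email"], dict(record))
        # Fill in any field that an earlier source left empty.
        for field, value in record.items():
            if merged[field] is None and value is not None:
                merged[field] = value
    return list(unified.values())


clients = integrate(crm_rows, erp_rows)
for client in clients:
    print(client)
```

On Hadoop the same idea would typically be expressed as a Hive or Pig job that maps each source to the common schema and then joins or groups on the shared key.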

Hadoop’s Advantages

Schema on Read

Hadoop does not enforce a schema when data is written. Files are stored as they arrive, and a schema is applied only when the data is read. This means Hadoop can handle structured, semi-structured, and even unstructured data, whether or not the source system had a schema.
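The schema-on-read idea can be sketched in a few lines of plain Python (again, not Hadoop code). The sample lines, the (user, amount) schema, and the read_with_schema helper are assumptions made up for this illustration: raw lines are kept exactly as they arrived, and structure is imposed only by the reader.

```python
import json

# Raw lines stored exactly as they arrived; nothing is validated at write time.
raw_lines = [
    '{"user": "alice", "amount": "19.99", "note": "first order"}',  # semi-structured
    'bob,5.00',                                                      # delimited
    'free-form text with no recognizable fields',                    # unstructured
]


def read_with_schema(line):
    """Apply a (user, amount) schema at read time; return None if the line does not fit."""
    try:
        if line.lstrip().startswith("{"):
            doc = json.loads(line)
            return {"user": doc["user"], "amount": float(doc["amount"])}
        user, amount = line.split(",")
        return {"user": user.strip(), "amount": float(amount)}
    except (ValueError, KeyError, TypeError):
        return None


parsed = [read_with_schema(line) for line in raw_lines]
rows = [row for row in parsed if row is not None]
print(rows)
```

A different reader could apply a different schema to the same stored lines later, which is the practical benefit: nothing has to be decided, or thrown away, at load time.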

Data Sources

Data can be pulled into Hadoop from different types of sources, including relational databases, NoSQL databases, and cloud storage. This can be done with tools such as Sqoop, with connectors such as MongoDB’s connector for Hive and Pig, or in some cases with custom code. As mentioned above, cloud storage services already integrate with Hadoop.

Scalability

Hadoop scales horizontally using off-the-shelf servers. If more data sources need to be added and the current hardware can’t handle the load, more machines can be added to the cluster for extra capacity. This is especially easy in the cloud, where instances can be started and stopped as needed with a few clicks.

Price

Because Hadoop clusters consist of commodity machines and scale horizontally, using Hadoop for data integration can be much cheaper than shelling out big bucks for specialized tools. Costs can be reduced even further by running Hadoop on the cloud. There’s no need to keep any bare metal servers around, and instances can be launched only when they are needed to process data and shut down as soon as they’re done. With pay-as-you-go pricing, great elasticity, and no need to hire admins for maintenance, Hadoop-as-a-Service can be very cost-effective.


Hadoop’s Disadvantages

Performance

Hadoop has a performance disadvantage compared with traditional tools. Data integration tools are purpose-built to handle just that: data integration. Hadoop carries some overhead for handling the data, for example, when replicating it across the cluster. As a result, Hadoop can be somewhat slower than dedicated tools.

Ease of Use

Another issue is the steep learning curve for Hadoop and for coding MapReduce jobs in Java. This can be made easier with Hive or Pig, which provide more familiar, SQL-like languages. It can get even easier with Hadoop-as-a-Service solutions on the cloud that provide simple user interfaces to create clusters, design data flows, and run Hadoop jobs without any code at all (hey, that’s what we do!). But if you insist on running your own data integration show, you’ll need to take some time to learn about Hadoop or hire an expert.
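For readers who have never seen the MapReduce model, the following is a rough single-process sketch of its three steps in plain Python. It is not the Hadoop Java API; the input lines and the function names map_phase, shuffle, and reduce_phase are hypothetical and exist only to show what a job has to express.

```python
from collections import defaultdict

# Hypothetical input: one "source,client" record per line.
lines = ["crm,acme", "erp,acme", "crm,globex", "erp,acme"]


def map_phase(line):
    """Emit (key, value) pairs; here, a count of 1 per client name."""
    _source, client = line.split(",")
    yield client, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real Java job the map and reduce steps are separate classes plus driver code, while in Hive the same count is a single GROUP BY query, which is why the higher-level languages flatten the learning curve.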

Small Data

Because Hadoop is meant to handle data in the terabyte and petabyte range, it may not be the right solution to process small data. However, as your organization grows, so will the need to collect more data from different sources about every aspect of the business. If you don’t go for a scalable solution like Hadoop now, you’ll need to switch to Hadoop later. Might as well save yourself the headaches of upgrading the company’s data integration infrastructure and use Hadoop from the beginning.

Summary

One of the use cases for Hadoop is data integration. Hadoop can integrate all types of data and scale easily, and it does so at a low price, especially on the cloud. Although performance may not match that of dedicated tools and there is a learning curve, it’s well worth investing in Hadoop for data integration.

