("This is the life" photo by Mr.Thomas, some rights reserved)
Consider for a moment, if you will, plastic patio furniture. Plastic Fantastic is a global manufacturer with several factories, warehouses, and plenty of stores. One can only imagine the sheer amount of data resulting from sales, production, suppliers, and finances. Everything that happens, from purchase and onward, to these chairs, tables, and cupboards in all corners of the world is measured.
The company’s BI analysts check the figures and look for insights. As their BI developer, you’re in charge of Plastic Fantastic’s ETL:
Extract data from the various operational stores to the staging environment.
Transform the data using an ETL tool, thus building facts and dimensions.
Load the transformed data into the production data warehouse.
There’s not much data in the beginning. The ETL process runs overnight and is completed by dawn. However as the company grows, the data thickens, and the process takes progressively longer. One morning you almost choke on your coffee when you see it’s still not done, and all the while the phone keeps ringing with complaints about the system being stuck and the data is getting stale. Clutching your stress ball, you sit at your desk and think - "How do I improve the ETL performance?"
1. Tackle Bottlenecks
Before anything else, make sure you log metrics such as time, number of records processed, and hardware usage. Check how many resources each part of the process takes and address the heaviest one. Usually it will be second part, building facts and dimensions in the staging environment. Wherever your bottleneck may be, take a deep breath and dive into the code. May the force be with you.
2. Load Data Incrementally
Loading only the changes between the previous and the new data saves a lot of time as compared to a full load. It’s more difficult to implement and maintain, but difficult doesn’t mean impossible, so do consider it. Loading incrementally can definitely improve the ETL performance.
3. Partition Large Tables
If you use relational databases and you want to improve the data processing window, you can partition large tables. That is, cut big tables down to physically smaller ones, probably by date. Each partition has its own indices and the indices tree is more shallow thus allowing for quicker access to the data. It also allows switching data in and out of a table in a quick meta data operation instead of actual insertion or deletion of data records.
4. Cut Out Extraneous Data
It’s important to collect as much data as possible, but not all of it is worthy enough to enter the data warehouse. For instance, images of furniture models are useless to BI analysts. If you want to improve the ETL performance, sit down and define exactly which data should be processed and leave irrelevant rows/columns out. Better to start small and grow as you go as opposed to creating a giant data monster that takes eons to process.
5. Cache the Data
Caching data can greatly speed things up since memory access performs faster than do hard drives. Note that caching is limited by the maximum amount of memory your hardware supports. All that plastic furniture big data might not fit in.
6. Process in Parallel
Instead of processing serially, optimize resources by processing in parallel. Sadly, this is not always possible. Sort and aggregate functions (count, sum, etc.) block processing because they must end before the next task can begin. Even if you can process in parallel, it won’t help if the machine is running on 100% CPU the entire time. You could scale up by upgrading the CPU, but it would scale only to a limit. There’s a much better solution.
7. Use Hadoop
Apache Hadoop is designed for the distributed processing of large data over a cluster of machines. It uses HDFS, a dedicated file system that cuts data into small chunks and optimally spreads them over the cluster. Duplicate copies are kept and the system maintains integrity automatically.
MapReduce is used to process tasks (Hadoop 2 or YARN allows more applications). Each MapReduce job works in 2 stages:
Map - filtering & sorting data - tasks are divided into sub-tasks and processed in parallel by the cluster machines.
Reduce - summary operations - data from the previous stage is combined.
Hadoop is optimized for distributed processing analytics. Sort and aggregate functions execute in parallel on an entire cluster. If you’re CPU’d out, Hadoop lets you scale out rather than up - just add more machines to the cluster and you’re good to go. Running Hadoop-as-a-Service on the cloud makes scaling even easier since you can add more power with a few clicks.
Moving heavy processes to Hadoop could speed up your ETL and make your bosses happier. There are a few limitations though:
Hadoop works best when the data is stored locally on the cluster via HDFS. If you run Hadoop on the cloud, the data can be saved in an object store (e.g. S3 on AWS or Softlayer Object Storage) and accessed when necessary. It may decrease performance for large clusters because of the network communication overhead.
Hadoop doesn’t read relational data directly. Use Sqoop to import/export it into Hadoop.
Running Hadoop on premise requires a large upfront expenditure for implementation and machines. Running SaaS Hadoop in the cloud solves this issue.
You’ll need to train or hire engineers skilled in Java, Linux, and distributed systems.
There are several measures you can take to bring the data processing window up to speed - concentrate on bottlenecks, load data incrementally, partition large tables, process only relevant data, try caching, and use parallel processing. Ultimately, give Hadoop a try since it is designed to solve the problem of processing humongous amounts of data. Soon you’ll be able to lean back in your plastic chair, put your feet up, and enjoy the sound of data crunching.