In “The Monolith”, the fourth episode of Mad Men’s final season, a room-sized computer is installed in the center of the floor. The computer is brought in because a competing ad agency has one, and being an innovation at the time, the client wants it. No matter how creative the agency’s talented creative team got, they could never crunch data and generate sales reports as well as the computer does.
Just as advertising went through major changes in Don Draper’s time, data warehousing is going through changes in our time. Most organizations still use a good old RDBMS as a data warehouse. However, designing it is a serial method where everything needs to be defined in advance: the business questions that require answers, the processes that handle the data, and the architecture. Because the data warehouse is a big relational database, the schema has to be defined before the data can be loaded. This is also known as schema on write. Even the slightest change in data sources or questions would lead you back to the UML drawing board to start the design process all over again.
This fails to overcome the challenges of the age of Big Data.
Big Data Challenges
Today’s data arrives from various sources. Some of these sources, such as application logs, sensor data, and social data, may be semi-structured or completely unstructured. A traditional relational data warehouse would have a hard time dealing with them. Also, with a variety of data that arrives in great velocity, it’s hard to ask all the right business questions in advance - plenty of new questions will pop up during analysis. Having the flexibility to ask new questions and make changes mid-process gives businesses a leg up on those that can’t or won’t. Therefore, today’s data warehouse must handle changing requirements and data in different formats from several sources at great speed.
Enter Hadoop. Since Apache Hadoop is a distributed file storage system that requires no schema, data can be stored immediately without having to define anything - this is better known as schema on read. As new requirements pop up, they can be satisfied without having to redesign the entire system, while new sources of data can be easily integrated. Of course, Hadoop can handle all data types from structured to unstructured, in huge quantities, and at great velocity.
Hadoop + Data Warehouse Use Cases
With the added power of Apache Hadoop, several options are available for enhancing the DWH.
Using this method, the traditional data warehouse is kept for analytical purposes. However, the ETL process of extracting the data from various sources, transforming it, and loading it to the data warehouse are all offloaded to Hadoop. This helps the DWH environment to deal better with Big Data. Furthermore, today there are Hadoop services, such as Xplenty, that run in the cloud and process your data in parallel, helping the DWH to deal with Big Data even better.
Inexpensive Raw Data Archive
Another option is to use Apache Hadoop as a cheap archive for raw data. Because Hadoop can store all types of data, and it scales by adding off-the-shelf servers to the cluster, extra storage space can be added easily and for relatively cheap. Data can be kept there indefinitely and used when necessary.
Operational Data Store (ODS)
This is a different approach to the entire data warehouse design. Here there is an extra entity called the operational data store which lies between the operational systems (the systems that process the day-to-day transactions) and the data warehouse. Raw data from the operational systems is simply copied to the ODS and kept for a certain period of time, usually between a week to a month. The ETL process then runs between the ODS and the data warehouse.
(Public domain image)
The ODS architecture has several advantages:
Better performance - a full ETL executes additional processes beyond copying data, for example, for data cleansing. These processes may put stress on the operational systems’ resources. Therefore, instead of running a full ETL between the operational systems and the data warehouse, performance is improved by extracting the data to the ODS, and then transforming and loading it to the data warehouse.
Testing - the data warehouse only contains crunched data which answers predefined questions. Since all the raw data is available on the ODS, ad-hoc questions can be tested there long before they are added to the data warehouse.
Full ETL - instead of running the ETL process between the ODS and the data warehouse, it could run on the ODS itself, and then copy the processed data to the data warehouse. This further helps to free DWH resources.
The Wave of the Future?
As it turns out, Hadoop is an excellent ODS solution - it’s perfect for the copy process and running ad-hoc queries as batch processes. In fact, with the cheap and scalable storage that is available on Hadoop, data may be kept there until further notice, an opportunity which allows even more in-depth analysis on the data.
Although using Hadoop as an operational data store is a relatively new development, it’s becoming more and more important. Because the ODS is powered by Hadoop, it enables users to run different types of queries such as traditional batch mapreduce queries, interactive SQL on Hadoop, and high performance stream processing with Spark’s in-memory cluster computing, just to name a few. Due to the appeal of asking flexible questions on bigger data sets, some predict that eventually half of the queries will run on the ODS, with the other half running on the traditional warehouse.
The Downside of Hadoop and How to Overcome It
The big challenge with Hadoop is that it requires specialized and complex skills. Many organizations do not have these skills or only a small number of experts within them do. This is exactly the reason we created Xplenty. Xplenty runs Hadoop clusters for you in the cloud. You literally run these clusters by clicking a button - no knowledge of setting up and managing Hadoop is required.
Additionally, Xplenty lets you define the transformations and other tasks that you want to perform on the data to prepare it for the warehouse with a visual drag & drop interface. You do not need any specialized programming knowledge in MapReduce, Hive or Pig.
Don’t be like Mad Men and buy a computer just to keep up with the Joneses. Do, however, make sure you’re set up with a data warehouse that will handle all of your big data needs (Hint, it’s called Hadoop).