In his recent article "Turbocharge Your Porsche - Buy An Elephant", Bill Inmon, "the father of data warehousing", criticizes Cloudera for associating Big Data with the data warehouse, two totally unrelated terms according to him. His old rival, Dr. Ralph Kimball, is taking the opposing view by presenting a webinar with Cloudera about building a data warehouse with Hadoop.
This marks a new round in the fight between these two academic geezers, a decades-long argument over what a data warehouse is and how it should be implemented.
Top-down or Bottom-up?
Inmon and Kimball published two radically different approaches in the 1990s on how an organization should manage its data for reporting and analysis.
Inmon’s approach, also called top-down, is to have "one version of the truth": a single entity, the data warehouse, that holds all the information for the entire enterprise in one place. According to him, the data warehouse should be kept in a relational format and store all of the organization’s atomic data. Only once the data warehouse is fully designed and put in place can small data marts be added for different departments; these query data from the central data warehouse and store it in various dimensions.
Kimball sees this differently. He suggests that an organization should first build small data marts for each department. The data marts should contain facts and dimensions relevant to the business area and store them in a star or snowflake schema. As far as Kimball is concerned, the data warehouse is essentially a union of all the data marts. Accordingly, his version is called bottom-up.
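To make the star schema idea concrete, here is a minimal sketch of a single departmental data mart, using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all made up for illustration and are not taken from Kimball's own material. One fact table sits in the middle and points at the dimension tables around it.

```python
import sqlite3


def build_sales_mart(conn):
    """Create a tiny, hypothetical sales data mart laid out as a star schema."""
    conn.executescript("""
        -- Dimension tables: descriptive attributes to slice the facts by.
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name       TEXT,
            category   TEXT
        );
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY,
            day     TEXT,
            month   TEXT
        );
        -- Fact table: one row per sale, with keys into each dimension.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id    INTEGER REFERENCES dim_date(date_id),
            quantity   INTEGER,
            revenue    REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(1, "2014-01-15", "2014-01"), (2, "2014-02-10", "2014-02")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 10, 100.0), (2, 1, 5, 75.0), (3, 2, 2, 30.0), (1, 2, 4, 40.0)],
    )


def revenue_by_category(conn):
    """Join the fact table to one dimension and aggregate revenue per category."""
    rows = conn.execute("""
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_sales_mart(conn)
print(revenue_by_category(conn))
```

In the bottom-up picture, another department would build a similar mart of its own, and the "data warehouse" would emerge from combining marts that share dimensions such as dim_date.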
Their methodologies have evolved over the years. Inmon’s latest version, DW 2.0, allows room for unstructured data as part of the data warehouse while Kimball talks about eventually integrating the data marts into one data warehouse. In a presentation made by Inmon himself, he disses Kimball for only realizing now what his approach suggested over 20 years ago.
Why does Inmon criticize Cloudera for mixing up data warehouses with Big Data? Because according to him, a data warehouse is a methodology while Big Data is a technology. Therefore, these terms are not in the same category and cannot be compared with one another.
Kimball's data mart approach seems to be more popular outside academia, since companies prefer to start with something small that works rather than spec endlessly only to create a monster. Sometimes there is a data warehouse in place. It’s usually implemented as a relational database which is queried directly and used for online analytical processing (OLAP).
Although Inmon argues that a data warehouse is just an architecture, people use the term on a day-to-day basis to refer to an actual technology (e.g. "Our data warehouse isn’t fresh - the nightly process failed again!"). In that sense, Apache Hadoop could be part of the data warehouse, for example, as cheap data storage or as part of the data processing performed before analysis.
Ironically, Big Data may fulfill the vision that Inmon preaches - a central repository with one version of the truth where structured and unstructured data can all be stored together. Inmon insists on seeing a data warehouse as being as distant from Big Data as a Porsche is from an elephant. But the two are more like ordinary commercial jet planes and the huge Airbus A380 - the airliner with the largest capacity, built to handle today’s busy air travel needs. Also, if you are going to have one central data warehouse with all the information, it is going to have to handle data that comes in high volume, velocity, and variety. But wait a minute, isn’t that the very definition of Big Data? Why yes, it is! In his latest architecture, Inmon himself argues for the need to store a variety of data as part of the data warehouse. If so, why would Inmon protest so harshly against mentioning Big Data and the data warehouse in the same sentence? Could there be another reason?
Something Smells Elephantie
At the end of February, around the time when Inmon published his latest article bashing Cloudera’s so-called mix of Porsches and elephants, Cloudera announced their webinar with Kimball. Could the timing for Inmon’s article and Kimball’s webinar be a coincidence? I don’t think so. This time it’s personal.