A data vault is a relatively new design methodology for data warehouses. Data vaults store raw data as-is without applying business rules. Data transformation happens on-demand, and the results are available for viewing in a department-specific data mart.
While a traditional data warehouse structure relies on extensive data pre-processing, the data vault model takes a more agile approach. This can help tackle difficult use cases, although it’s not always easy to implement.
Data Vault vs. Data Warehouse
In the traditional data warehouse model, data needs to be ready to use before loading. This usually means the following steps:
Extract: A process (such as ETL) that pulls raw data from a source.
Transform: The ETL transforms the raw data, changing it so that it fits within the warehouse schema.
Load: Now that the raw data is in the standardized format, the ETL loads it into the data warehouse.
Usage: When a person or process needs to access the data, they can access the warehouse directly.
There are many advantages to this approach. It guarantees that everything stored in the data warehouse is clean, valid, and in a predictable format.
However, this doesn’t work for every scenario. For example, an organization might have a dozen different data sources, each with unique structures. If the organization wants to unify these sources, it will need to either simplify the data or create an incredibly complex warehouse schema.
A data vault tackles this problem by moving transformation to a later stage in the process. Now, the process looks like this:
Extract: The ETL pulls raw data from the source as normal.
Transform: The ETL performs some basic transformations, such as eliminating corrupt values.
Load: The ETL loads the data to a specific part of the data warehouse, known as the raw data vault.
Application of Rules: The warehouse stores business rules separately from the data. When required, the warehouse can apply these rules to create a new, transformed version of the data. The output sits in another part of the warehouse, known as the business vault.
Data Marts: Users access data through role-specific data marts. These are virtual marts, created by using views of data in the business vault.
So, while the data warehouse only holds one version of data, the data vault model holds multiple versions. This approach has advantages and disadvantages.
Advantages of the Data Vault Approach
As is clear from the steps above, the data vault approach adds a lot of flexibility to the data integration process. Here are some of the other advantages:
Raw Data Retention
One problem with the data warehouse model is that it discards the original data. This isn’t necessarily an issue with planning; there's just no way to view the raw data without going back to the original source. In a data vault, the raw data co-exist with transformed data.
Because of the one-size-fits-all approach, data warehouse designers may end up creating ridiculously complex schemas. This can affect performance. Plus, data vaults are difficult to support in the long term. The model’s primary advantage is a more responsive approach to data transformation, which helps to keep things simple.
Decoupling of Rules and Data
Data vaults store their business rules separately, so they’re easier to update. When the organization wants to make a change to data transformations, they simply apply new business rules.
Department-Specific Data Transformations
Each department within a business will have different data requirements. For instance, sales might need customer data in one format while operations need it in another. This is a challenge with the universal schema of a traditional data warehouse. A data vault, however, allows each department to set up its own business rules.
The data vault strategy works well for big companies with diverse data needs. However, it’s not without drawbacks.
Disadvantages of a Data Vault
A data vault isn’t always the right approach to data warehousing issues. There are potential hurdles that businesses need to consider.
Increased Resource Usage
Data vaults perform a greater deal of post hoc data processing, which can impact overall performance. Most data vaults work around this problem by storing outputs from transformations so users can refer to them at a later date.
Expanded Data Storage
A data vault inherently holds more data than a traditional warehouse because it has multiple versions of everything. Over time, this can amount to a substantial difference in size, which may drive up warehousing costs.
Implementing a data warehouse is usually a straightforward process, requiring very little database expertise. Thanks to no-code ETL, nontechnical users can create a data pipeline with complex transformations. A data vault typically requires much more expertise. Companies setting up a data vault may need a highly skilled engineer to get them off the ground.
Slower Data Mart Results
Critics of the data vault model argue that it simply pushes complexity further down the line since the output from these vaults happens via data marts. Marts sometimes need to pull data from multiple business tables, which can cause increasingly complex SQL JOIN statements.
Data Vault Terminology
Data vault modeling uses some items not found in a traditional warehouse. These are:
- Hubs: Data entities that are relatively stable over time, such as customers and products
- Links: Connections between hubs, such as a record of when a customer purchased a product
- Satellites: Additional attributes of a hub or a link that may change as time goes on
Using this structure gives data vault designers more flexibility in designing their tables.