Data modeling is the process of analyzing data objects and defining the relationships between them. In a SaaS company's database, for example, data objects might be vendors or customers. Put simply, data modeling is a theoretical representation of data objects and how they should be stored in a system. In that context, modeling your data can define what transformations need to be applied to it before it is loaded into your data warehouse.
In the age of big data, businesses can learn more than ever about their customers, identify new product opportunities, and so on. However, for accurate data analysis, analysts must understand the relationships between different pieces of information and how data flows through a system. Both are crucial for aggregating data across disparate tools and platforms. Hence the case for data modeling.
Why You Need ETL Data Modeling
The primary purpose of any data warehouse is to make it easy for analysts to query data and gather insights. In order to achieve that, data should be reliable and accessible.
Data modeling addresses the reliability requirement through a structured representation of information. More specifically, it enforces entity and referential integrity with the help of keys. Entity integrity refers to the reliability of data within an entity; referential integrity refers to the reliability of data between entities. Let's explore that with a simple example of two entities (or tables, as they are called in the world of data): vendors and products.
To maintain data integrity, each vendor and each product within its respective table must have a unique ID. This avoids data duplication. Data modeling achieves that with the help of primary keys. To achieve data reliability between vendors and products, data modeling uses foreign keys: a foreign key is the primary key of one table that appears in another table. In our example, the unique key for a vendor can be used to identify the different products that vendor makes.
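The two kinds of integrity can be sketched with a small in-memory database. The `vendors`/`products` schema below follows the example above; the specific column names are illustrative assumptions.

```python
import sqlite3

# Hypothetical vendors/products schema showing entity integrity
# (primary keys) and referential integrity (foreign keys).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE vendors (
        vendor_id INTEGER PRIMARY KEY,  -- entity integrity: one unique ID per vendor
        name      TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        vendor_id  INTEGER NOT NULL,
        -- referential integrity: every product must point at a real vendor
        FOREIGN KEY (vendor_id) REFERENCES vendors (vendor_id)
    )
""")

conn.execute("INSERT INTO vendors VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO products VALUES (10, 'Widget', 1)")

# A product referencing a nonexistent vendor violates referential integrity.
fk_violation = False
try:
    conn.execute("INSERT INTO products VALUES (11, 'Gadget', 99)")
except sqlite3.IntegrityError:
    fk_violation = True
print(fk_violation)  # → True
```

The database itself rejects the orphaned row, which is exactly the guarantee a well-modeled schema gives downstream analysts.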
Here, it is important to note the difference between 'transformations' of ETL and data modeling. The latter defines the rules for the relationship between tables. ETL applies those rules and checks for anomalies to make sure data is ready for analysis before being loaded into the data warehouse.
How ETL Data Modeling Can Drive Business Insights
The end-goal of a data warehouse is to enhance business intelligence. Simply querying your sales figures year-on-year for a certain product is going to give you very basic information on how your business is doing. However, if you can query sales figures against different demographics, or events, for instance, that is when things start to get interesting.
Doing so requires the aggregation of data, and transactional systems aren't built for aggregation tasks. A transactional system, such as CRM or ERP software, records transactions or events like sales or page visits. Its primary goal is to capture and retain data, not to analyze it.
Enter the world of dimensional modeling. Dimensional modeling uses two key concepts: fact tables, and dimension tables.
- Fact tables store data that needs to be analyzed. This data is quantitative in nature, such as sales, profits, or cost of goods.
- Dimension tables hold information on the various ways that data in fact tables can be analyzed. It provides a descriptive context to facts in a system.
ETL data modeling, with the help of star schemas, builds bridges between facts and dimensions. In a star schema, a central fact table references several dimension tables, and any combination of dimension values should correspond to exactly one record in the fact table. Doing so achieves two goals: it avoids data duplication, and it expedites aggregation and analysis.
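A minimal star-schema sketch makes this concrete: one fact table of quantitative sales, two dimension tables giving the descriptive context, and a query that slices the facts by dimension. All table and column names here are illustrative assumptions, not a prescribed design.

```python
import sqlite3

# Star schema sketch: fact_sales is the fact table; dim_product and
# dim_date are dimension tables keyed from the fact rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key    INTEGER REFERENCES dim_date (date_key),
        amount      REAL    -- the quantitative fact being analyzed
    );
    INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_date    VALUES (20230101, 2023), (20240101, 2024);
    INSERT INTO fact_sales  VALUES (1, 20230101, 100.0),
                                   (2, 20230101, 250.0),
                                   (1, 20240101, 175.0);
""")

# Aggregate the facts sliced by the descriptive dimensions.
result = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(result)  # → [('Hardware', 2023, 100.0), ('Hardware', 2024, 175.0), ('Software', 2023, 250.0)]
```

The same fact table answers "sales by category," "sales by year," or both at once, purely by choosing which dimensions to join on.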
Modern data warehouses, though, are doing away with star schemas in favor of wide tables with many columns, which is explained in the next section.
Best Practices for ETL Data Modeling
Any kind of ETL data modeling should address the following concerns when it comes to data analysis:
- It should be possible to query data with simple joins
- The data structure should ensure uniqueness of data at all times
- A data model should allow granular querying of data by using periodic and accumulated snapshots of fact tables
Two key considerations form the basis for addressing those concerns:
The first is materialization: pre-aggregating data in order to improve query response times. Pre-aggregation stands in contrast to a star schema, which spreads data across several tables. Pre-aggregating data reduces the number of tables and widens them by increasing their column count. This, in turn, makes it easier for end users to write queries, which addresses the requirement for simple joins.
In the real world, there is often a compromise between materialization and leaving some data unaggregated. As a general rule, any expensive joins in a system (in terms of response time) are good candidates for materialization.
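The trade-off can be sketched in a few lines: an expensive join is run once at load time and its result stored as a wide, join-free table. The `orders`/`customers` tables and the `sales_by_region` name below are illustrative assumptions.

```python
import sqlite3

# Materialization sketch: pre-compute an expensive join/aggregation
# into a single wide table so end users can query it without joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders    VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
""")

# Run the join once and persist the aggregated result.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT c.region, SUM(o.amount) AS total_sales, COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

# End users now query the wide table with no joins at all.
rows = conn.execute(
    "SELECT region, total_sales FROM sales_by_region ORDER BY region"
).fetchall()
print(rows)  # → [('APAC', 20.0), ('EMEA', 80.0)]
```

The cost is that `sales_by_region` must be refreshed whenever the underlying tables change, which is why only the genuinely expensive joins are worth materializing.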
The second is grain. The grain of a table or a view is what a single row in that table or view represents. For instance, in a table named customers, the grain can be a single customer: each customer gets its own row, and there is exactly one row per customer. That is a very simple example of grain in a data system.
Grains can get more complex, though. Imagine a table named subscription_status where there is exactly one row per subscriber per status. Different statuses could be renewed, about to expire, and expired. In this case, each subscriber has multiple rows, one reflecting each of their subscription statuses.
Grains of a system decide how easy or difficult it is to combine tables and solve data analysis problems. For effective data modeling, it is important to first determine the grain of a table or a view and give it a clear, distinct name.
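Once the grain is decided, it can be declared in the schema itself. The sketch below enforces the subscription_status grain from the example above with a uniqueness constraint; the column names and status values are assumptions.

```python
import sqlite3

# Grain sketch: subscription_status has a grain of one row per
# subscriber per status, declared via a UNIQUE constraint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subscription_status (
        subscriber_id INTEGER,
        status        TEXT,   -- e.g. 'renewed', 'about_to_expire', 'expired'
        changed_at    TEXT,
        UNIQUE (subscriber_id, status)  -- the grain, made explicit
    )
""")
conn.executemany(
    "INSERT INTO subscription_status VALUES (?, ?, ?)",
    [(1, "renewed", "2024-01-01"),
     (1, "expired", "2024-12-31"),
     (2, "about_to_expire", "2024-11-01")],
)

# A second row at the same grain (same subscriber, same status) is rejected,
# so the table can never silently drift to a finer or duplicated grain.
grain_violation = False
try:
    conn.execute(
        "INSERT INTO subscription_status VALUES (1, 'renewed', '2024-02-01')")
except sqlite3.IntegrityError:
    grain_violation = True
print(grain_violation)  # → True
```

Declaring the grain this way means a mistaken load fails loudly instead of quietly corrupting downstream aggregations.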
Challenges in ETL Data Modeling--and the Solution
Conventionally, data modeling has dealt with building relational databases around structured data. However, the rise of big data has produced large amounts of unstructured data that needs to be analyzed. This includes all sorts of machine data, such as data from IoT devices. Furthermore, modern data models have to contend with data repositories distributed across public cloud, private cloud, and on-premises architecture. In essence, datasets are moving faster and getting more complex.
Data modeling algorithms are an effective way to tackle the challenge. Through the use of machine learning, statistical algorithms, and artificial intelligence, the process of data modeling can be automated. It does away with the time-consuming task of defining relations and data structures. Instead, users can straightaway focus on the business end of things--using the data to derive insights.
At Xplenty, we make complex data preparation really simple. Learn more about how our low-code ETL platform helps you get started with data analysis in minutes by scheduling a demo and experiencing Xplenty for yourself.