The Ultimate Guide to Data Warehouse Design

Data warehouses help you run logical queries, build accurate forecasting models, and identify impactful trends throughout your organization.

But, what goes into designing a data warehouse? Whether you choose to utilize a pre-built vendor solution or you're starting from scratch — you'll need some level of warehouse design to successfully adopt a new data warehouse.

What is a Data Warehouse?

A data warehouse is a dumping ground for data from various systems (e.g., sales stack, marketing stack, CRM, etc.) that uses online analytic processing (OLAP) to query that data for better business insights.

So, let's say that you're looking to figure out the overall value of your leads in Salesforce. You could push your Salesforce data into your data warehouse, set up a schema, and run a query that would tell you which of your marketing activities led to your highest-value prospects.

8 Steps to Designing a Data Warehouse

Let's talk about the 8 core steps that go into building a data warehouse.

1. Defining Business Requirements (or Requirements Gathering)

Designing a data warehouse is a business-wide journey. Data warehouses touch all areas of your business, so every department needs to be on-board with the design. Since your warehouse is only as powerful as the data contained within it, aligning department needs and goals with the overall project is critical to your success.

So, if you can't combine all of your sales data with your marketing data, your overall query results are going to be missing some critical components. Knowing which leads are valuable is hinged to marketing data.

Every department needs to understand the purpose of the data warehouse, how it will benefit them, and what kinds of results they can expect from your warehousing solution.

This Requirements Gathering stage should focus on the following objectives.

  • Aligning department goals with the overall project
  • Determining the scope of the project in relation to business objectives
  • Discovering your future needs and current needs by diving deep into your data (find out what data will be useful for analysis) and your current tech stack (where your data is currently siloed / not being put to use?)
  • Creating a disaster recovery plan in the case of system failure
  • Thinking about each layer of security (e.g., threat detection, threat mitigation, identity controls, monitoring, risk reduction, etc.)
  • Anticipating compliance needs and mitigating regulatory risks

You can think of this as your overall data warehouse blueprint. But, really, this phase is more about determining your business needs, aligning those to your data warehouse, and, most importantly, getting everyone on-board with the data warehousing solution.

2. Setting Up Your Physical Environments

Data warehouses typically have three primary physical environments — development, testing, and production. This mimics standard software development best practices, and your three environments will exist on completely separate physical servers.

Why do you need three separate environments?

  • You need a way to test changes before they move into the production environment.
  • Some security best practices require that testers and developers never have access to production data.
  • Running tests against data typically uses extreme data sets or random sets of data from the production environment — and you need a unique server to execute these tests en masse.
  • Having a development environment is a necessity, and dev environments exist in a unique state of flux compared to production or test environments.
  • Production environments will have much higher workloads (your whole business is using it), so trying to run tests or develop in that environment can be stressful for both team members and servers.
  • Data integrity is much easier to track, and issues are easier to contain when you have three environments running. It makes headhunting issues less stressful on your workloads, and data flow in production and testing environments can be stalled without impacting end users.
  • Running tests can often introduce breakpoints and hang your entire server. That's definitely not something you want happening in your production environment.
  • Imagine sharing resources between production, testing, and development. That's not something that you want! Testing, development, and production environments all have different resource needs, and trying to combine all functions into one server can be catastrophic for performance.

Remember, BI development is an ongoing process that really never grinds to a halt. This is especially true in Agile/DevOps approaches to the software development lifecycle, which all require separate environments due to the sheer magnitude of constant changes and adaptations.

You can choose to run more than these three environments, and some businesses choose to add additional environments for specific business needs. We've seen staging environments that are separate from testing solely for Quality Assurance work. We've also seen Demo environments and even Integration environments specifically for testing integrations.

You should absolutely have the core three environments, but you can layer in additional environments to fit into your unique business goals.

3. Introducing Data Modeling

Data modeling is the process of visualizing data distribution in your warehouse. Think of it as a blueprint. Before you start building a house, you want to know what goes where and why it goes there. That's what data modeling is to data warehouses.

Data modeling helps you visualize the relationships between data, and it's useful for setting standardized naming conventions, creating relationships between data sets, and establishing compliance and security processes that align with your overarching IT goals.

Data modeling is probably the most complex phase of data warehouse design. And, there are plenty of data modeling techniques that businesses use for warehouse design.

Before we jump into a few of the most popular data modeling techniques, let's discuss the differences between data warehouses and data marts.

A data warehouse is a system that you store data in (or push data into) to run analytics and queries. A data mart is an area within a data warehouse that stores data for a specific business function. (I put this in to give you an SEO anchor point for the data mart vs data warehouse post you're creating.)

So, let's say that you build your entire data warehouse. That's great! But, your sales team is going to be using that data warehouse in a vastly different way than your legal team. And, certain workflows and data sets are only valuable to certain teams. Data marts are where all of those team-specific data sets are stored, and queries are processed.

Data modeling typically takes place at the data mart level and branches out into your data warehouse. It's the logic of how you're storing data in relation to other data.

The three most popular data models for warehouses are:

  1. Snowflake Schema
  2. Star Schema
  3. Galaxy Schema

You should choose and develop a data model to guide your overall data architecture within your warehouse. The model that you choose will impact the structure of your data warehouse and data marts — which impact the ways that you utilize ETL tools and run queries on that data.

4. Choosing Your Extract, Transfer, Load (ETL) Solution

ETL or Extract, Transfer, Load is the process you'll use to pull data out of your current tech stack or existing storage solutions and put it into your warehouse. You should pay careful attention to the ETL solution that you use.

Since, ETL is responsible for the bulk of the in-between work, choosing a subpar or developing a poor ETL process can break your entire warehouse. You want optimal speeds, good visualization, and the ability to build easy, replicable, and consistent data pipelines between all of your existing architecture and your new warehouse.

This is where ETL tools like Xplenty are valuable. Xplenty creates hyper-visualized data pipelines between all of your valuable tech architecture while cleaning and nominalizing that data for compliance and ease-of-use.

Remember, a good ETL process can be the difference between a slow, painful-to-use data warehouse and a simple, functional warehouse that's valuable throughout every layer of your organization.

For most businesses, ETL will be your go-to for pulling data from systems into your warehouse. It's counterpart Extract, Load, Transfer (ELT), will negatively impact the performance of most custom-built warehouses since data is loaded directly into the warehouse before data cleansing and organization occur.

5. Online Analytic Processing (OLAP) Cube

You will likely need to address OLAP cubes if you're designing your entire database from scratch, or if you have to maintain your own OLAP cube — which typically requires specialized personnel.

So, if you plan on using a vendor warehouse solution (e.g., Redshift or BigQuery) you probably won't need to utilize an OLAP cube (cubes are rarely used in either of those solutions*.)

*note: there are some vendor solutions that will let you build OLAP cubes on top of Redshift or BigQuery data marts, but we can't recommend any since we've never used them personally.

If you have a set of BI tools that require you to utilize an OLAP cube for ad-hoc reporting, you may need to develop one or use a vendor solution.

OLAP Cubes vs. Data Warehouse

A data warehouse is where you're storing your business data in an easily analyzable format to be used for a variety of business needs.

Online Analytic Processing Cubes help you analyze the data in your data warehouse or data mart. Most of the time, OLAP cubes are used for reporting, but they have plenty of other use cases.

Since your data warehouse will have data coming in from multiple data pipelines, OLAP cubes help you organize all of that data in a multi-dimensional format that makes analyzing it rapid and straightforward.

You may require custom-built OLAP cubes or you may need to hire support to help you maintain your cubes.

Here are some resources on OLAP cubes that will help you dig deeper.

6. Creating the Front End

So far, we've only covered backend processes. There needs to be front end visualization, so users can immediately understand and apply the results of data queries.

That's the job of your front-end. There are plenty of tools on the market that help with visualization. BI tools like Tableau or PowerBI for those using BigQuery are great for visualization. You can also develop a custom solution — though that's a significant undertaking.

Most small-to-medium-sized businesses lean on established BI kits like those mentioned above. But, some business may need to develop their own BI tools to meet ad-hoc analytic needs. For example, a Sales Ops manager at a large company may need a specific BI tool for territory strategies. This tool may need to be custom developed given the scope of their sales objectives.

You should pay keen attention to reporting during this stage. How often does reporting need to be done? Do you need each person to create their own reports? Questions like these should help guide you to a BI toolkit that fits within your unique requirements.

Big Tip: Keep it simple. Your employees don't care about most of the fancy features or deep complexities. They just want something that works for them and makes their lives easier.

7. Optimizing Queries

Optimizing your queries is a complex process that's hyper-unique to your specific needs. But, there are some general rules-of-thumb to cover.

We heavily recommend the following:

  • Ensure that your production, testing, and development environment have mirrored resources. This will prevent the server from hanging when you push projects from one environment to the next.
  • Try to minimize data retrieval. Don't run SELECT on the whole database if you only need a column of results. Instead, run your SELECT query by targeting specific columns. This is especially important if you're paying for your query power separately.
  • Understand the limitations of your OLAP vendor. BigQuery uses a hybrid SQL language, and RedShift is built on top of a Postgre fork. Knowing the little nuances baked into your vendor can help you maximize workflows and speed up queries.

8. Establishing a Rollout

Once you're ready to launch your warehouse, it's time to start thinking about education, training, and use cases. Most of the time, it will be a week-or-two before your end-users start seeing any functionality from that warehouse (at least at-scale). But, they should be adequately trained before the rollout is completed.

Congratulations! You're ready to design a data warehouse!

Final Thoughts

These are the core components of warehouse design. But, remember, your business may have different steps that aren't included in this list. Every data warehouse is different. This should help you understand some of the base-level requirements and steps towards creating a functional data warehouse that delivers tangible value at every twist and turn of your business.

You might also like our other posts...