Star schemas offer an efficient way to organize information in a data warehouse. Snowflake schemas are a variation of star schemas that normalize the dimension data for more efficient storage. Both schemas improve the speed and simplicity of read queries and complex data analysis, especially when dealing with large data sets that pull information from diverse sources.

Despite their similarities, star and snowflake schemas have key differences that every data scientist and data engineer needs to understand. To answer the question "How are snowflake schemas and star schemas different?", we'll begin with an in-depth discussion of star schemas. Then we'll move on to snowflake schemas and explore what makes them unique.

Table of Contents

What's a Star Schema?
(1) Types of Fact Tables
(2) Types of Dimension Tables
(3) How Fact Tables and Dimension Tables Work Together
(4) Star Schema Diagram
(5) Denormalization of Data in Star Schemas
(6) Benefits of Star Schemas
(7) Challenges of Star Schemas

What’s a Snowflake Schema?
(1) Snowflake Schema Diagram
(2) Benefits of Snowflake Schemas
(3) Challenges of Snowflake Schemas

What's a Star Schema?

Star schemas offer the simplest structure for organizing data in a data warehouse. The center of a star schema consists of one or more “fact tables” that index a series of “dimension tables.” To understand star schemas (and, for that matter, snowflake schemas), it's important to look at fact tables and dimension tables in depth.

The purpose of a star schema is to cull out the numerical "fact" data relating to a business and separate it from the descriptive, or “dimensional,” data. Fact data includes information like price, weight, speed, and quantities, i.e., data in a numerical format. Dimensional data includes descriptive attributes like colors, model names, geographical locations, employee names, and salesperson names, along with any related numerical information.

The fact data is organized into fact tables, and the dimensional data into dimension tables. Fact tables are the points of integration at the center of the star schema in the data warehouse. They allow machine learning tools to analyze the data as a single unit, and they allow other business systems to access the data together. Dimension tables hold and manage the data, numerical and nonnumerical, which converges through the fact tables at the center of the data warehouse.

From a more technical perspective, fact tables keep track of numerical information related to different events. For example, they might include numeric values along with foreign keys that map to additional (descriptive and nonnumerical) information in the dimension tables. Getting even more technical, fact tables maintain a fine level of granularity (or “detail”), which is to say, they record information at an atomic level: typically one row per event. This can lead to the buildup of many records within the fact table over time.
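
To make this concrete, here's a minimal sketch using Python's standard sqlite3 module. The Fact_Sales table and its columns follow the star schema diagram later in this article; the sample row and the data types are illustrative assumptions.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")

# A fact table stores numeric measures (Units_Sold) alongside foreign keys
# (Date_Id, Store_Id, Product_Id) that point at rows in the dimension tables.
conn.execute("""
    CREATE TABLE Fact_Sales (
        Date_Id    INTEGER NOT NULL,  -- FK into Dim_Date
        Store_Id   INTEGER NOT NULL,  -- FK into Dim_Store
        Product_Id INTEGER NOT NULL,  -- FK into Dim_Product
        Units_Sold INTEGER NOT NULL   -- the numeric measure
    )
""")

# Atomic grain: one row per sales event, so records accumulate quickly.
conn.execute("INSERT INTO Fact_Sales VALUES (1, 1, 42, 3)")
```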

(1) Types of Fact Tables

There are three main kinds of fact tables:

  • Transaction fact tables: These record information related to events, like individual merchandise sales.
  • Snapshot fact tables: These record information that applies to specific moments in time, like year-end account statements.
  • Accumulating snapshot tables: These record information related to a running tally of data, like year-to-date sales figures for specific merchandise or categories of merchandise.
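
As a rough illustration in plain Python (all names and values here are made up), the three kinds differ mainly in what a single row represents:

```python
# Transaction fact: one row per individual event (a single merchandise sale).
transaction_row = {"date_id": 20190801, "product_id": 42, "units_sold": 3}

# Snapshot fact: one row per entity at a specific moment (a year-end statement).
snapshot_row = {"account_id": 7, "as_of_date": 20191231, "balance": 1250.00}

# Accumulating snapshot: one row per entity, updated as a running tally.
accumulating_row = {"product_id": 42, "year": 2019, "ytd_units_sold": 8714}

print(transaction_row, snapshot_row, accumulating_row, sep="\n")
```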

(2) Types of Dimension Tables

Dimension tables normally store fewer records than fact tables; however, in addition to any numerical data, the records in dimension tables also include descriptive attributes. There are many types of dimension tables, depending on the information system. Here are some examples:

  • Time dimension tables: Information to identify the exact time, date, month, and year when different events happened.
  • Geography dimension tables: Address/location information.
  • Employee dimension tables: Information about employees and salespeople, such as addresses, phone numbers, names, employee numbers, and email addresses.
  • Merchandise dimension tables: Descriptive information about products, their product numbers, etc.
  • Customer dimension tables: Customer name, numbers, contact information, addresses, etc.
  • Range dimension tables: Information relating to a range of values for time, price and other quantities.
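
Here's a brief sketch of what two of these might look like in practice, again using Python's sqlite3. The Dim_Date and Dim_Store names match the diagram later in this article; the attribute columns are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Time dimension: one row per calendar date, keyed by a surrogate integer Id.
conn.execute("""
    CREATE TABLE Dim_Date (
        Id          INTEGER PRIMARY KEY,  -- surrogate key
        Date        TEXT,                 -- e.g. '2019-08-01'
        Day_Of_Week TEXT,
        Month       INTEGER,
        Year        INTEGER
    )
""")

# Store dimension: descriptive attributes sit beside the surrogate key.
conn.execute("""
    CREATE TABLE Dim_Store (
        Id             INTEGER PRIMARY KEY,  -- surrogate key
        Store_Name     TEXT,
        Location       TEXT,
        Employee_Count INTEGER
    )
""")
```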

(3) How Fact Tables and Dimension Tables Work Together

Dimension tables usually carry a surrogate primary key (typically a single-column integer) that maps to the attributes related to the natural key. Imagine you have a dimension table with information relating to different stores: "Dim_Store" (see the star schema illustration below). You can assign an ID number to each store and its row of related nonnumerical and other information, like store name, size, location, number of employees, and category. Then, wherever that Store ID number appears in the fact table ("Fact_Sales"), it maps to that specific row of store data in the "Dim_Store" dimension table.

Of course, the star schema doesn't stop there: additional points (or dimension tables) hold information that links to the fact table. As an example, let's say you want to know the following for the time period of August 2019:

  • How many products were purchased?
  • What products were purchased?
  • In what stores were the products purchased?
  • What were the names and prices of the products purchased?
  • What brand name manufactured the products purchased?
  • On what day of the week did customers make each purchase?

To conduct a query like this, you'll need to access data contained in all of the dimension tables (Dim_Date, Dim_Store, and Dim_Product). These are separate tables; however, through the fact table, which serves as the point of integration, you can query all of the data as if it were contained in a single table. And that's how a star schema data warehouse works!
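
Here's a runnable sketch of such a query using Python's sqlite3 module. The table and column names follow the star schema diagram in the next section; the extra attribute columns and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Date TEXT, Day_Of_Week TEXT, Month INTEGER, Year INTEGER);
    CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Store_Name TEXT, Location TEXT);
    CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Product_Name TEXT, Brand_Name TEXT);
    CREATE TABLE Fact_Sales  (Date_Id INTEGER, Store_Id INTEGER, Product_Id INTEGER, Units_Sold INTEGER);

    INSERT INTO Dim_Date    VALUES (1, '2019-08-05', 'Monday', 8, 2019);
    INSERT INTO Dim_Store   VALUES (1, 'Downtown', 'Springfield'), (2, 'Riverside', 'Shelbyville');
    INSERT INTO Dim_Product VALUES (42, 'Widget', 'Acme');
    INSERT INTO Fact_Sales  VALUES (1, 1, 42, 3), (1, 2, 42, 5);
""")

# One query joins every dimension back in through the fact table's foreign
# keys, answering what was sold, where, by which brand, and on which weekday.
rows = conn.execute("""
    SELECT d.Day_Of_Week, s.Store_Name, p.Product_Name, p.Brand_Name,
           SUM(f.Units_Sold) AS Units
    FROM Fact_Sales f
    JOIN Dim_Date    d ON f.Date_Id    = d.Id
    JOIN Dim_Store   s ON f.Store_Id   = s.Id
    JOIN Dim_Product p ON f.Product_Id = p.Id
    WHERE d.Month = 8 AND d.Year = 2019
    GROUP BY d.Day_Of_Week, s.Store_Name, p.Product_Name, p.Brand_Name
""").fetchall()

for row in rows:
    print(row)  # e.g. ('Monday', 'Downtown', 'Widget', 'Acme', 3)
```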

(4) Star Schema Diagram

The following diagram illustrates what a simple star schema looks like:

*Image by SqlPac at English Wikipedia, CC BY-SA 3.0.

Here, the fact table, Fact_Sales, is at the center of the diagram. Its schema includes three foreign-key columns, Date_Id, Store_Id, and Product_Id, plus the numeric measure Units_Sold. As the point of integration, the fact table ties together the diverse information in the dimension tables: Dim_Product, Dim_Store, and Dim_Date.

As you can see, the star schema gets its name from having a central fact table “core,” and dimension table “points.” When a star schema has many dimension tables, data engineers might refer to it as a centipede schema.

(5) Denormalization of Data in Star Schemas

The star schema’s goal is to speed up read queries and analysis for massive amounts of data contained in diverse databases with different source schemas. The star schema achieves this goal through the “denormalization” of the data within the network of dimension tables.

Traditionally, database managers sought the “normalization” of data by eliminating duplicate copies of the same data, which is to say, consolidating duplicate information into a single copy. This made write commands faster because only one copy of the data needed updating.

When a data system expands into multiple dimension tables, however, accessing and analyzing data from multiple sources slows down read queries and analysis. To speed things up, the star schema relaxes the traditional rules of database normalization by “denormalizing” the data.

A star schema copies the dimension tables' primary keys into the fact table as foreign keys, and it keeps the descriptive attributes flattened, and often duplicated, within the dimension tables themselves rather than splitting them out into lookup tables. In that way, the fact table connects all of the information sources together. This makes read queries and analysis dramatically faster. However, it sacrifices the speed of write commands: the slower writes happen because the system needs to update every duplicated copy of the “denormalized” data following each change.
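
Here's a small sketch of that trade-off with illustrative data: in a denormalized dimension table, a descriptive value like a brand name is stored once per row, so reads need no extra join, but a single logical change must rewrite every copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized dimension: the brand name is repeated on every product row.
conn.execute(
    "CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Product_Name TEXT, Brand_Name TEXT)"
)
conn.executemany(
    "INSERT INTO Dim_Product VALUES (?, ?, ?)",
    [(1, "Widget", "Acme"), (2, "Gadget", "Acme"), (3, "Gizmo", "Acme")],
)

# Reads are fast: the brand name is right there, no join required.
print(conn.execute("SELECT Product_Name, Brand_Name FROM Dim_Product").fetchall())

# Writes are slower: renaming the brand must touch every duplicated copy.
cur = conn.execute("UPDATE Dim_Product SET Brand_Name = 'Acme Corp' WHERE Brand_Name = 'Acme'")
print(cur.rowcount, "rows rewritten for one logical change")
```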

(6) Benefits of Star Schemas

Star schemas offer the following benefits:

  • Queries are simpler: Because all of the data connects through the fact table, the multiple dimension tables can be treated as one large table of information, and that makes queries simpler and easier to perform.
  • Easier business insights reporting: Star schemas simplify the process of pulling business reports like as-of and period-over-period reports.
  • Better-performing queries: By removing the bottlenecks of a highly normalized schema, query speed increases, and the performance of read-only commands improves.
  • Provides data to OLAP systems: OLAP (Online Analytical Processing) systems can use star schemas to build OLAP cubes.

(7) Challenges of Star Schemas

As mentioned before, improving read queries and analysis in a star schema could involve certain challenges:

  • Decreased data integrity: Because of the denormalized data structure, star schemas do not enforce data integrity very well. Although star schemas use countermeasures to prevent anomalies from developing, a simple insert or update command can still cause data incongruities.
  • Less capable of handling diverse and complex queries: Database designers build and optimize star schemas for specific analytical needs. As denormalized data sets, they work best with a relatively narrow set of simple queries. Comparatively, a normalized schema permits a far wider variety of more complex analytical queries.
  • No Many-to-Many Relationships: Because they offer a simple dimension schema, star schemas don’t work well for “many-to-many data relationships.”

What’s a Snowflake Schema?

Now that you understand how star schemas work, you're ready to explore the snowflake schema, which, true to its name, takes the shape of a snowflake. The purpose of a snowflake schema is to normalize the denormalized data of a star schema. This addresses the write-command slowdowns and other problems typically associated with star schemas.

The snowflake schema is a “multi-dimensional” structure. At its core are fact tables that connect the information found in the dimension tables, which radiate outward just as in the star schema. The difference is that each dimension table in a snowflake schema subdivides into multiple related tables. That creates the snowflake pattern.

Through this “snowflaking” method, the snowflake schema normalizes its dimension tables by (1) moving “low cardinality” attributes (values that repeat many times in the parent table) into separate tables; and (2) splitting each dimension table into more than one table, repeating the process until the dimension tables are completely normalized.

Like snowflake patterns in nature, the snowflake database becomes exceedingly complex. The schema can produce elaborate data relationships, where child tables have more than one parent table.

Vertabelo offers an excellent comparison of snowflake schemas versus star schemas:

Unlike the star schema, dimension tables in the snowflake schema can have their own categories. The ruling idea behind the snowflake schema is that dimension tables are completely normalized. Each dimension table can be described by one or more lookup tables. Each lookup table can be described by one or more additional lookup tables. This is repeated until the model is fully normalized. The process of normalizing star schema dimension tables is called snowflaking.

(1) Snowflake Schema Diagram

Before we go too deep into the snowflaking concept, it's time to look at an illustration of a snowflake schema:

*Image by SqlPac at English Wikipedia, CC BY-SA 3.0.

Do you see how the above illustration took the star schema example and "snowflaked" each dimension table outward? Let's examine the Dim_Product dimension table. Various columns of the Dim_Product table have snowflaked outward into lookup tables.

In the star schema example, Dim_Product included the nonnumerical names of the brands. Now it includes just a Brand_Id number, which points to the Dim_Brand lookup table. By replacing repeated text values with a compact numerical key like this, we reduce the amount of space required to store data. That's because the Dim_Product table no longer includes multiple copies of the full brand names (which are long strings of data compared to the Brand_Id numbers).

Long story short, a number requires dramatically less storage space, and is cheaper to compare and process, than a written name or other qualitative descriptive value. Therefore, snowflaking the dimension tables out into lookup tables can save a lot on storage costs when dealing with millions of rows and columns of data.
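
Here's a minimal sketch of that snowflaking step (Python's sqlite3 again). The Dim_Product, Dim_Brand, and Brand_Id names come from the diagram above; the remaining columns and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table: each brand name is stored exactly once.
    CREATE TABLE Dim_Brand (
        Id         INTEGER PRIMARY KEY,
        Brand_Name TEXT
    );

    -- Snowflaked dimension: a compact integer key replaces the repeated string.
    CREATE TABLE Dim_Product (
        Id           INTEGER PRIMARY KEY,
        Product_Name TEXT,
        Brand_Id     INTEGER REFERENCES Dim_Brand(Id)
    );

    INSERT INTO Dim_Brand   VALUES (1, 'Acme Consolidated Holdings Intl.');
    INSERT INTO Dim_Product VALUES (101, 'Widget', 1), (102, 'Gadget', 1), (103, 'Gizmo', 1);
""")

# The 32-character brand name is stored once instead of once per product,
# which adds up across millions of rows. The cost: one extra join per read.
print(conn.execute("""
    SELECT p.Product_Name, b.Brand_Name
    FROM Dim_Product p JOIN Dim_Brand b ON p.Brand_Id = b.Id
""").fetchall())
```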

(2) Benefits of Snowflake Schemas

Snowflake schemas offer the following benefits compared to normal star schemas:

  • Compatible with many OLAP database modeling tools: Certain OLAP database tools, which data scientists use for data analysis and modeling, are specifically designed to work with snowflake data schemas.
  • Saves on data storage requirements: Normalizing the data that would typically be denormalized in a star schema can offer a tremendous reduction in disk space requirements. Essentially, this is because you're converting long strings of nonnumerical data (the information pertaining to descriptors and names) into numerical keys that are dramatically less taxing from a storage perspective.

(3) Challenges of Snowflake Schemas

There are three potential challenges relating to snowflake schemas:

  • Complex data schemas: As you might imagine, snowflake schemas create many levels of complexity while normalizing the attributes of a star schema. This complexity results in more complicated source query joins. While the snowflake schema offers a more efficient way to store data, it can suffer performance declines when traversing these complex joins. Still, processing technology advancements have resulted in improved snowflake schema query performance in recent years, which is one of the reasons why snowflake schemas are rising in popularity.
  • Slower at processing cube data: In a snowflake schema, the complex joins result in slower cube data processing. The star schema is generally better for cube data processing.
  • Lower data integrity levels: Although snowflake schemas offer greater normalization and fewer risks of data corruption after performing UPDATE and INSERT commands, they do not provide the level of transactional assurance that comes with a traditional, highly normalized database structure. Therefore, when loading data into a snowflake schema, it's vital to be careful and double-check the quality of information post-loading.

Are You Worried About Integrating Data Into a Snowflake Database? Xplenty Can Help

At Xplenty, our automated ETL (extract, transform, load) tools help you seamlessly load data into complex snowflake data warehouses in real time. With our solutions, you don't need a highly skilled data team to perform complex ETL functions. That saves time and labor costs while integrating diverse information into your data warehouse, all while preventing data loss and corruption. Browse our website to learn more about Xplenty and the hundreds of easy, out-of-the-box data integrations our platform provides.