When there are multiple instances of the same data value, we call this "data redundancy." Data values should only appear within a database as many times as necessary. Ideally, each unique value should only appear once.
Data redundancy can lead to wasted resources and slower query times. Worse, it can cause dangerous inconsistencies and even data loss. Database developers work hard to chase redundancy out of their systems with a process known as "normalization."
How Does Data Redundancy Happen?
Redundancy can occur within a database that lacks proper structure. For example, consider a database that contains two tables: one for customer information and one for order information.
When a customer creates an account, the system creates a record in the customer information table. This record includes all of their information, including a home address. Later, the customer orders a product for home delivery. Because the database is poorly designed, the order information table also has its own address column, and the system copies the address there.
Here, we see data redundancy: the same address exists twice. It’s wasteful and inefficient, but it’s also a risk. What happens if the customer logs in and changes their address? The system may only register the change in one table, leaving you with two addresses for the customer, one of which is incorrect.
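The update anomaly described above is easy to reproduce. The following sketch uses an in-memory SQLite database; the table and column names are illustrative, not taken from any real system:

```python
import sqlite3

# Poorly designed schema: both tables store their own copy of the address.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT, address TEXT)")

# The same address value is inserted twice, once per table.
cur.execute("INSERT INTO customers VALUES (1, 'Alice', '12 Main Street')")
cur.execute("INSERT INTO orders VALUES (1, 1, 'Lamp', '12 Main Street')")

# The customer moves, but only the customers table is updated...
cur.execute("UPDATE customers SET address = '99 Oak Avenue' WHERE id = 1")

# ...so the two tables now disagree about where the customer lives.
customer_addr = cur.execute("SELECT address FROM customers WHERE id = 1").fetchone()[0]
order_addr = cur.execute("SELECT address FROM orders WHERE customer_id = 1").fetchone()[0]
print(customer_addr)  # 99 Oak Avenue
print(order_addr)     # 12 Main Street
```

The final two queries return different addresses for the same customer: the inconsistency the article warns about.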
This kind of problem can happen on a larger scale within an organization. For instance, most businesses will have a Customer Relationship Management system and a separate order fulfillment system. Out of necessity, both platforms will have their own databases. Each database may have information about an entity, such as a customer. Businesses have to keep these entities in sync.
What’s the Solution to Data Redundancy in Databases?
Database designers use normalization to avoid any kind of structural redundancy within their databases.
In the previous example, the designer would immediately see that the address value occurs in two tables. They would deal with this problem by adding a new table called "Address" that looks like this:
ID  | Street      | City        | State | ZIP
----|-------------|-------------|-------|------
001 | Main Street | Los Angeles | CA    | 90210
The designer would then remove the address fields from the Customer and Order tables, and replace them with the corresponding ID from the Address table.
What happens if the customer changes their address? The system updates the value in the Address table, which changes what you see when you’re looking at the Customer view and the Order view.
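The normalized design above can be sketched in the same way. In this version the address lives in exactly one table, other tables reference it by ID, and a single update is reflected everywhere the address is joined in (again, names are illustrative):

```python
import sqlite3

# Normalized sketch: one Address table, referenced by ID from the other tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, zip TEXT)")
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address_id INTEGER REFERENCES addresses(id))")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id), product TEXT)")

cur.execute("INSERT INTO addresses VALUES (1, 'Main Street', 'Los Angeles', 'CA', '90210')")
cur.execute("INSERT INTO customers VALUES (1, 'Alice', 1)")
cur.execute("INSERT INTO orders VALUES (1, 1, 'Lamp')")

# One update in the Address table...
cur.execute("UPDATE addresses SET street = 'Oak Avenue' WHERE id = 1")

# ...changes what every joined view of the data shows.
row = cur.execute("""
    SELECT a.street
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    JOIN addresses a ON c.address_id = a.id
    WHERE o.id = 1
""").fetchone()
print(row[0])  # Oak Avenue
```

Because the address is stored once, the update anomaly from the earlier design cannot occur: there is no second copy to fall out of sync.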
Normalization is an iterative process. This means that you perform some basic normalization, then look for further normalization opportunities. A fully normalized database will have zero redundancy.
Database normalization is a fundamental part of database design. Generally, users won’t run into redundancy problems unless they’re working with data from an unstructured source like an Excel spreadsheet. But problems can emerge when working with information on a larger scale.
What’s the Solution to Data Redundancy in Organizations?
Normalization techniques only work on unified databases. So, what happens when your data is scattered across multiple systems?
This is a common problem in enterprises. Some companies fail to recognize it as a problem thanks to the affordability of data lakes. However, if you dump everything into a data lake indiscriminately, you risk turning it into a data swamp.
A better approach is to use data integration to merge and normalize your data. Data integration often relies on a process like ETL to help unify your data. ETL is a three-step process:
- Extract: First, the ETL pulls data from each source in a raw format.
- Transform: Next, the ETL applies transformation techniques to the raw data. This can involve data cleansing and deduplication, which helps to purge any unnecessary records. The ETL will also apply a new schema, which can include full normalization.
- Load: Finally, the ETL sends the transformed data to its destination, which is typically a data warehouse.
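The three steps above can be sketched in plain Python. The source records, field names, and cleansing rules here are invented for illustration; a real pipeline would read from live systems and load into an actual warehouse:

```python
# Two hypothetical source systems, one containing a messy duplicate record.
raw_sources = [
    [{"name": "Alice", "city": " los angeles "}, {"name": "Bob", "city": "Chicago"}],
    [{"name": "Alice", "city": "Los Angeles"}],
]

def extract(sources):
    # Extract: pull raw records from every source into one list.
    return [record for source in sources for record in source]

def transform(records):
    # Transform: cleanse values, then deduplicate the cleaned records.
    cleaned = [{"name": r["name"].strip(), "city": r["city"].strip().title()} for r in records]
    seen, unique = set(), []
    for r in cleaned:
        key = (r["name"], r["city"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def load(records, warehouse):
    # Load: send the transformed records to the destination store.
    warehouse.extend(records)

warehouse = []
load(transform(extract(raw_sources)), warehouse)
print(warehouse)  # the duplicate Alice record has been purged
```

After cleansing, the two Alice records become identical, so deduplication keeps only one of them and the warehouse ends up with two unique records.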
Data warehouses are relational databases, although they’re typically much larger than other databases. Database experts can still apply the same normalization techniques to achieve an even greater level of data consistency, which speeds up queries to the warehouse while reducing overall warehousing costs.
Advantages of Intentional Data Redundancy
Data redundancy is not always negative. In fact, redundancy can actually be an important element of data governance.
For instance, a data backup is technically a form of redundancy, as it contains data that exists elsewhere. But keeping backups is essential for data protection. Ideally, organizations should take full snapshots of their data regularly.
Data redundancy can also support data integrity. For example, having a customer address on multiple systems means you can verify one copy against the others whenever there is a question. System architects might also sometimes decide that it’s worth putting up with redundancy to improve overall system performance.
Intentional data redundancy can be positive, as long as it is an informed decision made with an understanding of the possible consequences. Database designers should always put data integrity ahead of other considerations.