Deduplication is the process of removing duplicate values from data. It helps speed up operations such as backups and other tasks that can produce large-scale repetition of data values.
Why is Deduplication Important?
Data has a monetary cost for its owner. There are storage costs for holding data, and then there are processing costs for querying data. As data volumes expand, data costs increase.
Duplicated data has no value for the owner, yet it still costs money. In some circumstances, duplicated data can start to affect performance by slowing down query results.
Data duplication can happen on a large scale in some processes, such as data backups. For example, consider an organization that has a production system such as a CRM. For this company, an average of 1% of their customer details changes each day, while the other 99% remain the same.
The organization will likely choose to back up the CRM data regularly, perhaps every day. If the size of the export is 1 GB, then the organization will end up with 365 GB of backup data after a year. If 99% of each backup is unchanged from the previous one, then roughly 360 GB of that storage is wasted, because only the first backup and the 1% daily changes are unique.
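The arithmetic behind this example can be checked with a short sketch. The figures are the ones assumed above (a 1 GB daily export, 365 backups, 99% unchanged per day). The further assumption that the first backup must be kept in full is ours.

```python
# Figures taken from the CRM example; all are illustrative assumptions.
EXPORT_GB = 1.0    # size of one daily export
DAYS = 365         # one backup per day for a year
UNCHANGED = 0.99   # share of each backup identical to the previous one

total_gb = EXPORT_GB * DAYS

# The first backup is stored in full; each later backup only adds its changed 1%.
unique_gb = EXPORT_GB + (DAYS - 1) * EXPORT_GB * (1 - UNCHANGED)
redundant_gb = total_gb - unique_gb

print(f"total: {total_gb:.2f} GB")
print(f"unique: {unique_gb:.2f} GB")
print(f"redundant: {redundant_gb:.2f} GB")
```

Under these assumptions, only a few gigabytes of the yearly total are unique data.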
As these redundancies scale up, they result in higher storage costs and slower query results. Deduplication is the process of removing these redundancies so that the organization has reliable backups with as little repetition as possible.
How is Deduplication Performed?
There are several ways of performing data deduplication, depending on the nature of the task.
Within a relational database, individual rows may contain repeating values. These can be removed using a query or a script, as long as they are truly repeating.
Organizations generally perform this kind of deduplication manually, through a stored query, or with a batch file. It can also happen as part of post-processing, which is the cleansing a data owner performs after data has moved from its source to its destination. Query-based deduplication is generally for fine-tuning a database, rather than making large-scale efficiency improvements.
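As an illustration of query-based deduplication, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The `customers` table and its columns are hypothetical. The query treats rows as duplicates only when every column matches, and keeps the first copy of each.

```python
import sqlite3

# Hypothetical table used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers (name, email) VALUES (?, ?)",
    [
        ("Ada", "ada@example.com"),
        ("Ada", "ada@example.com"),      # exact duplicate of the first row
        ("Grace", "grace@example.com"),
        ("Ada", "ada@example.org"),      # same name, different email: not a duplicate
    ],
)

# Keep the row with the lowest rowid in each group of identical rows; delete the rest.
conn.execute(
    """
    DELETE FROM customers
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM customers GROUP BY name, email
    )
    """
)
conn.commit()

rows = conn.execute("SELECT name, email FROM customers ORDER BY rowid").fetchall()
print(rows)
```

Grouping on all columns is what keeps the query from removing rows that only partly match, which is the "truly repeating" condition mentioned above.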
Deduplication is one of the functions of the transformation process in ETL (Extract, Transform, Load). The ETL process holds data in a staging layer after import. The process then compares the staging layer data to other available sources.
If the process detects a duplicate, it will take one of the following actions:
- Removal: The ETL process deletes the duplicate value. It then passes the deduplicated version of the data to the destination repository.
- Tokenization: A token replaces the duplicate value. This token points towards the corresponding value in the existing data. In the example above, this would involve identifying customer records that haven’t changed since the last backup, and then replacing each of those records with a pointer to the appropriate entry in the destination repository.
- Normalization: If a relational database contains a lot of duplicate cell values, it may trigger a normalization process within the ETL platform. Normalization takes several forms, depending on the nature of the redundancy. Generally, it involves restructuring the tables to a more efficient form.
In ETL, this is all rules-based, with rules defined by the user. Users can choose to apply these rules as they see fit, depending on their data requirements.
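The removal and tokenization actions can be sketched as a small rules-based function. This is not how any particular ETL platform implements them. The function name `dedupe_staging`, the record layout, and the `destination:<id>` token format are all hypothetical, and a record counts as a duplicate only if it matches the stored record exactly.

```python
from typing import Dict, List

Record = Dict[str, str]


def dedupe_staging(
    staging: List[Record],
    destination: Dict[str, Record],
    mode: str = "remove",
) -> List[Record]:
    """Return the records to load, applying the chosen rule to duplicates.

    `destination` maps a record id to the record already stored there.
    """
    if mode not in ("remove", "tokenize"):
        raise ValueError(f"unknown mode: {mode}")

    output: List[Record] = []
    for record in staging:
        is_duplicate = destination.get(record["id"]) == record
        if not is_duplicate:
            output.append(record)  # new or changed record: pass it through
        elif mode == "tokenize":
            # Replace the duplicate with a pointer to the stored copy.
            output.append({"id": record["id"], "ref": "destination:" + record["id"]})
        # mode == "remove": the duplicate is dropped
    return output


destination = {
    "1": {"id": "1", "name": "Ada", "city": "London"},
    "2": {"id": "2", "name": "Grace", "city": "New York"},
}
staging = [
    {"id": "1", "name": "Ada", "city": "London"},       # unchanged
    {"id": "2", "name": "Grace", "city": "Arlington"},  # changed
    {"id": "3", "name": "Linus", "city": "Helsinki"},   # new
]

removed = dedupe_staging(staging, destination, mode="remove")
tokenized = dedupe_staging(staging, destination, mode="tokenize")
print(removed)
print(tokenized)
```

The `mode` argument stands in for the user-defined rule: the same staging data yields either a shorter load or a load of the same length with pointers in place of duplicates.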
When working with large exports or with unstructured data, deduplication generally involves a direct comparison of the imported files with the existing files. The deduplication process performs the following steps:
- Break the files down into smaller sections. If the process compares two whole 500 GB files that differ by a single byte, they will show as non-duplicates. Instead, the process must break the data down into more manageable portions.
- Create a signature for each section. The individual sections are not readable as files in themselves, so the process creates a signature (typically a hash, written as a hexadecimal string) based on the values contained in each section.
- Compare signatures between sections. The process checks to see if the signature of the target section exists anywhere within the destination.
- Tokenize if it is a duplicate. If there is a match, the process replaces this duplicate section with a token value. The token points towards the location of the matching file section.
- Write if not a duplicate. If there is no match, then the target section is copied to the destination.
The result of this process is a much smaller file transfer. The process only replicates the unique sections of the import file. The rest of the file contains pointers, which are small notes that tell the destination system where to find the missing sections.
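The steps above can be sketched in a few lines of Python. This is a simplified illustration, not a production design. Fixed-size sections, SHA-256 signatures from the standard hashlib module, and an in-memory dictionary standing in for the destination are all assumptions made for the example. In this sketch every section is referenced by its signature, and only sections with a previously unseen signature are written.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # unrealistically small, so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Break the data into fixed-size sections."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def signature(chunk: bytes) -> str:
    """Create a hexadecimal signature for one section."""
    return hashlib.sha256(chunk).hexdigest()


def deduplicate(data: bytes, store: Dict[str, bytes]) -> Tuple[List[str], int]:
    """Return the list of signatures for `data` and the number of bytes newly written."""
    recipe: List[str] = []
    written = 0
    for chunk in split_chunks(data):
        sig = signature(chunk)
        if sig not in store:       # no match: write the section
            store[sig] = chunk
            written += len(chunk)
        recipe.append(sig)         # the signature acts as the token or pointer
    return recipe, written


def restore(recipe: List[str], store: Dict[str, bytes]) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[sig] for sig in recipe)


store: Dict[str, bytes] = {}
first = b"AAAABBBBCCCCDDDD"
second = b"AAAABBBBXXXXDDDD"   # differs from `first` in one section only

recipe1, written1 = deduplicate(first, store)
recipe2, written2 = deduplicate(second, store)

print(written1, written2)
print(restore(recipe2, store) == second)
```

The second file costs only one section of new storage because the other three signatures already exist in the store. The `restore` step shows the other side of the trade: a file must be reassembled from its pointers before it can be read.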
What are the Potential Issues of Deduplication?
Deduplication processes must be transparent, documented, and carefully planned. If there is an error in deduplication, it could have negative results, such as:
- Accidental deletion: The biggest danger of a poor deduplication process is that you could remove unique data. You can avoid this by using a responsive platform such as an ETL tool to manage the process.
- Lack of redundancy: A classic example of deduplication is email backups. Emails might all have the same footer image, which means that this image only needs to be backed up once, and each email can point to the backup file. However, if anything happens to this one copy of the image, every stored email will be affected.
- Increased overheads: The goal of deduplication is to reduce processing overheads. But there is an overhead to this process too. The organization has to balance that cost with the cost of duplicate data. Overheads can arise if the deduplication process fails to reduce file sizes, or if the processing cost of reconstructing a file is too high.
Most of these issues are avoidable with careful planning and the use of the right tools.