What is Data Obfuscation?

Data obfuscation describes any process that hides sensitive data while retaining certain aspects of usability. The terms "data obfuscation" and "data masking" are often used interchangeably, although data masking can also refer to obfuscation specifically for testing purposes.

Why Use Data Obfuscation?

Organizations often need to hide data from unauthorized access, especially business-critical data or personal information. This can be for data security reasons or because of a compliance requirement related to data protection.

If the sensitive data is not essential for processing, it can simply be removed or nulled. Where a full data set is needed, obfuscation preserves privacy while keeping the data usable. This need might exist for several reasons, such as:

  • Testing: Accurate testing usually requires data that behaves like production data. Data obfuscation produces a database that is representative of real data but contains no sensitive information.
  • Secure transactions: Two systems may need to perform a transaction without exposing data, such as an e-commerce server connecting to a secure payment system. Obfuscation allows this without revealing data such as credit card numbers. 
  • Data exports: When data is moved from one system to another via a manual export-import process, the contents of the data file may be vulnerable. Obfuscation can hide critical data, making it unreadable if the file is intercepted. 

What is the Process of Data Obfuscation? 

There are many ways of obscuring data while preserving functionality. A few of the most common include: 

Data Anonymization

Data anonymization is often used for producing secure, usable test data. There are several different methods of masking data including: 

  • Randomization: Data values are scrambled before being shared. This can be done by anagramming individual values, or by randomly shuffling the values within a column so that no row keeps its original combination of values.
  • Substitution: Dummy values replace real data values. These can be randomly generated or taken from a lookup table. For example, a real credit card number can be replaced by a fake credit card number, which is obtained from a list of non-active credit cards. 
  • Ranged substitutions: Dummy values are used, but these values fall within the range of the actual data values. For example, a list of numbers has a lowest value and a highest value. The dummy values are randomly generated, but they all fall between these two limits.

Anonymized data looks just like real data, and can be used for thorough software testing. However, it does not contain any identifiable information. Ideally, there should be no way to reverse the anonymization process and obtain the original data. 
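
The three methods above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the sample records, and the dummy card numbers are all made up for the example, and a real masking tool would also need to handle data types, uniqueness constraints, and relationships between tables.

```python
import random


def shuffle_column(rows, column, rng):
    """Randomization: shuffle one column's values across all rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def substitute(rows, column, dummy_values, rng):
    """Substitution: replace each value with one drawn from a lookup list."""
    return [{**row, column: rng.choice(dummy_values)} for row in rows]


def ranged_substitute(rows, column, rng):
    """Ranged substitution: random integers between the observed min and max."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    return [{**row, column: rng.randint(low, high)} for row in rows]


# Hypothetical sample records; the card numbers are placeholders, not real cards.
customers = [
    {"name": "Alice", "card": "4000-0000-0000-0001", "balance": 120},
    {"name": "Bob", "card": "4000-0000-0000-0002", "balance": 4500},
    {"name": "Carol", "card": "4000-0000-0000-0003", "balance": 980},
]
dummy_cards = ["0000-0000-0000-0000", "1111-1111-1111-1111"]

rng = random.Random(0)  # seeded only so the sketch is repeatable
shuffled = shuffle_column(customers, "name", rng)
substituted = substitute(customers, "card", dummy_cards, rng)
ranged = ranged_substitute(customers, "balance", rng)
```

Each function returns new rows and leaves the input untouched, so the original data never has to be modified in place.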

Data Tokenization

With tokenization, each data value is linked to a random code, or token. This token has no value in itself, but when it is passed back to the original system, it can be used to perform a lookup.

For example, a database might contain a list of credit card numbers. Each credit card number is linked to a random token in a lookup table. A secure payment API could use the token when interacting with other systems, which means that the credit card number is never exposed.
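
A lookup table of this kind might be sketched as follows. The `TokenVault` class and its method names are hypothetical, chosen for illustration; a production token vault would live in a secured, persistent store rather than in memory.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory lookup table linking tokens to real values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return the token for a value, creating a random one if needed."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(16)  # random; not derived from the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Look up the original value; only the system holding the vault can."""
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4000-0000-0000-0001"  # placeholder, not a real card
token = vault.tokenize(card_number)
```

Other systems would see only `token`. Because the token is random rather than computed from the card number, it reveals nothing without access to the vault.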

Data Encryption

Encrypted data is transformed using an encryption algorithm and can only be restored by someone who holds the key. Without the key, encrypted data is unreadable; it typically appears as a string of seemingly random characters.

Encryption allows sensitive data to travel alongside other data safely. So, a data export might contain some encrypted tables that can’t be accessed until they reach the destination. Once it arrives, the recipient can use the key to restore the original data values. 
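
The encrypt-then-restore round trip can be illustrated with a one-time pad, which XORs the data with a random key of the same length. This scheme is used here only because it fits in the Python standard library and is an assumption made for the example; real systems would normally rely on a vetted algorithm such as AES from an established cryptography library. The function names are hypothetical.

```python
import secrets


def generate_key(length):
    """Create a random key; a one-time pad key must be as long as the data."""
    return secrets.token_bytes(length)


def xor_bytes(data, key):
    """XOR data with the key. The same call both encrypts and decrypts."""
    if len(key) != len(data):
        raise ValueError("key must be the same length as the data")
    return bytes(d ^ k for d, k in zip(data, key))


# Sender side: placeholder value, not a real card number.
plaintext = "4000-0000-0000-0001".encode("utf-8")
key = generate_key(len(plaintext))
ciphertext = xor_bytes(plaintext, key)  # unreadable without the key

# Recipient side: the same key restores the original bytes.
recovered = xor_bytes(ciphertext, key)
```

A one-time pad key must never be reused for a second message, which is one reason practical systems favor standard ciphers with shorter, reusable keys.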

