What is Data Masking?

Data masking is the process of obfuscating sensitive data in a way that, when the data is exported for testing purposes, allows accurate testing without exposing private information.

How is Data Masking Performed?

There are many common data masking techniques, which can be used depending on the nature of the data and the scope of the testing. These methods include:

  • Nulling: Data values are returned as blank or replaced with placeholder characters.
  • Anagramming: The order of the characters or digits is shuffled for each entry. For example, "Laura" and 7189 might be shuffled to "Raalu" and 8917.
  • Substitution: In this approach, each value is replaced with a random selection from a separate database of appropriate substitute values. For example, the administrator may compile a list of non-functional credit card numbers. These can then be swapped in for real credit card numbers during the masking process.
  • Stochastic substitution: This method looks at the range of values in the field and produces a random value within that range. For example, if the values are dates that fall within a six-month period, the masking algorithm will create a set of appropriately distributed random dates within that same six-month period.
  • Encryption: Sensitive data is encrypted when exported. Anyone with the password or key can then decrypt it.
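
The first four techniques above can be sketched in a few lines of Python. This is a minimal illustration rather than a production masking tool: the function names and the list of dummy card numbers are hypothetical, the stochastic example draws dates uniformly (a real tool might try to preserve the original distribution), and encryption is left out because it would need a dedicated cryptography library.

```python
import random
from datetime import date, timedelta

# Hypothetical list of obviously non-functional card numbers.
FAKE_CARDS = ["0000-0000-0000-0001", "0000-0000-0000-0002", "0000-0000-0000-0003"]

def null_out(value: str, placeholder: str = "X") -> str:
    """Nulling: replace every character with a placeholder."""
    return placeholder * len(value)

def anagram(value: str, rng: random.Random) -> str:
    """Anagramming: shuffle the characters of a single entry."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

def substitute(value: str, substitutes: list[str], rng: random.Random) -> str:
    """Substitution: ignore the real value and pick a random stand-in."""
    return rng.choice(substitutes)

def stochastic_date(values: list[date], rng: random.Random) -> date:
    """Stochastic substitution: draw a date between the earliest and latest real dates."""
    start, end = min(values), max(values)
    return start + timedelta(days=rng.randint(0, (end - start).days))

rng = random.Random(0)  # seeded only so the sketch is repeatable
print(null_out("Laura"))  # XXXXX
print(anagram("Laura", rng))
print(substitute("real card number goes here", FAKE_CARDS, rng))
print(stochastic_date([date(2024, 1, 1), date(2024, 6, 30)], rng))
```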

It's up to the database administrator to decide which method to apply, in order to best balance data integrity with data security. Each of these methods can be applied statically or dynamically:

  • Static data masking: Masking rules are applied at source. Because the stored copy itself is masked, sensitive data cannot be exposed from it. However, this data cannot be used for any purpose that requires unmasked data.
  • Dynamic data masking: Masking is applied to outgoing exports according to pre-defined data rules. These rules can be based on factors such as user access level, API call arguments, or anything else that might require additional data security. Different types of masking rules can be applied so that each scenario returns the most appropriate set of data.
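
As a rough sketch of dynamic masking, the snippet below picks masking rules by the caller's access level each time a record is requested. The role names, field names, and rule table are all assumptions made for illustration. Unknown roles fall back to the strictest rule set.

```python
def redact(value):
    """Hide the whole value behind asterisks."""
    return "*" * len(str(value))

def last_four(value):
    """Show only the last four characters; fully redact short values."""
    text = str(value)
    if len(text) <= 4:
        return redact(text)
    return "*" * (len(text) - 4) + text[-4:]

# Hypothetical rule table: role -> {field name -> masking function}.
MASKING_RULES = {
    "admin": {},
    "support": {"card_number": last_four},
    "default": {"card_number": redact, "email": redact},
}

def mask_record(record, role):
    """Return a masked copy of the record; the stored record is left untouched."""
    rules = MASKING_RULES.get(role, MASKING_RULES["default"])
    masked = {}
    for field, value in record.items():
        rule = rules.get(field)
        masked[field] = rule(value) if rule else value
    return masked

record = {"name": "Laura", "email": "l@example.com", "card_number": "1234567890123456"}
print(mask_record(record, "support"))
print(mask_record(record, "tester"))  # unknown role, so the default rules apply
```

Because the rules run on the way out, the same stored record can serve an administrator unmasked and a tester fully redacted.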

Dynamic masking is more flexible and therefore better suited to continuous testing, such as in a DevOps pipeline. However, static masking is faster and ultimately more reliable.

Why is Data Masking Used?

Data is the lifeblood of any organization. It powers applications, enables analytics, and influences strategy. But data also represents real human beings: customers, clients, employees, site visitors, mailing list subscribers, citizens, and other people who interact with an organization.

Masking data allows organizations to balance all of these needs.

Customer Trust

When personal data is exposed, there is a risk that it will fall into the wrong hands. A hacker could intercept the data transmission, or an unethical employee could steal the personal information at their disposal.

Data masking addresses this problem by ensuring that any sensitive data is removed, replaced, or encrypted before it arrives at its destination. This contributes to building customer trust and avoids the reputational damage associated with privacy breaches.

Accurate Testing

Test data doesn't have the depth or range of production data, which is why production data is essential for testing. If developers are denied access to high-quality data, they won't be able to develop or maintain products.

Data masking solves this issue by providing a dataset that closely resembles the real thing, without endangering anyone's privacy. Developers can test their applications in a real-world scenario and identify any issues ahead of release.

Compliance

The EU's General Data Protection Regulation (GDPR) encourages data controllers to pseudonymize personal data, so that it cannot be attributed to an individual without additional information that is kept separately. Data masking is one of the key pseudonymization techniques used by enterprises.

Other territories may introduce similar requirements. If so, then organizations will need to demonstrate that they use data masking where appropriate, or they may face penalties for failing to meet data protection requirements.

Common Data Masking Problems

Data masking needs to be applied with care, or it can have negative consequences, such as:

Testing Integrity

Data masking can impact the accuracy of testing. For example, an organization may have some customers with special characters in their surnames. Using the substitution method, each surname will be replaced by a dummy surname on the substitution table.

If, however, none of the substitute surnames contain a special character, then the test results won't be accurate. Such inconsistencies may not be revealed until after the application goes live.
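
One way to catch this gap before release is to compare the special characters found in the real values with those available in the substitution table. The sketch below is an assumption about how such a check might look, with hypothetical function names and sample surnames. It reads real values, so it would have to run inside the trusted environment.

```python
def special_characters(values):
    """Collect every character that is not a plain ASCII letter."""
    return {
        ch
        for value in values
        for ch in value
        if not (ch.isascii() and ch.isalpha())
    }

def missing_coverage(real_values, substitutes):
    """Special characters present in real data but absent from the substitutes."""
    return special_characters(real_values) - special_characters(substitutes)

real_surnames = ["O'Brien", "Müller", "Smith-Jones"]
dummy_surnames = ["Smith", "Jones", "O'Neil"]

# The dummy table covers the apostrophe but not the hyphen or the umlaut.
print(missing_coverage(real_surnames, dummy_surnames))
```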

Reversibility

If the masking method is too transparent, then it may be possible to extrapolate the original values. For example, if a common first name is anagrammed, then it is usually easy to unscramble. The same applies to simple substitution ciphers, such as changing A to B, B to C, and so on.

Masking is only effective when it is impossible to determine the original values. Most administrators will perform a sense check to see if their masking algorithm is up to standard.
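
To illustrate how little effort a transparent scheme takes to undo, the sketch below reverses both examples: it tries all 26 shifts of a letter-shift cipher, and it matches an anagram against a list of known names. The function names and the name list are hypothetical.

```python
import string

def shift_letters(text: str, shift: int) -> str:
    """Move each ASCII letter `shift` places along the alphabet (A to B, B to C, ...)."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)

def unscramble(masked: str, known_names: list[str]) -> list[str]:
    """Return every known name that uses exactly the same letters as the masked value."""
    key = sorted(masked.lower())
    return [name for name in known_names if sorted(name.lower()) == key]

masked = shift_letters("Laura", 1)
print(masked)  # Mbvsb

# An attacker only needs to try 26 shifts and look for a plausible name.
candidates = [shift_letters(masked, -k) for k in range(26)]
print("Laura" in candidates)  # True

print(unscramble("Raalu", ["Laura", "Maria", "Paula"]))  # ['Laura']
```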

Database Integrity

In some relational database configurations, personal data may act as a primary key. For example, an employee database might use the employee's ID number as a key. If the table is joined to another by the primary key, then masking that key will break the relationship.

The solution is to use a different primary key to preserve the relationship, ideally something that cannot be used to identify the data's subject. Database administrators will generally check the integrity of the database after masking to ensure that all relationships remain intact.
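
A minimal sketch of that approach follows, assuming two hypothetical tables (employees and payroll) held as lists of dictionaries and joined on an employee_id column. Each original key is replaced with a surrogate in both tables, and a final check looks for orphaned rows. The sequential surrogates are a simplification, and the mapping itself would need to be protected or discarded, since it links surrogates back to real IDs.

```python
from itertools import count

def build_key_map(keys):
    """Assign each distinct original key a sequential surrogate, in first-seen order."""
    counter = count(1)
    mapping = {}
    for key in keys:
        if key not in mapping:
            mapping[key] = next(counter)
    return mapping

def remap(rows, column, mapping):
    """Return copies of the rows with the key column replaced by its surrogate."""
    return [{**row, column: mapping[row[column]]} for row in rows]

def orphans(child_rows, parent_rows, column):
    """Child rows whose key no longer matches any parent row."""
    parent_keys = {row[column] for row in parent_rows}
    return [row for row in child_rows if row[column] not in parent_keys]

employees = [
    {"employee_id": "E-1042", "name": "Laura"},
    {"employee_id": "E-2077", "name": "Sam"},
]
payroll = [
    {"employee_id": "E-2077", "amount": 3100},
    {"employee_id": "E-1042", "amount": 2900},
]

key_map = build_key_map(row["employee_id"] for row in employees)
masked_employees = remap(employees, "employee_id", key_map)
masked_payroll = remap(payroll, "employee_id", key_map)

# Integrity check after masking: every payroll row should still find its employee.
print(orphans(masked_payroll, masked_employees, "employee_id"))  # []
```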

Analytics

If the masked data is being used for analytics, then the resulting insights may not be accurate. For example, if a column of dates has been replaced with random date values, then analysts won't get a clear picture of daily activity patterns.
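
The effect is easy to reproduce. In the sketch below, a made-up set of events all fall on Mondays. Replacing them with random dates from the same range keeps the range intact but spreads the events across the week, so the weekly pattern disappears. The sample data and function name are assumptions for illustration.

```python
import random
from collections import Counter
from datetime import date, timedelta

def weekday_counts(dates):
    """Count events per weekday (0 = Monday, 6 = Sunday)."""
    return Counter(d.weekday() for d in dates)

# Hypothetical real data: 200 events, all on one of four consecutive Mondays.
real = [date(2024, 1, 1) + timedelta(weeks=i % 4) for i in range(200)]

rng = random.Random(0)  # seeded only so the sketch is repeatable
start = min(real)
span_days = (max(real) - start).days
masked = [start + timedelta(days=rng.randint(0, span_days)) for _ in real]

print(weekday_counts(real))    # every event falls on weekday 0
print(weekday_counts(masked))  # events scattered across the week
```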

This problem often arises when there is poor communication between the database team and the analytics team. Effective masking requires active dialogue between the two teams to ensure that business needs are being fulfilled while customer data remains safe.