What is Data Quality?

Data quality refers to the current condition of data and whether it is suitable for a specific business purpose. Data quality management is the act of ensuring suitable data quality. It is one of the central pillars of a data governance framework.  

How is Data Quality Measured?

Different contexts require different standards of data quality. For example, when working with production databases, data quality implies a high standard of cleansing, integration, and harmonization. In the context of a data lake, data quality might only refer to the removal of corrupt and blank data values. 
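
As a rough illustration of the lighter data lake standard, the sketch below drops records with blank or unparseable values and does nothing else. The field names (id, reading) and the rule that a reading must parse as a number are assumptions made for this example, not part of any particular platform.

```python
from typing import Any


def is_usable(record: dict[str, Any]) -> bool:
    """Return True if the record has a non-blank id and a numeric reading."""
    record_id = record.get("id")
    reading = record.get("reading")
    if record_id in (None, "") or reading in (None, ""):
        return False  # blank value
    try:
        float(reading)
    except (TypeError, ValueError):
        return False  # corrupt value: cannot be read as a number
    return True


# Hypothetical raw records as they might land in a data lake.
raw = [
    {"id": "1", "reading": "20.5"},
    {"id": "2", "reading": ""},
    {"id": "", "reading": "19.0"},
    {"id": "4", "reading": None},
    {"id": "5", "reading": "N/A#"},
]

clean = [record for record in raw if is_usable(record)]
```

A production database would typically apply further cleansing, integration, and harmonization steps on top of a filter like this.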

Each organization will devise its own framework for data quality policy. This policy will specify details such as:

  • Data purpose: The data quality policy will outline the current use for data, and potential future uses. 
  • Data sources: The policy will detail the nature of any available data repositories and outline the requirements for suitable future repositories. 
  • Data transformation methods: Data quality is impacted by transformation processes, such as cleansing, augmentation, harmonization, and integration. The data quality policy should provide an outline of the acceptable standards for any kind of data transformation.
  • Auditing practices: The data quality policy should outline the preferred methods for auditing the quality of data. As part of data governance, each organization should have a clear process for responding to any data quality issues that arise during an inspection. 

Ensuring data quality can come at the cost of speed and efficiency, which is why organizations may have different policies for different contexts. However, the data quality policy should always fall in line with other elements of the data governance framework, especially those related to data security.

What are the Attributes of Data Quality? 

There is no universally recognized standard of data quality. However, there are some frameworks that can help organizations to develop their own definition of quality. 

DAMA, the Global Data Management Community, outlines some common attributes of data quality:

1. Validity

Data values need to fit within the data schema. At the most basic level, this means that values should match the data type: if the schema specifies an integer, the value must be a whole number, and so on.

There are also logical rules that might not be reflected in the database schema. For example, in a list of customers' dates of birth, all values should be in a date format, such as YYYYMMDD. In addition, a date of birth can't be in the future or the implausibly distant past.
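
The date-of-birth rules above can be sketched as a small validation function. This is a minimal example, assuming the YYYYMMDD format from the text; the 130-year cut-off for "distant past" is a hypothetical value that each organization would set for itself.

```python
from datetime import date, datetime

MAX_AGE_YEARS = 130  # hypothetical cut-off for "distant past"


def is_valid_birth_date(value: str, today: date) -> bool:
    """Check the YYYYMMDD format and the logical rules for a date of birth."""
    # Format rule: exactly eight ASCII digits.
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return False
    # Calendar rule: the digits must form a real date (20230231 does not).
    try:
        born = datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return False
    # Logical rules: not in the future, not implausibly old.
    if born > today:
        return False
    return today.year - born.year <= MAX_AGE_YEARS


reference_day = date(2024, 1, 1)
checks = {
    value: is_valid_birth_date(value, reference_day)
    for value in ["19900115", "20300101", "18000101", "1990-01-15", "20230231"]
}
```

Passing the reference date in as a parameter, rather than calling date.today() inside the function, keeps the check repeatable in tests and audits.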

Definitions of valid data can vary between departments. Someone who only makes local calls might say that phone numbers must be nine digits. Someone who dials internationally may disagree. It's important to clarify these standards across the organization. 

2. Accuracy

All data describes something, whether it's people, products, or other data. Quality data should give an accurate representation of the thing it describes. 

Accuracy can be difficult to establish. For example, if two discrete databases contain conflicting data values, at most one of them can be accurate, and both may be wrong. Resolving this kind of discrepancy is a matter of data governance and understanding the nature of the sources.
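
Before a discrepancy can be resolved, it has to be found. The sketch below compares two sources keyed by a shared identifier and lists every field where they disagree. The source names (crm, billing) and the record layout are assumptions made for illustration; the function only reports conflicts and does not decide which source is accurate.

```python
from typing import Any

Record = dict[str, Any]


def find_conflicts(
    source_a: dict[str, Record], source_b: dict[str, Record]
) -> list[tuple[str, str, Any, Any]]:
    """List (key, field, value_a, value_b) wherever the two sources disagree."""
    conflicts = []
    # Only keys and fields present in both sources can conflict.
    for key in sorted(source_a.keys() & source_b.keys()):
        shared_fields = source_a[key].keys() & source_b[key].keys()
        for field in sorted(shared_fields):
            value_a = source_a[key][field]
            value_b = source_b[key][field]
            if value_a != value_b:
                conflicts.append((key, field, value_a, value_b))
    return conflicts


# Hypothetical customer data held in two separate systems.
crm = {
    "c1": {"email": "ann@example.com", "city": "Leeds"},
    "c2": {"email": "bo@example.com", "city": "York"},
}
billing = {
    "c1": {"email": "ann@example.com", "city": "Bristol"},
    "c3": {"email": "cy@example.com", "city": "Bath"},
}

conflicts = find_conflicts(crm, billing)
```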

The measure of accuracy is how well the data describes something. For example, if a business has an accurate set of customer data, then employees will be able to view the right name, address, contact details, and order history for each person. 

3. Timeliness

Data should reflect the most recent values available. Expired data should be flagged, hidden, or expunged.
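
One simple way to flag and hide expired data is to compare each record's last update against a freshness window. This is a sketch only: the updated_at field and the one-day window are hypothetical, and the right threshold depends on the system in question.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=1)  # hypothetical freshness window


def flag_stale(records, now, max_age=MAX_AGE):
    """Return copies of the records with a boolean 'stale' flag added."""
    return [
        {**record, "stale": (now - record["updated_at"]) > max_age}
        for record in records
    ]


now = datetime(2024, 3, 10, 12, 0)
deliveries = [
    {"id": "d1", "status": "in transit", "updated_at": datetime(2024, 3, 10, 9, 0)},
    {"id": "d2", "status": "in transit", "updated_at": datetime(2024, 3, 7, 9, 0)},
]

flagged = flag_stale(deliveries, now)                   # flag expired records
visible = [r for r in flagged if not r["stale"]]        # hide them from views
```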

Timeliness is especially important in production systems. For example, an ERP system is expected to show the most recent attributes of all resources. If the ERP dashboard shows an old delivery status or shows an employee as available when they're on vacation, then resource planning will fail.

Under regulations such as GDPR, personal data must be accurate and, where necessary, kept up to date. This rule is intended to prevent situations where, for instance, a company sends sensitive mail to a customer's previous address.

4. Completeness

The available data should be whole and comprehensive. Gaps in the data may lead to inaccurate analysis or invalidate the usefulness of other data. 
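
Gaps within a single dataset can be measured directly. The sketch below computes, for each field, the share of records that have a non-blank value. The field names and sample records are hypothetical.

```python
def completeness(records, fields):
    """Return the fraction of records with a non-blank value, per field."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result


# Hypothetical customer records with some missing contact details.
customers = [
    {"name": "Ann", "email": "ann@example.com", "phone": "555-0101"},
    {"name": "Bo", "email": "bo@example.com", "phone": ""},
    {"name": "Cy", "email": None, "phone": "555-0103"},
    {"name": "Di", "email": "di@example.com"},
]

scores = completeness(customers, ["name", "email", "phone"])
```

A check like this only detects missing values inside the records it is given. It cannot reveal that an entire group of records is absent from the dataset.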

It's not always possible to tell if data is complete or partial. For example, a website owner may keep two analytics databases: one for desktop visitors and one for mobile browsers. When viewed in isolation, either of these databases may appear to describe all website visitors. This can result in skewed and inaccurate analytics. 

Relational databases are built on relationships, so excluding any data may impact the functionality of the rest of the database. Data from other sources might be required to provide additional context.  

5. Reliability

Data should come from a reliable source. If there is doubt about the source's accuracy, all data should be tagged appropriately to distinguish it from more reliable data. 
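
Tagging by source can be as simple as attaching a label to each record when it is ingested. In this sketch the set of trusted sources and the tag names are assumptions; in practice they would come from the organization's data governance policy.

```python
TRUSTED_SOURCES = {"crm", "erp"}  # hypothetical list of approved sources


def tag_reliability(record: dict) -> dict:
    """Return a copy of the record labelled by the reliability of its source."""
    tag = "trusted" if record.get("source") in TRUSTED_SOURCES else "unverified"
    return {**record, "reliability": tag}


incoming = [
    {"customer": "c1", "source": "crm"},
    {"customer": "c2", "source": "public_poll"},
    {"customer": "c3"},  # no source recorded at all
]

tagged = [tag_reliability(record) for record in incoming]
```

Records with no recorded source are treated as unverified here, which errs on the side of caution.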

Reliable, in this sense, describes data that has passed through an approved transformation process. For example, data that has passed through an Extract, Transform, Load (ETL) process is generally considered reliable, as it has been processed according to a recognized schema. The ETL output should already be cleansed and harmonized.

Reliability can also relate to the data itself. For instance, a survey conducted among logged-in users may be more reliable than a public poll. When assessing reliability, it's important to know the methodology behind any data acquisition. 

6. Granularity

Data values can be aggregated or summarized when required. Where the business needs detailed data, the data should be suitably granular. 

Data values can often be broken down into finer values. Consider a customer's purchasing history, for example. In some reports, this may be presented as a single value: total lifetime spend. But this can be broken down further, into a list of all invoice totals. The invoices can be broken down into line items, and line items can be broken down into item cost and sales tax. Each of these is a different level of granularity. 
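
The purchasing-history example can be sketched by storing the finest level (line items with cost and tax) and rolling it up on demand. The invoice numbers and amounts below are made up for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Finest granularity: one row per line item, split into cost and sales tax.
line_items = [
    {"invoice": "INV-1", "cost": Decimal("10.00"), "tax": Decimal("2.00")},
    {"invoice": "INV-1", "cost": Decimal("5.00"), "tax": Decimal("1.00")},
    {"invoice": "INV-2", "cost": Decimal("20.00"), "tax": Decimal("4.00")},
]


def invoice_totals(items):
    """Roll line items up to one total per invoice."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at zero
    for item in items:
        totals[item["invoice"]] += item["cost"] + item["tax"]
    return dict(totals)


per_invoice = invoice_totals(line_items)                   # invoice level
lifetime_spend = sum(per_invoice.values(), Decimal("0"))   # single value
```

Aggregating from fine to coarse is straightforward, but the reverse is not possible: a stored lifetime total cannot be broken back down into invoices or line items.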

Highly granular data may not be required in all instances. As with all elements of data quality policy, the deciding factor is the ultimate business purpose of the data.