What is Data Cleansing and Why Does it Matter?

How clean your data is can impact your data warehousing ROI, as well as the accuracy and reliability of your business analytics. So, what is data cleansing, and why does it matter?

Data cleansing means correcting or removing incorrect, inconsistent, or corrupt data. It should be the first step in preparing your data for analysis. Once the data is clean, it is ready for storage and meaningful analysis.

Table of Contents
What's Computer-Assisted Data Cleansing?
Data Cleansing: Why Do It?

Steps:
1. Drop Irrelevant Data
2. Get Rid of Duplicate Data
3. Structural Errors/Discrepancies
4. Take Care of Outliers
5. Drop, Impute, or Flag Missing Data
6. Standardize the Data
7. Validate the Data

What's Computer-Assisted Data Cleansing?

Computer-assisted cleansing means using specialized software to correct errors in your data. The software works by comparing unclean data with accurate data in a reference database. It also checks manually entered data against standardization rules. For example, it would change "california" to "California" when capitalizing the names of states.

Data cleansing software can make most of the necessary changes automatically, such as fixing typos. For this to work, you would need to set an auto-correction threshold score, such as 0.8 or 80%. For example, the tool would automatically change any misspelled product name if it found a correct match that surpassed the 0.8 score.

The data cleansing process can also be interactive, which helps when the software can't find a replacement that satisfies the auto-correction rule. For this to work, you would set a lower auto-suggestion threshold, such as 0.6. Continuing the example above, if the software encountered a misspelled product name and found a match scoring above 0.6 but below 0.8, it would suggest the replacement instead of applying it. You would then review the suggestion and approve or decline it as appropriate.

Computer-assisted data cleansing is much more accurate than a purely manual process. It is also fast enough to handle large volumes of streaming data.
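
To make the two-threshold idea concrete, here is a minimal Python sketch. The product names and the use of the standard library's SequenceMatcher as the similarity measure are illustrative assumptions, not the behavior of any particular cleansing tool.

    from difflib import SequenceMatcher

    AUTO_CORRECT = 0.8   # matches at or above this score are fixed automatically
    AUTO_SUGGEST = 0.6   # matches between the two thresholds are queued for review

    def similarity(a, b):
        """Similarity score between 0 and 1, ignoring case."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def clean_product_name(raw, reference_names):
        """Return (action, value) based on similarity to a reference list."""
        best = max(reference_names, key=lambda ref: similarity(raw, ref))
        score = similarity(raw, best)
        if score >= AUTO_CORRECT:
            return "corrected", best      # confident match: apply silently
        if score >= AUTO_SUGGEST:
            return "suggested", best      # borderline: ask a reviewer
        return "unchanged", raw           # no acceptable match found

    print(clean_product_name("Wireles Mouse", ["Wireless Mouse", "USB Keyboard"]))
    # ('corrected', 'Wireless Mouse')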

Data Cleansing: Why Do It?

According to one survey by Experian, most companies believe 29% of their data is defective. You may have the same problem if you're ingesting tons of data from diverse sources. It's almost certain that some of this data will be streaming in "dirty." For example, social media comments or text on images may not always meet your formatting or accuracy standards.

You may also receive unclean data from a structured source, such as a relational database. An example is when a value in a foreign key column doesn't match the referenced primary key. In this scenario, data cleansing can correct the error.

Consuming raw data "as is" can lead to poor analysis and wrong decisions. And it can hurt a company's bottom line too. According to Forbes, dirty data is costing business organizations up to 12% of total revenue. So, data cleansing is necessary for companies that collect volumes of data in their data warehouses.

Cleaning your enterprise data can fix these major issues:

  • Duplication
  • Irrelevance
  • Inaccuracy
  • Inconsistency
  • Incompleteness (missing data)
  • Outliers
  • Lack of standardization

The direct benefits of data cleansing include:

  • Improved accuracy and reliability of data analytics
  • More efficient machine learning models
  • Better-informed decision making

With that in mind, we recommend these key steps for data cleansing:

1. Drop Irrelevant Data

Identify and get rid of irrelevant data in your database or data warehouse. That would be anything whose processing or analysis has no business value to you. Such data can skew the outcomes of your business analytics.

A good example is building a predictive maintenance model for a specific car brand. In that scenario, you wouldn't need data on all car models to train your ML algorithm. So, you would drop rows containing irrelevant vehicle entries.

You would do something similar if you were studying the general health of the population. Your tables wouldn't need a "phone number" column in that case.

However, be sure the feature you want to remove is not essential to the dataset you're analyzing. Consult with other stakeholders to make sure you don't drop data you might need one day.
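
Once you have agreed on what is irrelevant, removing it is straightforward. Here is a minimal pandas sketch of the car-brand example; the file name, brand name, and column names are hypothetical.

    import pandas as pd

    # Hypothetical maintenance dataset; file and column names are assumptions.
    df = pd.read_csv("maintenance_logs.csv")

    # Keep only the rows for the brand the predictive maintenance model targets.
    df = df[df["brand"] == "AcmeMotors"]

    # Drop columns with no analytical value for this use case.
    df = df.drop(columns=["owner_phone_number"], errors="ignore")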

2. Get Rid of Duplicate Data

Duplicate data uses server or processing resources without adding value. Additionally, duplicate records may skew your insights into your customers. Thus, removing duplicate data from your warehouse is a key part of the data cleansing process.

Record duplication may result from various scenarios. An example is when you're ingesting data from multiple sources. For instance, spreadsheet files may have the same customer information as CRM records. Scraping the same web page twice can cause this problem too.

Consider the case of a customer that submits two different email addresses via two separate forms. There's a high probability of treating this information as belonging to two different customers. Data cleansing would help to spot any such duplication. You would then merge the records or remove one of them.

Any two records with the same unique ID in a database table or file are duplicates. Data cleansing merges such records to eliminate redundancies. Merging rules may differ from organization to organization.  
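
A minimal pandas sketch covering both cases, exact duplicates and records sharing the same unique ID, might look like this; the file and column names are assumptions.

    import pandas as pd

    customers = pd.read_csv("customers.csv")  # hypothetical file

    # Remove rows that are identical in every column.
    customers = customers.drop_duplicates()

    # For records sharing the same unique ID, keep the most recently updated one.
    customers = (customers
                 .sort_values("updated_at")
                 .drop_duplicates(subset="customer_id", keep="last"))

Keeping the latest record is only one possible merging rule; your organization's policy may instead call for combining fields from both rows.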

3. Structural Errors/Discrepancies

There are different types of structural errors, from typos to inconsistent capitalization. These can be a problem in categorical data or grouped datasets, so they need cleansing.

Typos may originate from the different ways in which people type strings. Consider "Gender" as a categorical variable. There should usually be two classes in this case: Male and Female. But you may encounter many more variants of the variable in a dataset. For instance:

  • M
  • Male
  • Fem.
  • F

Data cleansing helps to recognize such mislabeled or inconsistently capitalized classes. The outcome in the "Gender" example would be cleaner classification of grouped datasets (i.e., Male and Female).

Consider using bar plots to spot structural or capitalization errors in your datasets. These graphs show data categories along one axis and their corresponding values or characteristics on the other. By just looking, you may see that some strings mean the same thing despite their different spelling or capitalization. For example, "asphalt" and "Asphalt" may appear separately on a bar plot despite being the same type of roofing material.

One way to fix typos is to manually map each observed value to one of the expected classes. In the "Gender" example, you can map each string to either "Male" or "Female."
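
A minimal pandas sketch of that manual mapping, using the variants listed above as sample data:

    import pandas as pd

    df = pd.DataFrame({"gender": ["M", "Male", "Fem.", "F", "male"]})

    # Inspect the distinct labels first (a quick substitute for a bar plot).
    print(df["gender"].value_counts())

    # Map every observed variant to one of the two expected classes.
    gender_map = {"m": "Male", "male": "Male",
                  "f": "Female", "fem.": "Female", "female": "Female"}
    df["gender"] = df["gender"].str.strip().str.lower().map(gender_map)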

You may also run a fuzzy matching algorithm on strings you suspect contain typos. The tool works out the "edit distance" between each value in the dataset and the expected strings. The edit distance is the number of changes, such as insertions, replacements, or deletions, it takes to transform a source string into the intended one; the smaller the distance, the more similar the two strings. For example, it takes three insertions (adding "a", "l", and "e" after "m") to change "m" into "male."

Take the example of the source string "landan" and the target string "London." If the edit distance between the two strings were below a preset threshold (that is, if the strings were similar enough), your fuzzy matching program would match "landan" to "London." You would then correct every string that met the threshold to "London."
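
Here is a minimal, dependency-free sketch of that idea: a standard dynamic-programming edit distance plus a correction helper. The distance threshold of 2 is an illustrative assumption; dedicated fuzzy matching libraries provide faster implementations and normalized similarity scores.

    def edit_distance(source, target):
        """Minimum number of insertions, deletions, and substitutions (case-insensitive)."""
        source, target = source.lower(), target.lower()
        m, n = len(source), len(target)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if source[i - 1] == target[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[m][n]

    def correct(value, candidates, max_distance=2):
        """Replace value with the closest candidate if it is within the threshold."""
        best = min(candidates, key=lambda c: edit_distance(value, c))
        return best if edit_distance(value, best) <= max_distance else value

    print(correct("landan", ["London", "Leeds", "Luton"]))  # -> London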

4. Take Care of Outliers

Any value that substantially differs from the rest of your data may be an outlier. However, not all outliers are bad for business analytics. So, before removing any deviant values, assess their potential impact on your analysis.

Take the example of analyzing the various states in the United States. In this scenario, California would be an outlier because of its significantly larger population and economy. Nonetheless, excluding the state would substantially distort the results of your analysis. With that said, outliers affect some analytical models, such as linear regression, more than others.

The two main types of statistical outliers are:

  • Univariate outlier: With this data point, only one variable has an extreme value. Excluding this type of outlier in your analysis may lead to unrealistic conclusions.
  • Multivariate outlier: This type has a combination of abnormal scores on two or more variables.
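
To illustrate the univariate case, here is a minimal sketch using the common interquartile-range (IQR) rule; the population figures are only illustrative numbers.

    import pandas as pd

    def iqr_outliers(series, k=1.5):
        """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as univariate outliers."""
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        return (series < q1 - k * iqr) | (series > q3 + k * iqr)

    # Illustrative state populations in millions (one dominant value).
    populations = pd.Series([39.5, 8.5, 7.7, 7.2, 6.1, 5.8, 5.7, 5.5])
    print(populations[iqr_outliers(populations)])  # flags 39.5

Whether to drop a flagged value is still a judgment call; as noted above, removing a genuine but extreme observation such as California can distort the analysis.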

5. Drop, Impute, or Flag Missing Data

Discretion is necessary when you're deciding whether to drop, impute, or flag missing data. What you do with the missing pieces of information affects the accuracy of your analytics.

Imputing

Imputing means estimating the missing value from the rest of your data. Imputation techniques include linear regression and substituting summary statistics such as the mean or median. You may also copy values from similar observations elsewhere in your dataset.

Nonetheless, imputing values increases the risk of using biased data in your analysis. That can be a major concern when there are many missing values.   

Moreover, imputation does not enrich your dataset in any way. It only reinforces patterns that the other observations have already established.
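
A minimal pandas sketch of simple imputation; the file and column names are assumptions.

    import pandas as pd

    df = pd.read_csv("patients.csv")  # hypothetical file

    # Impute numeric gaps with the column median (less sensitive to outliers than the mean).
    df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

    # Impute categorical gaps with the most frequent value (the mode).
    df["blood_type"] = df["blood_type"].fillna(df["blood_type"].mode()[0])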

Dropping

You may drop observations that have missing values when analyzing statistical data. In such cases, dropping can be safer than imputing values that might distort the computation results.

Flagging

Flagging means telling your ML algorithms which values are missing. The main argument for this approach is that it prevents the loss of information: your analytical models can learn from the fact that specific values are absent. Flagging is particularly useful when data is missing systematically, rather than at random.

You may fill missing numeric data with "0." However, statistical calculations should ignore these zeros.

In the case of categorical observations, flag any missing values with "Missing." Here, you add a new category so your algorithm can learn what values are unavailable. 
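
A minimal pandas sketch of the flagging approach described above; the file and column names are assumptions.

    import pandas as pd

    df = pd.read_csv("orders.csv")  # hypothetical file

    # Record which values were absent so models can learn from that fact.
    df["discount_missing"] = df["discount"].isna()

    # Fill numeric gaps with 0, but exclude them from statistics via the flag.
    df["discount"] = df["discount"].fillna(0)
    print(df.loc[~df["discount_missing"], "discount"].mean())

    # For categorical fields, add an explicit "Missing" class.
    df["coupon_code"] = df["coupon_code"].fillna("Missing")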

6. Standardize the Data

Cleansing your data includes standardizing it to have a uniform format for each value. You may start by having all strings in the same case (upper or lower).

Metric conversion may be necessary when standardizing measurements. For example, all height values should use the same unit, so you may need to convert from feet to meters (or vice versa). Likewise, record monetary amounts consistently in either dollars or cents across your dataset.

Be sure to standardize all other units of measurement in your database. These include weight, distance, and temperature. As for dates, choose either the USA style or the European format.
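
A minimal pandas sketch of these standardization steps; the file name, column names, and target units are assumptions.

    import pandas as pd

    df = pd.read_csv("measurements.csv")  # hypothetical file

    # One case for all strings (e.g., "california" -> "California").
    df["state"] = df["state"].str.strip().str.title()

    # One unit for all heights: convert rows recorded in feet to meters.
    in_feet = df["height_unit"] == "ft"
    df.loc[in_feet, "height"] = df.loc[in_feet, "height"] * 0.3048
    df.loc[in_feet, "height_unit"] = "m"

    # One date format everywhere (ISO 8601 shown here).
    df["recorded_on"] = pd.to_datetime(df["recorded_on"]).dt.strftime("%Y-%m-%d")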

7. Validate the Data

Validation ensures your data is correct and ready for meaningful analysis. So, check all your data for accuracy after changing it. You may need an interactive, AI-powered tool to do this. Critical considerations in the final stages of data cleansing include ensuring that:

  • Your data meets pre-established range constraints
  • Each input value is of the mandated data type
  • There are no missing values for mandatory fields
  • There are no nonsensical values
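
These checks are easy to script as well. Here is a minimal pandas sketch, with hypothetical column names and constraints.

    import pandas as pd

    df = pd.read_csv("orders.csv")  # hypothetical file

    checks = {
        "age_in_range":      df["customer_age"].between(0, 120),   # range constraint
        "amount_is_numeric": pd.to_numeric(df["order_amount"], errors="coerce").notna(),
        "email_present":     df["email"].notna() & (df["email"].str.strip() != ""),
        "quantity_sensible": df["quantity"] > 0,                    # no nonsensical values
    }

    for name, passed in checks.items():
        print(f"{name}: {passed.mean():.1%} of rows pass")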

Final Thoughts…

Data cleansing paves the way for meaningful analytics, and clean data is a key driver of intelligence-driven business decisions.

So, would you like to expedite your data cleansing workflows? Try Xplenty right away!
