According to the Elephant Care Manual for Mahouts and Camp Managers: “It is essential to cleanse the elephant's body carefully every day by using half of a coconut shell to scrape the elephant on a daily basis.” ETL developers may not have coconuts at their disposal, but some of them may still need to do the dirty work of cleansing Big Data.
Xplenty is to data cleansing as Whirlpool is to washing dishes. Create an account and take the platform for a free 14 day spin.
Data cleansing/cleaning/scrubbing is the process of fixing or removing corrupt or inaccurate data. This process is needed because incorrect data can lead to wrong conclusions and bad analysis, especially considering the huge quantities of Big Data. Just take a look at the UK banking industry which is losing hundreds of millions of pounds due to Big Bad Data.
Lack of validation
Data is not always validated properly when entered, so it requires fixing after it is already in the database. For example, phone numbers may be written with or without spaces, with letters instead of numbers, etc., thus causing the same number to appear in different variations.
Data from different sources
Data sources, whether originating from inside or outside of the organization, may need cleansing when combining them together. Data inside the organization could arrive from different applications. These applications may provide data about the same entities, but not the same identities. Data from outside the organization, such as social network data, could already arrive with ambiguous or corrupt information.
The same name could be written in several ways - full name, full name with comma, last name only, and more. For example: “Wile E. Coyote”, or “Coyote, Wile E.”, or “Coyote”, and so forth.
Although Bombay is the same city as Mumbai (British name vs. Hindu name), different people may use either name while meaning the same thing. Even the US could be noted as USA or written out as United States of America or just United States.
Dates present a particularly big data migraine as there are plenty of formats. The same date could be written as 8 November 2003, 11/8/2003, 08.11.2003, 8-11-2003, 08-Nov-2003, just to name a few.
Dots and commas have a different role when it comes to writing numbers in German - the number 1,024.56 would be written by Germans as 1.024,56. Even Americans could write the same number without the comma as 1024.56 while Australians would write it with a space character as 1 024.56.
Financial data may arrive in different currencies. They have a different monetary value and different symbols or codes that may need to be consolidated together, e.g. $45, 45 USD, 1460 THB.
Data could also arrive in different languages such as English, Chinese, or Arabic - not just freetext, but also personal names, addresses, and so on.
Upper/lower case ambiguity, spelling mistakes, and other data may need to be cleansed depending on the context of your business.
Running histograms on the data can determine which values appear less frequently and could therefore be invalid. These values can then be updated, although this is a problem with Hadoop which does not provide an update functionality.
In case certain data issues are already known, for instance, that the US is written out in several different ways, conversion tables could be used. The data should first be sorted by the relevant key, lookups could be used to make the conversions, and finally, the results should be stored for further use.
Various vendors such as IBM, SAS, Oracle, Lavastorm Analytics, and Talend provide data cleansing solutions. Free tools include Open Refine, plyr, and reshape2, though it is unclear whether they can handle Big Data.
Spell checking or phonetic algorithms can be used to fix some of the data. Nonetheless, they could also corrupt it by making the wrong suggestions, so some manual work may be necessary.
Speaking of which, most of the data is usually cleaned by hand. Even with the help of tools, histograms, and algorithms, human intervention is still needed to understand and fix the data.
Big Data includes bad data which needs to be cleansed to get better analytics and possibly save a lot of money. How much data is being cleansed? Not much. According to a survey by TEK Systems, 60 percent of IT leaders claim that their organizations lack accountability for data quality. It seems most companies would rather look the other way and let their elephants stay dirty.