Raw data—like unrefined gold buried deep in a mine—is a precious resource for modern businesses. But before you can benefit from raw data, you need to extract it, sift through it, understand it, and transform it into something you can analyze.
That’s where ETL (extract, transform, load) pipelines come into play. As a vital stage of the ETL process, data transformation is necessary to change the information into a format or schema that a business intelligence platform can interact with to derive actionable insights.
However, as essential as data transformation is, only data engineers and data scientists tend to understand it. To demystify data transformation for you and your team, we've covered this topic with the following sections:
Before we begin, you might be wondering how data transformation relates to Iron.io. As a compute-heavy process, performing data transformations on-site or in the cloud can tax the average server setup to the point of shutting it down. That's why many organizations are turning to Iron.io's server-less platform and its multi-cloud, Docker-based job processing, which can perform complex data transformations in a container, as needed, without overburdening your data system.
Data Transformation: Explained
Data transformation involves the conversion of data from one structure (or no structure) to another so you can integrate it with a data warehouse or with different applications. The process lets you expose the information to advanced business intelligence tools to create valuable performance reports and forecast future trends.
Data transformation includes two primary stages: (1) understanding and mapping the data; and (2) transforming the data. To dive deeper into the best practices for data transformation and some of the challenges you may come across, we wrote this blog post to help you throughout the data transformation process.
(1) Understanding and Mapping the Data
During the first stage of data transformation, you’ll identify and study your data sources. These sources could be databases attached to different business systems, accounting software, and CRM platforms. They could also be streaming sources, customer log files, web application data, or relate to mobile app usage statistics, and more.
After studying and understanding the data sources, you’ll determine the data structures you’re dealing with, and the types of transformations required to connect them. For example, what do the columns and rows look like? How are they labeled? What kind of information do they contain? And, how does information in one data source relate to another source?
Next, you'll perform "data mapping" to define how the fields in different data sources connect together, and what types of data transformations they require.
(2) Transforming the Data
During the second stage of data transformation, you will carry out the different data transformations that you mapped in the first stage. There are several strategies for doing this:
- Hand-Coding ETL Solutions Through Scripting: Traditionally, you would set up your ETL process through scripting, by hand-writing code in SQL or Python. This was a task you'd give to offsite developers, and it was a time-consuming process. Moreover, because offsite developers had to interpret your requirements, misunderstandings could result, causing unintentional errors.
- Onsite Server-Based ETL Solutions: Onsite ETL solutions work through your onsite servers to extract, transform, and load information into an onsite data warehouse. Although most companies are now moving to cloud-based data warehousing and ETL solutions, onsite ETL still has its place. Compared to offsite scripting solutions, onsite ETL offers the benefit of more oversight by the end-user. However, you may need to hire expert staff to manage it.
- Cloud-Based ETL Solutions: Cloud-based ETL solutions like Xplenty are the future of data transformation. They work through the cloud rather than an onsite server. These solutions are particularly useful when you're linking cloud-based software as a service (SaaS) platforms like SalesForce to a cloud-based data warehouse like Amazon Redshift. They can also help you integrate an onsite business system to a cloud-based data warehouse. Many feature drag-and-drop graphical interfaces that make it easy for end-users to manipulate and control their data transformations (with no expertise in data science). Furthermore, they offer automatic integrations that automatically perform the transformations your data requires.
Types of Data Transformations
Below we've listed the types of transformations that you, your ETL platform, or your data team may need to perform during the ETL process. Although the majority of these tasks can happen automatically with a data transformation platform, sometimes you may need to set up and code ETL processes yourself.
Data deduplication is a data compression process where you identify and remove duplicate or repeated copies of information. Also referred to as single-instance storage, intelligent compression, commonality factoring, or data reduction, deduplication allows you to store one unique copy of data in your data warehouse or database.
The deduplication process analyzes incoming data and compares it to data that’s already stored in the system. If the data is already there, deduplication algorithms delete the duplicate information while creating a reference to it. If you upload a changed version of a previous file, the system will back up said file while adding the changes to the data segment. Deduplication algorithms also keep track of outgoing data to delete duplicates, which speeds up the information transfer process.
When the tables in a data warehouse have keys with built-in meanings, serious problems can develop. For example, if a client phone number serves as a primary key, changing the phone number in the original data source means that the number would have to change everywhere it appears in the data system. That would cause a cascade of updates that over-burden or slow down the system.
Through key restructuring, you can transform any keys with built-in meanings to generic keys—i.e., random numbers that reference back to the source database with the actual information. By drawing key connections from one table to another, key restructuring optimizes the data warehouse for speed and efficiency.
Data cleansing involves deleting out-of-date, inaccurate, incomplete, and duplicate information to increase the accuracy of data. Also referred to as data scrubbing and data cleaning, data cleansing relies on the careful analysis of datasets and data storage protocols to support the most accurate data possible. The process might include parsing data to remove syntax errors, deleting record fragments, and correcting typos. It could also involve fixing duplication problems that result from merging multiple datasets.
Data validation is the process of creating automated rules or algorithms that engage when the system encounters different data issues. Data validation helps ensure the accuracy and quality of the data you transform. For example, a rule could go into effect when the system finds that the first three fields in a row are empty (or NULL value). The rule might flag the row for the end-user to investigate later, or stop the system from processing the row altogether.
Format revisions fix problems that stem from fields having different data types. Some fields might be numeric, and others might be text. One data system could treat text versus numeric information differently, so you might have to standardize the formats to integrate source data with the target data schema. This could involve the conversion of male/female, date/time, measurements, and other information into a consistent format.
Field lengths can also be an issue—especially if the target schema has smaller character limits. In these cases, it may be necessary to standardize the length of fields by breaking up long serial numbers into their smaller parts and putting them into separate columns.
Additionally, format revision could involve splitting up a comma-separated list of words or numbers into multiple columns.
Data derivation involves the creation of special rules to “derive” the specific information you want from the data source. For example, you might have a database that includes total revenue data from sales, but you’re only interested in loading the profit figures after subtracting costs and tax liabilities. Data derivation allows you to create a set of transformation rules that subtract costs and taxes from the total revenue information.
Data aggregation is a process that searches, gathers, summarizes and presents data in different reports. Let’s say you have a list of male and female employees and their salaries, and you want to know the total male salaries compared to the total female salaries. You can aggregate the list by male and female, then sum up the total salaries for each group.
You can use a business intelligence platform to perform data aggregations based on the insights decision-makers need, or you can perform manual aggregations by coding in SQL.
- Data Summarization: Data Summarization is similar to data aggregation. It refers to the creation of different business metrics through the calculation of value totals. You could sum up the total revenue of all the sales made by the individual salespeople on your staff, then create sales metrics that reveal total sales for individual time-periods.
Data integration is the process of taking different data types (like different databases and datasets relating to sales, marketing, and operations) and merging them into the same structure or schema. As a primary goal of ETL for data warehousing purposes, data integration supports the analysis of massive data sets by merging multiple data sources into an easy-to-analyze whole.
At its core, data integration reconciles differing values and names that refer to the same data elements within the data warehouse. By giving each element a standard name and definition, data integration makes it possible to analyze the information with a business intelligence platform.
Data filtering includes techniques used to refine datasets. The goal of data filtering is to distill a data source to only what the user needs by eliminating repeated, irrelevant, or overly sensitive data. Data filters can be used like this to amend query results and data reports.
In its most practical form, data filtering simply involves the selection of specific rows, columns, or fields to display from the dataset. For example, if the end-user doesn’t need to see the addresses or Social Security numbers of each client in the report, data filtering will scrub them from the report.
Joining data is one of the most important functions of data transformation. A “join” is an operation in the SQL database language that allows you to connect two or more database tables by their matching columns. This allows you to establish a relationship between multiple tables, which merges table data together so you can query correlating data on the tables.
After reading this piece, you should have an excellent introduction—and point of reference—for what data transformation is and what it entails. Also, as we established at the beginning of this piece, you might want to consider Iron.io's server-less platform and its multi-cloud, Docker-based job processing to manage the compute-heavy nature of performing large data transformations. Finally, if you'd like to go deeper into the topic of data transformation and ETL, we recommend our blog post "Introduction to Data Integration OR What Is ETL?"
Xplenty: Fast, Powerful Data Transformation (the Easy Way)
As a cloud-based ETL solution, Xplenty, offers hundreds of ready-made data transformations for the most popular SaaS platforms. Xplenty lets you immediately load data from your on-site and cloud-based data sources like Salesforce, Google Sheets, and SQL databases into your data warehouse. Learn more about our wide range of easy-to-use ETL solutions by exploring the Xplenty website now.