Data transformation is the process of converting data from its current structure into a new structure.
Transformation is the T in ETL (Extract, Transform, Load); it prepares data for other uses, such as analytics, warehousing, or use in production systems.
Why is Data Transformation Used?
Data transformation is used when moving data from one location to another, or when repurposing data to meet new requirements. The underlying data values remain the same, but their structure is altered to match the target format.
Transformations can be applied to structured and semi-structured data. Consider, for example, structured data stored in a relational database.
This structure might work in the original context, but the organization may need to copy that data to another database with a different structure.
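As an illustration (the table layout, column names, and values here are hypothetical), a source record and the differently structured destination record it must become might look like this:

```python
# Hypothetical source record: a flat row with separate name fields
# and a date stored as a plain text field
source_row = {
    "cust_id": 101,
    "first_name": "Ada",
    "last_name": "Lovelace",
    "signup_date": "20240115",
}

# The destination database expects a different structure:
# a single name column and an ISO-formatted date
destination_row = {
    "customer_id": 101,
    "name": "Ada Lovelace",
    "signup_date": "2024-01-15",
}
```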
During the transition, the source data must be altered to fit the structure of the destination database. With large production databases, this is generally done with an ETL process:
- Extract data from the source
- Transform source data according to a schema (the transformation schema is defined on the ETL platform)
- Load transformed data into the destination
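The three steps above can be sketched in plain Python. This is a minimal illustration, not a real platform's API; the schema, record fields, and the in-memory "database" are all assumptions:

```python
def extract(source):
    """Extract: read raw rows from the source (here, a list of dicts)."""
    return list(source)

def transform(rows):
    """Transform: reshape each row to match the destination schema."""
    transformed = []
    for row in rows:
        transformed.append({
            "customer_id": row["cust_id"],
            # Concatenate the separate name fields into one column
            "name": f"{row['first_name']} {row['last_name']}",
        })
    return transformed

def load(rows, destination):
    """Load: write transformed rows into the destination store."""
    destination.extend(rows)

source = [{"cust_id": 1, "first_name": "Ada", "last_name": "Lovelace"}]
warehouse = []  # stands in for the destination database
load(transform(extract(source)), warehouse)
```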
This process is used for several reasons, such as:
- Compatibility: Most production systems have a database layer, and these databases each have different structures. Transformation is required when moving from one system to another, like when data is copied from a Customer Relationship Management system to a sales platform.
- Consistency: Transformation can apply a universal table structure to disparate data sources, which speeds up analytics projects. Transformation also includes cleansing, integrating, and harmonizing data, resulting in high-quality output.
- Storage: Some storage repositories, such as data warehouses and data marts, work best when data is in a standard format. Passing data through a transformation layer allows for more efficient storage and faster retrieval.
In an automated ETL process, the transformation happens on the back end using the techniques described below.
What is the Data Transformation Process?
When data is first extracted, it is hosted in a temporary data repository known as the staging area.
In the staging area, data is transformed to meet requirements. In the case of an automated ETL, these transformations are performed automatically according to a schema defined by administrators. Xplenty provides a drag-and-drop interface that allows the transformation schema to be configured without coding.
At the back end, the transformation process can involve several steps:
- Key restructuring: While data is transformed to meet the requirements of the destination schema, entity relationships are preserved. Transformation almost always involves some disruption to primary and foreign keys, so one of the main objectives is to avoid breaking any relationships.
- Formatting: Columns are converted into the appropriate data format. For example, dates may be stored as an integer or a text field in the source database. The transformation layer can apply a consistent date type to all relevant data.
- Mapping: Data is copied from a source column to the corresponding column in the destination database. Data mapping defines the relationship between source and destination columns, so the right data lands in the right place.
- Concatenation: Compatible data values are combined into a single value. For instance, FIRST_NAME and LAST_NAME can be concatenated and mapped to a single NAME column. Values can also be separated by applying the same logic in reverse.
- Splitting/Joining: A single column can be broken into multiple columns if required. Similarly, multiple columns can be unified.
- Aggregation: Data can be simplified or summarized where required. For example, a list of invoice totals can be aggregated per client ID, giving a single lifetime value figure for each customer.
- Normalization: Duplicated information is removed. In a relational database, this might involve splitting data into multiple tables and linking them with primary and foreign keys.
- Cleansing: Corrupt, inaccurate, or invalid data is identified and repaired. This might be done by reformatting the item, deleting it, or finding the correct value through a lookup. The transformation stage is an ideal place to apply data cleansing methods, ensuring a high level of data quality when loading to the destination.
- Standardization: Values can be converted to a single, consistent format. State names are a common example: databases may store California, Cali, CAL, or CA. Each format is valid, but these values must be standardized when multiple sources are integrated. Standardization of this kind is often done with a lookup table, which lists the correct value alongside all common variations.
- Validation: Data values are tested to see if they fall within logical limits. For instance, dates may need to fall within a certain range, while URLs should resolve to a working location.
- Obfuscation: Sensitive data can be masked or otherwise obscured before being loaded to the destination repository. Obfuscation is commonly used when the destination data is being used for testing or analytics. This allows the end-users to work with a functioning version of data, but without exposing any personal details or business secrets.
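A few of the steps above, formatting, concatenation, aggregation, standardization, and obfuscation, can be illustrated in one short sketch. The column names, lookup values, and sample rows are hypothetical:

```python
from datetime import datetime

rows = [
    {"client_id": 1, "first": "Ada", "last": "Lovelace", "state": "Cali",
     "invoice": 100.0, "date": "20240115", "ssn": "123-45-6789"},
    {"client_id": 1, "first": "Ada", "last": "Lovelace", "state": "CA",
     "invoice": 50.0, "date": "20240302", "ssn": "123-45-6789"},
]

# Standardization: a lookup table mapping common variations to one value
STATE_LOOKUP = {"California": "CA", "Cali": "CA", "CAL": "CA", "CA": "CA"}

def transform_row(row):
    return {
        "client_id": row["client_id"],
        # Concatenation: combine first and last name into one NAME column
        "name": f"{row['first']} {row['last']}",
        # Formatting: convert a text date to a consistent ISO format
        "date": datetime.strptime(row["date"], "%Y%m%d").date().isoformat(),
        # Standardization via the lookup table
        "state": STATE_LOOKUP[row["state"]],
        # Obfuscation: mask sensitive values before loading
        "ssn": "***-**-" + row["ssn"][-4:],
        "invoice": row["invoice"],
    }

transformed = [transform_row(r) for r in rows]

# Aggregation: sum invoice totals per client ID into a lifetime value
lifetime_value = {}
for r in transformed:
    lifetime_value[r["client_id"]] = (
        lifetime_value.get(r["client_id"], 0.0) + r["invoice"]
    )
```

In a real pipeline these operations would be expressed in the ETL platform's schema rather than hand-written per row, but the logic is the same.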
What Happens After Data Transformation?
Once data has been transformed, it is transferred to the destination repository.
This is also the basis of data integration. Multiple, disparate sources can pass through the integration layer, and then emerge in a standardized format. All of this data can then be loaded to a data repository, where it is available as a single data source.
Data owners will often perform a quality check on transformed data, especially after the first run-through of a new ETL process. These checks will look at things like:
- Is the transformed data complete and accurate?
- Are all relationships intact?
- Are there any duplicate records?
- Are all values in a valid format or range?
- Did any values become corrupted during the transformation?
- Did any invalid data reach the destination repository?
- Has any sensitive data been unnecessarily exposed?
- Does the finished data set meet the business requirements of this project?
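Several of these checks can be automated. Here is a minimal sketch, assuming the transformed rows carry a hypothetical unique `customer_id` and an ISO-formatted `date` column:

```python
def quality_check(rows):
    """Run basic post-transformation checks; return a list of issues found."""
    issues = []
    seen_ids = set()
    for row in rows:
        # Completeness: no missing values
        if any(v is None for v in row.values()):
            issues.append(f"missing value in {row}")
        # Duplicates: customer_id should be unique
        if row["customer_id"] in seen_ids:
            issues.append(f"duplicate id {row['customer_id']}")
        seen_ids.add(row["customer_id"])
        # Valid range: ISO dates compare correctly as strings
        if not ("2000-01-01" <= row["date"] <= "2030-12-31"):
            issues.append(f"date out of range: {row['date']}")
    return issues

sample = [
    {"customer_id": 1, "date": "2024-01-15"},
    {"customer_id": 1, "date": "1999-05-01"},  # duplicate and out of range
]
issues = quality_check(sample)
```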
If the transformed data meets the required criteria, the data owners may choose to set up a data pipeline. This is essentially an ongoing ETL process that automatically takes data from sources, processes it through the transformation layer, and loads it to the destination, all without the need for further intervention.