There comes a time when data analysis needs outgrow the capacity of the standard reporting applications. Once you realize that standard solutions like Google Analytics or Mixpanel simply can’t do what you need them to, it may be time to consider building a custom BI solution. To learn about all the components of such a solution click here.
The foundation of your new BI solution will be the data integration layer, also referred to as ETL. In this article, we’ll drill down in what exactly ETL is, and how your organization can benefit from it.
What Exactly is ETL?
ETL refers to three processes - Extract, Transform and Load. Simply defined, ETL enables the collection of data from various sources into one data store, ready for analysis.
ETL can be implemented with scripts (custom DIY code) or with a dedicated ETL tool. ETL performs a number of important functions including:
Parsing/Cleansing - Data generated by applications is created in various formats like JSON, XML or CSV. During the parsing stage, data is mapped into a table format with headers, columns and rows, and the specified fields are extracted.
Data Enrichment - In order to prepare data for analytics, certain enrichment steps are usually required, including: tweaking, injecting expert knowledge, geo modifications, matching between sources and correcting bugs.
Setting Velocity – Velocity refers to the frequency of data loading, whether new data should be inserted, or if existing data needs to be updated.
Data Validation - There are cases where data is empty, corrupted, missing crucial elements, too thin or too bloated. ETL finds these occurrences and determines whether to stop the entire process, skip it or set it aside for inspection while alerting the relevant administrators.
Why Do I need a Data Integration Layer?
The short answer is that ETL saves significant time on data extraction and preparation - time that could be better spent on extracting insights.
Each of the 3 main components in the ETL saves time and development effort by doing it once in a dedicated data flow:
Extract – There is a saying, “A chain is only as strong as its weakest link.” In the context of ETL, the strength of the chain is also determined by its first link. The extract stage determines different data sources, refresh rate (velocity) of each source, and priorities (extract order) between them – all of which heavily impact time-to-insights.
Transform - After extracting the data into an ETL environment, transformations bring clarity and order to the initial data swamp. For example, dates are consolidated into specified time buckets, strings are parsed to their business meanings, transactions are modeled into events, location data is translated to coordinates, zip codes or cities/countries, measures are summed up, averaged,rounded and useless data and errors are set aside for later inspection. PII data can be masked for GDPR, CCPA and other privacy requirements.
Load – In the last phase, much as in the first, targets and refresh rates are determined. Moreover, the load phase determines whether loading will be done by increments or “upsert” (update existing data and insert new data) is required for the new batches of data.
Implementing ETL in a DWH
When an ETL process is used to load a database into a data warehouse (DWH), each phase is represented by a physical layer:
Mirror/Raw layer - This layer is a copy of the source files or tables, with no logic or enrichment. Source data is copied and added to the target mirror tables, which then hold historical raw data that is ready to be transformed.
Staging layer - Once the raw data from the mirror tables is transformed, all transformations are stored in staging tables. These tables hold the final form of the data for the incremental part of the ETL cycle in progress.
Schema layer - These are the destination tables, which contain all the data in its final form after cleansing, enrichment and transformation.
Aggregating layer - In some cases it is beneficial to aggregate data to a daily or store level from the full dataset. This can improve report performance, enable the addition of business logic to calculated measures and make it easier for report developers to understand the data.
What does Modern ETL Require?
Today's data is indeed big - not only in size but in impact, possible interpretations and use cases.
Beyond management and monitoring of revenue flow and users, each element of the modern organization requires unique insights from big data:
- Sales requires quality information about prospective customers
- Marketing needs to assess campaign conversion rates and future strategy
- Customer Success wants to drill down to the single row event in order to solve or improve customer needs and problems. And more…
In order to satisfy such diverse data demands, without getting lost in parallel data versions, ETL creates an environment that sustains democratic data governance, delivering:
Data governance - The overall management of availability, usability, integrity, and security of enterprise data. This is the process of creating coherent and connected data layers.Data governance enables data democracy by delivering a holistic view the ever-expanding data universe for all data clients.
Data democracy - Helps everyone who needs access to data analytics within your company to reduce steep learning curves, ask the right data questions, and be involved in the process of clarifying the answers.
Ready to move forward with your ETL? Find out how easy it is to integrate your data with Xplenty here.