There comes a time when data analysis needs outgrow the capacity of the standard reporting applications. Once you realize that standard solutions like Google Analytics or Mixpanel simply can’t do what you need them to, you might consider building a custom business intelligence (or BI) solution. The foundation of your new BI solution will be the data integration layer, also referred to as ETL. In this article, we’ll drill down in what exactly ETL is, and how your organization can benefit from it.
Table of Contents
What is ETL?
Your company has access to many data sources but, more often than not, that data is presented in a way that is less useful to you. Simply defined, ETL (or Extract, Transform, Load) is the collecting and processing of data from various sources into one data store where it can then be analyzed. The results of that analysis can then be used to inform your business strategies and decisions.
Let's look at each piece of the ETL Process more closely.
Extracting data is the act of targeting a data source and pulling the data from it so that it can be transformed, integrated and stored elsewhere. We can target many different databases of various types for extractions, and we can run each extraction on a schedule so that you get a regular flow of current and accurate data.
Since data doesn’t necessarily come to you in the form that you want, it usually needs to be transformed. Maybe you want to limit the data that you are storing to just a few fields or sort it so that all the columns are in a certain order. Perhaps you want to join several tables together, or maybe you have a messy database full of duplicate records that need to be cleaned. Transformation is the step in the ETL process that will take care of preparing the data to make it most useful to you when you access it.
Finally, once the data has been sorted, clean, validated and prepared, you want to load that data somewhere. The most common load target is a data warehouse, where you can keep it for future analysis and tracking trends.
For more information, refer to A Deep Dive into Extract, Transform, Load later in this article.
Related Reading: ETL vs ELT
Implementing ETL in a Data Warehouse
Mirror/Raw layer - This layer is a copy of the source files or tables, with no logic or enrichment. Source data is copied and added to the target mirror tables, which then hold historical raw data that is ready to be transformed.
Staging layer - Once the raw data from the mirror tables is transformed, all transformations are stored in staging tables. These tables hold the final form of the data for the incremental part of the ETL cycle in progress.
Schema layer - These are the destination tables, which contain all the data in its final form after cleansing, enrichment and transformation.
Aggregating layer - In some cases it is beneficial to aggregate data to a daily or store level from the full dataset. This can improve report performance, enable the addition of business logic to calculated measures and make it easier for report developers to understand the data.
Why Do I Need ETL?
The short answer is that ETL saves significant time on data extraction and preparation, which is time that could be better spent on evaluating and using that information in your business.
Each of the 3 main components in the ETL saves time and development effort by doing it once in a dedicated data flow:
- Extract – There is a saying, “A chain is only as strong as its weakest link.” In the context of ETL, the strength of the chain is also determined by its first link. The extract stage determines different data sources, refresh rate (velocity) of each source, and priorities (extract order) between them – all of which heavily impact time-to-insights.
- Transform - After extracting the data into an ETL environment, transformations bring clarity and order to the initial data swamp. For example, dates are consolidated into specified time buckets, strings are parsed to their business meanings, transactions are modeled into events, location data is translated to coordinates, zip codes or cities/countries, measures are summed up, averaged, rounded and useless data and errors are set aside for later inspection. PII data can be masked for GDPR, CCPA and other privacy requirements.
- Load – In the last phase, much as in the first, targets and refresh rates are determined. Moreover, the load phase determines whether loading will be done by increments or “upsert” (update existing data and insert new data) is required for the new batches of data.
How Does Modern ETL Help My Business?
Today's data is indeed big - not only in size but in impact, possible interpretations, and use cases. Beyond managing and monitoring the revenue flow and users, each department in a modern organization requires unique insights from big data:
- Sales requires quality information about prospective customers
- Marketing needs to assess campaign conversion rates and future strategy
- Customer Success wants to drill down to the single row event in order to solve or improve customer needs and problems.
In order to satisfy such diverse data demands without getting lost in parallel data versions, ETL creates an environment that sustains democratic data governance:
Data governance - The overall management of availability, usability, integrity, and security of enterprise data. This is the process of creating coherent and connected data layers. Data governance enables data democracy by delivering a holistic view the ever-expanding data universe for all data clients.
Data democracy - Helps everyone who needs access to data analytics within your company to reduce steep learning curves, ask the right data questions, and be involved in the process of clarifying the answers.
A Deep Dive into Extract, Transform, Load
In this section, we'll take an in-depth look at each of the three steps in the ETL process.
ETL can be implemented with scripts (custom DIY code) or with a dedicated ETL tool. It performs a number of important functions including:
Parsing/Cleansing - Data generated by applications is created in various formats like JSON, XML or CSV. During the parsing stage, data is mapped into a table format with headers, columns and rows, and the specified fields are extracted.
Data Enrichment - In order to prepare data for analytics, certain enrichment steps are usually required, including: tweaking, injecting expert knowledge, geo modifications, matching between sources and correcting bugs.
Setting Velocity – Velocity refers to the frequency of data loading, whether new data should be inserted, or if existing data needs to be updated.
- Data Validation - There are cases where data is empty, corrupted, missing crucial elements, too thin or too bloated. ETL finds these occurrences and determines whether to stop the entire process, skip it or set it aside for inspection while alerting the relevant administrators.
Data extraction involves the following four steps:
- Identify the data to extract: The first step of data extraction is to identify the data sources you want to incorporate into your data warehouse. These sources might be from relational SQL databases like MySQL. They could also be non-relational NoSQL databases like MongoDB or Cassandra. The information could also be from a software as a service (SaaS) platform such as Salesforce and other applications. After identifying the data sources, you'll also want to determine the specific data fields you want to extract.
- Estimate how large the data extraction is: The size of the data extraction matters. Are you extracting 500 gigabytes or 50 petabytes of data? A larger quantity of data will require different ETL strategies. For example, you can make a larger dataset more manageable by aggregating it to month-level rather than day-level, which reduces the size of the extraction. Alternatively, you can upgrade your server settings to handle the larger dataset.
- Choose the extraction method: Since the data warehouse needs to update its data stores continually for the most accurate reports, data extraction is an ongoing process that may need to happen on a minute-by-minute basis. There are three main methods for extracting information:
- Update notifications: The preferred method of extraction involves update notifications. The source system will send a notification when one of its records has changed. Then the data warehouse updates with the new information only.
- Incremental extraction: The second method, when update notifications aren’t possible, is incremental extraction. This involves the identification of changed records, and an extraction that updates only those records. A potential setback is that incremental extraction cannot always identify deleted records.
- Full extraction: When the first two methods won't work, a complete update of all the data through full extraction is necessary; keep in mind that this method only works well with smaller data sets.
- Assess your SaaS platforms: Businesses formerly relied on in-house applications for accounting and other record keeping. These applications used OLTP transactional databases that they maintained on an onsite server. Today, more businesses use SaaS platforms like Google Analytics, HubSpot, and Salesforce. To pull data from one of these platforms, you’ll need a solution like Xplenty that integrates with the unique API of the platform.
Cloud-based ETL solutions like Xplenty extract data from popular SaaS APIs by:
- Engineering out-of-the-box API integrations for the most popular SaaS applications (browse Xplenty's hundreds of out-of-the-box AI integrations here).
- Navigating complex REST APIs, even converting SOAP to REST automatically.
- Creating strategies to deal with custom resources and fields—and the many built-in resource endpoints—found in different SaaS APIs.
- Doing the work to provide constant updates and fixes for data connections that fail—for example, Salesforce might update its API but fail to notify users, resulting in a scramble to find a solution. ETL platforms like Xplenty develop relationships with SaaS developers to receive advanced notice of these kinds of updates before they go live, which prevents unwanted surprises.
In traditional ETL strategies, data transformation that occurs in a staging area (after extraction) is called “multistage data transformation.” In ELT, data transformation that happens after loading data into the data warehouse is called “in-warehouse data transformation.”
Whether you choose ETL or ELT, you may need to perform some of the following data transformations:
- Deduplication (normalizing): Identifies and removes duplicate information.
- Key restructuring: Draws key connections from one table to another.
- Cleansing: Involves deleting old, incomplete, and duplicate data, to maximize data accuracy—perhaps through parsing to remove syntax errors, typos, and fragments of records.
- Format revision: Converts formats in different datasets—like date/time, male/female, and units of measurement—into one consistent format.
- Derivation: Creates transformation rules that apply to the data. For example, maybe you need to subtract certain costs or tax liabilities from business revenue figures before analyzing them.
- Aggregation: Gathers and searches data so it can be presented in a summarized report format.
- Integration: Reconciles diverse names/values that apply to the same data elements across the data warehouse, so that each element has a standard name and definition.
- Filtering: Selects specific columns, rows, and fields within a dataset.
- Splitting: Splits one column into more than one column.
- Joining: Links data from more than one source, like adding spend information across more than one SaaS platform.
- Summarization: Creates different business metrics by calculating value totals. For example, you might add up all the sales made by a specific salesperson to create total sales metrics for specific periods.
- Validation: Sets up automated rules to follow in different circumstances. For example, if the first five fields in a row are NULL, then you can flag the row for investigation, or prevent it from being processed with the rest of the information.
Data loading is the process of loading the extracted information into your target data repository. Loading is an ongoing process that could happen through "full loading" (the first time you load data into the warehouse) or "incremental loading" (as you update the data warehouse with new information). Because incremental loads are the most complex, this section will focus on them.
Types of Incremental Loads:
Incremental loads extract and load new information that appeared since the last incremental load. This can happen in two ways:
- Batch incremental loads: Are when the data warehouse ingests information in packets or batches. If it's a particularly large batch, it's best to carry out a batch load during off-peak hours—on a daily, weekly, or monthly basis—to prevent system slowdowns. However, modern data warehouses can also ingest small batches of information on a minute-by-minute basis with an ETL platform like Xplenty. This allows them to achieve an approximation of real-time updates for the end-user.
- Streaming incremental loads: Are when the data warehouse ingests new data as it appears, in real-time. They are particularly valuable when the end-user requires real-time updates for perfectly-timed decisions. That said, streaming incremental loads are only possible when the updates involve a very small amount of data. In most cases, minute-by-minute batch updates offer a more robust solution than real-time streaming.
Incremental Loading Challenges:
Incremental loads can disrupt system performance and cause a host of problems, including:
- Data structure changes: Data formats—in your data sources or data warehouse—may need to evolve according to the needs of your information system. However, changing one part of the system could lead to incompatibilities that interfere with the loading process. To prevent problems relating to inconsistent, corrupt, or incongruent data, it’s important to zoom out and review how small changes affect the total ecosystem. Then you can make appropriate adjustments across the system.
- Processing data in the wrong order: Data pipelines can follow complex trajectories that result in your data warehouse processing, updating, or deleting information in the wrong order. That can lead to corrupt or inaccurate information. For this reason, it’s vital to monitor and audit the ordering of data processing.
- Failure to detect problems: When you don't know about a problem, it invariably gets worse. Therefore, it's important to quickly detect when an API goes down, when your API access credentials are out-of-date, when system slow-downs interrupt dataflow from an API, and when the target data warehouse is down. The sooner you detect the problem, the faster you can fix it, and the easier it is to correct the inaccurate/corrupt data that results from it.
Ready to move forward with your custom ETL Solution?
Find out how easy it is to transform your data with Xplenty here.
Originally published on: March 22nd, 2019