ETL refers to the extraction, transformation, and loading of data from one data source to another. As a necessary part of setting up a modern data analytics system, ETL allows you to transform incompatible data from multiple sources, then integrate it into a single target data warehouse. That way, advanced business intelligence (BI) platforms can analyze the data as a whole to produce the insights and forecasting you need to make better decisions.
To help you understand the ETL process in modern data warehousing, this guide will explain the following:
- A Brief Overview of ETL (extract, transform, load) and ELT (extract, load, transform)
- The Stages of ETL and ELT
- The Technical Aspects and Challenges of ETL
A Brief Overview of ETL and ELT
Since ETL and ELT are important options in modern data integration, we'll use this section to define them. To begin, the three stages of ETL and ELT include:
- Extraction: Involves the extraction of data from the data sources. With ETL, you generally extract structured information from online transactional processing (OLTP) databases (such as MySQL or PostgreSQL databases), which are optimized for writing, editing and saving information in the SQL database language. With ELT, you may be extracting both structured and unstructured information from your data sources.
- Transformation: Involves cleansing and optimizing information to prepare it for analysis. ETL involves the transformation of the data within a staging area before loading it into the data repository. ELT involves the transformation of information after loading it into the data repository.
- Loading: Involves putting the information into the data warehouse. With ETL, you'll load the data into an online analytical processing (OLAP) data warehouse—a data repository optimized for reading and analyzing data in a relational format. With ELT, you'll load the data into a "data lake"—a data repository optimized for storing both structured and unstructured information.
As you can see, the primary difference between ETL and ELT relates to the order at which the data transformations occur. To get into more detail or if you need help choosing the right tool for your business, this blog post contains a side-by-side ETL vs ELT comparison with common questions you might have.
ETL and OLAP Data Warehouses
Data engineers have been using ETL for over two decades to integrate diverse types of data into online analytical processing (OLAP) data warehouses. The reason for doing this is simple: to make data analysis easier.
Normally, business applications use online transactional processing (OLTP) database systems. These are optimized for writing, updating, and editing the information inside them. They’re not good at reading and analysis. However, online analytical processing database systems are excellent at high-speed reading and analysis. That’s why ETL is necessary to transform OLTP information, so it can work with an OLAP data warehouse.
During the ETL process, information is:
- Extracted from various relational database systems (OLTP or RDBMS) and other sources.
- Transformed within a staging area, into a compatible relational format, and integrated with other data sources.
- Loaded into an online analytical processing (OLAP) data warehouse server.
In the past, data engineers hand-coded ETL pipelines in R, Python, and SQL—a laborious process that could take months to complete. Today, hand-coded ETL continues to be necessary in many cases. However, modern ETL solutions like Xplenty allow data teams to skip hand-coding and automatically integrate the most popular data sources into their data warehouses. This has dramatically increased the speed of setting up an ETL pipeline, while eliminating the risk of human error.
As this Xplenty user notes, “The speed and consistency of Xplenty is impressive, and it more than makes up for what a few tools in our kit may be lacking."
Once the data is integrated into your data warehouse, the highly-efficient nature of the OLAP data system allows for stable, rapid analysis.
ELT and Data Lakes
In contrast to the more widely-used ETL, ELT introduces more flexibility into the data transformation/integration process. By loading data into a "data lake" instead of a structured OLAP data warehouse, you can upload and save all types of structured and unstructured information for use at a later time.
ELT and data lakes take advantage of the high-powered processing offered by modern, cloud-based servers like Snowflake, Google BigQuery, and Redshift. These servers are so powerful that they can perform data transformations on the fly, allowing ELT to skip the staging area and transform only the data you need to analyze at the moment.
In other words, ELT performs transformations right before introducing the data to your BI tools. Because ELT and data lakes work with raw, unstructured information, they allow you to introduce unstructured information like emails and written customer survey responses to machine learning algorithms to derive new kinds of insights.
While ELT lets you store and work with unstructured information, there are important challenges associated with the ELT process that businesses need to be aware of:
- It's still a new technology that continues to be refined.
- The data lakes and ELT tools required for the process are still evolving.
- It's harder to find experts who can develop ELT pipelines.
- ELT is not yet as stable and reliable as ETL.
Even though the upload process in ELT is fast, the need to perform data transformations each time you want to analyze information slows down analysis compared to the high-speed queries and analyses possible with a pre-structured OLAP data warehouse.
In-Depth: Extraction, Transformation, Loading
In this section, we'll look at extraction, transformation, and loading in-depth:
Data extraction—i.e. the pulling of information from various data sources—involves the following four steps:
1. Identify the data to extract: The first step of data extraction is to identify the data sources you want to incorporate into your data warehouse. These sources might be from relational SQL databases like MySQL. They could also be non-relational NoSQL databases like MongoDB or Cassandra. The information could also be from a software as a service (SaaS) platform such as Salesforce and other applications. After identifying the data sources, you'll also want to determine the specific data fields you want to extract.
2. Estimate how large the data extraction is: The size of the data extraction matters. Are you extracting 500 gigabytes or 50 petabytes of data? A larger quantity of data will require different ETL strategies. For example, you can make a larger dataset more manageable by aggregating it to month-level rather than day-level, which reduces the size of the extraction. Alternatively, you can upgrade your server settings to handle the larger dataset.
3. Choose the extraction method: Since the data warehouse needs to update its data stores continually for the most accurate reports, data extraction is an ongoing process that may need to happen on a minute-by-minute basis. There are three main methods for extracting information:
- Update notifications: The preferred method of extraction involves update notifications. The source system will send a notification when one of its records has changed. Then the data warehouse updates with the new information only.
- Incremental extraction: The second method, when update notifications aren’t possible, is incremental extraction. This involves the identification of changed records, and an extraction that updates only those records. A potential setback is that incremental extraction cannot always identify deleted records.
- Full extraction: When the first two methods won't work, a complete update of all the data through full extraction is necessary; keep in mind that this method only works well with smaller data sets.
4. Assess your SaaS platforms: Businesses formerly relied on in-house applications for accounting and other record keeping. These applications used OLTP transactional databases that they maintained on an onsite server. Today, more businesses use SaaS platforms like Google Analytics, HubSpot, and Salesforce. To pull data from one of these platforms, you’ll need a solution like Xplenty that integrates with the unique API of the platform.
Cloud-based ETL solutions like Xplenty extract data from popular SaaS APIs by:
- Engineering out-of-the-box API integrations for the most popular SaaS applications (browse Xplenty's hundreds of out-of-the-box AI integrations here).
- Navigating complex REST APIs, even converting SOAP to REST automatically.
- Creating strategies to deal with custom resources and fields—and the many built-in resource endpoints—found in different SaaS APIs.
- Doing the work to provide constant updates and fixes for data connections that fail—for example, Salesforce might update its API but fail to notify users, resulting in a scramble to find a solution. ETL platforms like Xplenty develop relationships with SaaS developers to receive advanced notice of these kinds of updates before they go live, which prevents unwanted surprises.
In traditional ETL strategies, data transformation that occurs in a staging area (after extraction) is called “multistage data transformation.” In ELT, data transformation that happens after loading data into the data warehouse is called “in-warehouse data transformation.”
Whether you choose ETL or ELT, you may need to perform some of the following data transformations:
- Deduplication (normalizing): Identifies and removes duplicate information.
- Key restructuring: Draws key connections from one table to another.
- Cleansing: Involves deleting old, incomplete, and duplicate data, to maximize data accuracy—perhaps through parsing to remove syntax errors, typos, and fragments of records.
- Format revision: Converts formats in different datasets—like date/time, male/female, and units of measurement—into one consistent format.
- Derivation: Creates transformation rules that apply to the data. For example, maybe you need to subtract certain costs or tax liabilities from business revenue figures before analyzing them.
- Aggregation: Gathers and searches data so it can be presented in a summarized report format.
- Integration: Reconciles diverse names/values that apply to the same data elements across the data warehouse, so that each element has a standard name and definition.
- Filtering: Selects specific columns, rows, and fields within a dataset.
- Splitting: Splits one column into more than one column.
- Joining: Links data from more than one source, like adding spend information across more than one SaaS platform.
- Summarization: Creates different business metrics by calculating value totals. For example, you might add up all the sales made by a specific salesperson to create total sales metrics for specific periods.
- Validation: Sets up automated rules to follow in different circumstances. For example, if the first five fields in a row are NULL, then you can flag the row for investigation, or prevent it from being processed with the rest of the information.
Data loading is the process of loading the extracted information into your target data repository. Loading is an ongoing process that could happen through "full loading" (the first time you load data into the warehouse) or "incremental loading" (as you update the data warehouse with new information). Because incremental loads are the most complex, this section will focus on them.
Types of Incremental Loads:
Incremental loads extract and load new information that appeared since the last incremental load. This can happen in two ways:
- Batch incremental loads: Are when the data warehouse ingests information in packets or batches. If it's a particularly large batch, it's best to carry out a batch load during off-peak hours—on a daily, weekly, or monthly basis—to prevent system slowdowns. However, modern data warehouses can also ingest small batches of information on a minute-by-minute basis with an ETL platform like Xplenty. This allows them to achieve an approximation of real-time updates for the end-user.
- Streaming incremental loads: Are when the data warehouse ingests new data as it appears, in real-time. They are particularly valuable when the end-user requires real-time updates for perfectly-timed decisions. That said, streaming incremental loads are only possible when the updates involve a very small amount of data. In most cases, minute-by-minute batch updates offer a more robust solution than real-time streaming.
Incremental Loading Challenges:
Incremental loads can disrupt system performance and cause a host of problems, including:
- Data structure changes: Data formats—in your data sources or data warehouse—may need to evolve according to the needs of your information system. However, changing one part of the system could lead to incompatibilities that interfere with the loading process. To prevent problems relating to inconsistent, corrupt, or incongruent data, it’s important to zoom out and review how small changes affect the total ecosystem. Then you can make appropriate adjustments across the system.
- Processing data in the wrong order: Data pipelines can follow complex trajectories that result in your data warehouse processing, updating, or deleting information in the wrong order. That can lead to corrupt or inaccurate information. For this reason, it’s vital to monitor and audit the ordering of data processing.
- Failure to detect problems: When you don't know about a problem, it invariably gets worse. Therefore, it's important to quickly detect when an API goes down, when your API access credentials are out-of-date, when system slow-downs interrupt dataflow from an API, and when the target data warehouse is down. The sooner you detect the problem, the faster you can fix it, and the easier it is to correct the inaccurate/corrupt data that results from it.
The Technical Aspects of ETL and ELT
It's important to pay close attention to the following when designing your ETL and ELT processes:
- Ensure accurate logging: It's vital to make sure your data system provides "accurate logging" of new information. To ensure accurate logging, you'll need to audit data after loading to check for lost or corrupt files. With proper auditing procedures, you can debug your ETL/ELT process when data integrity challenges arise (as they invariably do).
- Flexibility to work with diverse sources of structured and unstructured data: Your data warehouse may need to integrate information from a lot of incompatible sources like PostgreSQL, Salesforce, Cassandra, and in-house financial applications. Some of this information could lack the data structures required for analysis. You need to design your ETL/ELT process to deal with all forms of data—structured and unstructured alike.
- Stability and reliability: ETL/ELT pipelines get overloaded, crash, and run into problems. Your goal should be to build a fault-tolerant system that can recover after a shutdown so your data can move without getting lost or corrupted even in the face of unexpected issues.
- Designing an alert system: To ensure the accuracy of your business insights, an alert system that notifies you of potential problems with the ETL/ELT process is essential. For example, you’ll want to receive notifications and reports for expired API credentials, bugs related to third-party APIs, connector errors, general database errors, and more.
- Strategies to speed up the flow of data: When data warehouses and BI platforms have access to information that is up-to-date, they offer better, more accurate insights at a moment’s notice. Therefore, it’s important to focus on reducing data latency, i.e., the time it takes for a data packet to move from one area of the system to the next.
- Growth flexibility: Your ETL/ELT solution should be flexible to scale up and down according to your organization’s changing data needs. This will save money on cloud-server processing and storage fees, while providing the ability to scale up as required.
- Support for incremental loading: Using change data capture (CDC) speeds up the ETL process by permitting incremental loading. This lets you update only a small part of your data warehouse while ensuring data synchronicity.
Xplenty: Automated ETL Solutions for Better Analytics
Xplenty offers an array of ETL tools to speed up the process of extracting, transforming, and loading the data your business intelligence solutions require. Our powerful out-of-the-box integrations offer a simple, drag-and-drop interface, so you don't have to be a data engineer to develop advanced ETL pipelines!
Contact the Xplenty team now to learn how Xplenty can help with your data integration needs.