Business intelligence and analytics projects depend on efficient and effective data integration, which in turn depends on processes such as ETL (extract, transform, and load). Rather than analyzing data from multiple sources in place, ETL collects information within a centralized data warehouse for faster and easier processing and querying.
Given a business use case for data integration, ETL architecture defines the various systems, elements, and relationships that will enable this integration. Our previous post discussed the basics of creating an ETL architecture. But what are the specific, fundamental pieces of ETL architecture that you should know about?
In this article, we’ll do a deeper dive into the various moving parts that make up an ETL architecture and identify some of the essential questions and considerations for each one.
The Foundation of ETL Architecture
Every data architect knows that ETL stands for the three basic steps of data integration: extract, transform, and load. Yet this simplistic abbreviation glosses over some of the most important parts of the ETL process.
Below, we’ll examine ETL architecture techniques in greater depth, discussing the six crucial steps that together compose an ETL architecture, as they might apply in a real ETL pipeline. Each of these components deserves careful consideration when you build your ETL workflow:
1. Data Profiling

Data profiling is a crucial yet often unsung step in the ETL process. In short, data profiling examines your source data, checking its contents and ensuring that it is appropriate for ETL.
The three components of data profiling are:
- Structure discovery: What are the acceptable values or ranges of values for each field in the data record?
- Content discovery: What information does each data record contain, and what is the quality of this record? How is missing data represented (e.g. with values such as NULL, 0, or -1)?
- Relationship discovery: What are the relationships that exist between the given data sources, and between individual records within a data source? How can you take advantage of these connections during ETL while reducing inefficiencies (e.g. eliminating duplicate data)?
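To make the three discovery tasks concrete, here’s a minimal Python sketch of a profiling pass over a handful of records. The field names, sentinel values, and sample data are all illustrative assumptions, not the output of any particular profiling tool:

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative.
records = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": None, "country": "US"},
    {"id": 3, "age": -1, "country": "DE"},    # -1 used as a missing-data sentinel
    {"id": 3, "age": 29, "country": "DE"},    # duplicate id: a relationship issue
]

MISSING = {None, -1}  # assumed representations of missing data

def profile(rows):
    report = {"row_count": len(rows), "fields": {}}
    for field in rows[0]:
        values = [r.get(field) for r in rows]
        present = [v for v in values if v not in MISSING]
        numeric = present and isinstance(present[0], (int, float))
        report["fields"][field] = {
            "missing": sum(1 for v in values if v in MISSING),   # content discovery
            "distinct": len(set(present)),
            "min": min(present) if numeric else None,            # structure discovery
            "max": max(present) if numeric else None,
        }
    # Relationship discovery: flag a candidate key that is not actually unique
    id_counts = Counter(r["id"] for r in rows)
    report["duplicate_ids"] = [k for k, c in id_counts.items() if c > 1]
    return report

report = profile(records)
```

A real profiler would of course infer types, check referential integrity across sources, and sample much larger volumes, but the three questions it answers are the same ones listed above.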
Data Profiling Options
Enterprise-class ETL platforms often come with their own data profiling tool such as IBM InfoSphere Information Analyzer or SAP Business Objects Data Services (BODS). If you aren't using one of these platforms, you can try using a third-party data profiling tool such as Datiris Profiler or Melissa Data Profiler.
2. Extraction

While it’s not the first component of ETL architecture chronologically (that honor goes to data profiling), extraction is the “E” in ETL and the first of its three foundational steps.
Data is extracted from predefined source locations, either in real-time or in sequential batches depending on the use case. Real-time data extraction is important when you need fresh, up-to-the-minute information, while batch data processing is more efficient for handling large quantities of data at once.
OLTP (online transactional processing) databases are some of the most important sources of information for ETL workflows. These databases are optimized to handle large quantities of day-to-day transactions, including additions, edits, and deletions. For example, a bank may use an OLTP database to store records of financial transactions such as transfers and deposits.
In addition to OLTP databases, other important ETL sources include:
- Source files such as CSV, JSON, and XML/HTML files and Excel spreadsheets
- Enterprise software applications, such as CRM systems and financial consolidation and close software
- Cloud SaaS (software as a service) applications, including:
- E-commerce software such as Shopify
- Customer support tools such as Zendesk
- Productivity software such as Google’s G Suite (Gmail, Docs, Drive, etc.)
- Web services, APIs, and feeds
The extraction process will depend heavily on the architecture of your ETL pipeline. If you're building a streaming ETL pipeline using a tool like Apache Kafka, for example, your pipeline might extract data from relational databases using the Confluent JDBC connector. Each row of the source database becomes a key/value pair in a message published to a Kafka topic. Applications that subscribe to this topic can then read this information and perform the necessary transformations.
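As a simplified, vendor-neutral illustration of batch extraction, the following Python sketch pulls rows from an OLTP-style table in fixed-size batches, tracking the last extracted ID so each run picks up where the previous one left off. An in-memory SQLite database stands in for the source system, and the table schema and batch size are hypothetical:

```python
import sqlite3

# In-memory SQLite database standing in for an OLTP source (schema is illustrative).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO transactions (amount) VALUES (?)",
                   [(a,) for a in (10.0, 25.5, 99.9)])

def extract_batches(conn, batch_size=2, last_seen_id=0):
    """Yield rows in fixed-size batches, resuming after the last extracted id."""
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
            (last_seen_id, batch_size)).fetchall()
        if not rows:
            break
        last_seen_id = rows[-1][0]   # remember the high-water mark
        yield rows

batches = list(extract_batches(source))
```

Real-time extraction replaces this polling loop with a push mechanism (change data capture or a message broker), but the high-water-mark idea carries over.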
3. Data Cleansing

The next ETL step is data cleansing, which is generally lumped in with the “transform” stage of ETL. Data cleansing corrects and standardizes the extracted data before it is transformed to fit the target location.
Data cleansing is a critical task to ensure that the resulting insights and analyses are as reliable and accurate as possible. Unfortunately, far too many organizations don’t perform their due diligence when it comes to data cleansing. According to one survey by Experian Data Quality, companies believe that on average, 29% of their data is inaccurate.
The steps of data cleansing include:
- Eliminating duplicate data that can result from ingesting the same information multiple times, skewing the resulting analysis.
- Resolving discrepancies and ambiguities, such as different ways of writing an address.
- Dropping irrelevant data that has no value for BI and analytics.
- Removing outliers in the data that may indicate an error, or that may dramatically affect the final results.
- Identifying missing data, either filling in these missing values or dropping the records if no replacement is available.
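The steps above can be sketched in a few lines of Python. The specific rules below, such as the outlier cap, the duplicate key, and dropping rows with a missing amount, are illustrative assumptions that would vary with your data:

```python
# Hypothetical dirty input; field names and rules are illustrative.
rows = [
    {"id": 1, "amount": 120.0, "city": "new york"},
    {"id": 1, "amount": 120.0, "city": "New York"},      # duplicate after normalization
    {"id": 2, "amount": None,  "city": "Boston"},        # missing value
    {"id": 3, "amount": 9_999_999.0, "city": "Austin"},  # implausible outlier
]

def cleanse(rows, outlier_cap=100_000.0):
    seen, clean = set(), []
    for r in rows:
        r = dict(r, city=r["city"].title())  # resolve case discrepancies
        if r["amount"] is None:              # missing data with no replacement: drop
            continue
        if r["amount"] > outlier_cap:        # remove outliers beyond the assumed cap
            continue
        key = (r["id"], r["city"])
        if key in seen:                      # eliminate duplicates
            continue
        seen.add(key)
        clean.append(r)
    return clean

clean = cleanse(rows)
```

In practice you would often impute missing values rather than drop them, and detect outliers statistically rather than with a fixed cap; the point here is only the shape of the pipeline.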
As with data profiling, enterprise-class ETL platforms often include built-in data cleansing tools to simplify the process, such as Informatica Data Quality and SAP Agile Data Preparation. You can also write your own data cleansing and validation engine to check that your data conforms to the expected values and eliminate erroneous information.
4. Data Transformation

After data cleansing is complete, the next step is data transformation. This process prepares the extracted information by converting it from source to target formats.
Data transformation is often carried out with domain-specific languages such as XSLT (specifically for XML/HTML documents) and Apache Pig. The latter platform can draw on the power of data processing engines such as MapReduce and Apache Spark, letting users perform advanced data transformations with relatively little effort.
To perform data transformation, you first need an ETL roadmap or outline of how the information will travel from source to target systems. This will give you a better idea of how your data needs to be altered to match the target database schema.
In addition to basic changes such as data cleansing (see the previous section), the transformation stage of ETL also includes more advanced modifications such as:
- Filtering: Selecting only certain rows and columns of the data.
- Joining: Linking together data from multiple sources.
- Aggregating: Summarizing multiple rows into totals, averages, or counts.
- Bucketing: Converting exact numerical data into “buckets” (e.g. converting customer ages into age ranges such as 18-49, 65+, etc.).
- Normalizing: Scaling the data to match a certain range (e.g. between 0 and 1).
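Two of these transformations, bucketing and normalizing, can be sketched in a few lines of Python. The age ranges and the min-max scaling formula below are illustrative choices, not fixed conventions:

```python
def bucket_age(age):
    """Convert an exact age into a range (boundaries are illustrative assumptions)."""
    if age < 18:
        return "under 18"
    if age <= 49:
        return "18-49"
    if age <= 64:
        return "50-64"
    return "65+"

def normalize(values):
    """Min-max scaling: map a list of numbers onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

Applied to a column of customer ages, `bucket_age` yields the kind of ranges described above, while `normalize` rescales numeric fields so that differently-sized measures become comparable.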
Like the extraction step, your choice of ETL transformations heavily depends on how you've chosen to structure your ETL architecture—you can even perform transformations on the data after loading it, which turns your ETL architecture into ELT (extract, load, transform). To continue with our Apache Kafka example, transformations are performed using the Kafka Streams API for real-time parallel data processing.
5. Loading

Loading may be the “simplest” of the three major ETL steps, but you still have some important choices to make for your ETL architecture. In particular:
- What will be the target location of your ETL workflow—a structured database such as a data warehouse or data mart, or an unstructured repository such as a data lake?
- Will you load the data in full each time, or will you only load changes to the existing data in the target location?
- Will you overwrite existing data in the target location, or will you maintain both historical and new data side-by-side? Overwriting data saves on storage and improves performance, but at the cost of losing potentially valuable information.
Depending on your answers to these questions, you have multiple options for how and where to load your data during ETL. With Apache Kafka, the Kafka Connect framework is used to connect with external target locations, such as the Amazon Redshift data warehouse or the Amazon S3 object storage service. Kafka Connect allows you to load data into any number of databases, key-value stores, and/or file systems, as your ETL pipeline requires.
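To illustrate the incremental-load option, here’s a minimal Python sketch that upserts rows into an SQLite table standing in for the target warehouse: existing records are overwritten by key and new records inserted, rather than reloading the full dataset each time. The schema and data are hypothetical:

```python
import sqlite3

# In-memory SQLite database standing in for the target (schema is illustrative).
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def load_incremental(conn, rows):
    """Write only changed or new rows; matching ids are updated in place."""
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows)
    conn.commit()

load_incremental(target, [(1, "Ada"), (2, "Grace")])
load_incremental(target, [(2, "Grace Hopper"), (3, "Alan")])  # one update, one insert
```

Note the trade-off from the questions above: this approach overwrites the old value for id 2, which saves storage but discards history; a side-by-side design would instead insert a new versioned row.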
6. ETL Monitoring

ETL workflows are complicated at the best of times, which means that you need to continuously oversee them to ensure that they’re running at peak performance. Every stage of ETL, from extraction and data cleansing to transformation and loading, represents a potential bottleneck that can slow down your data integration processes.
Any non-trivial ETL implementation needs to include monitoring functionality that will alert you when the system experiences an unexpected drop in performance. ETL monitoring can also inform you when your workflow fails silently—for example, when a given data source is not loaded into the target location, or when it is unintentionally loaded twice.
The best practices for ETL monitoring include having a dashboard that lets you easily oversee your ETL pipeline at a glance. Many enterprise-grade ETL platforms come with pre-built monitoring and dashboard functionality, or you can create your own dashboard that's custom-built for your unique workflow. For example, ETL pipelines with Apache Kafka often use the JMX (Java Management Extensions) technology for managing and monitoring resources.
Whether you build an ETL monitoring tool in-house or use a third-party tool, your dashboard should be able to:
- Read and summarize system logs, and display the status of each part of the ETL process.
- Display the total amount of data processed and the total amount of data rejected.
- Display the total amount of time to process a given quantity of data.
- Support root cause analysis (RCA) to uncover the source of performance issues by letting users drill down into the relevant logs and information.
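As a rough sketch of the bookkeeping such a dashboard relies on, here’s a minimal Python monitor that tallies processed and rejected records, tracks elapsed time, and logs a warning when the rejection rate crosses a threshold. The threshold and the sample outcomes are assumptions for illustration:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

class EtlMonitor:
    """Track processed/rejected counts and elapsed time for one pipeline run."""

    def __init__(self, reject_threshold=0.10):  # alert above 10% rejections (assumed)
        self.processed = 0
        self.rejected = 0
        self.start = time.monotonic()
        self.reject_threshold = reject_threshold

    def record(self, ok):
        self.processed += 1
        if not ok:
            self.rejected += 1

    def summary(self):
        rate = self.rejected / self.processed if self.processed else 0.0
        if rate > self.reject_threshold:
            logging.warning("reject rate %.0f%% exceeds threshold", rate * 100)
        return {"processed": self.processed, "rejected": self.rejected,
                "seconds": round(time.monotonic() - self.start, 3)}

monitor = EtlMonitor()
for ok in (True, True, False, True):  # hypothetical per-record outcomes
    monitor.record(ok)
stats = monitor.summary()
```

A production dashboard would persist these metrics, break them down per pipeline stage, and link each rejection back to its log entries to support root cause analysis.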
The six ETL architecture components outlined above must be carefully designed and implemented to build a solid foundation for your data integration workflows. With so much to worry about and so many ways to go wrong, building an ETL pipeline from scratch is definitely not for the faint-hearted.
Instead of rolling their own ETL implementation, a growing number of businesses are opting for low-code, easy-to-use ETL platforms like Xplenty. The Xplenty platform comes with more than 100 pre-built integrations with the most common ETL sources and targets. Users can build their own data pipelines using a simple drag-and-drop interface, making your ETL workflow as seamless and worry-free as possible.
Want to learn how Xplenty’s leading data integration platform can make your organization more efficient and productive? Get in touch with our team to discuss your data needs and experience the Xplenty platform for yourself.