In today’s 24/7 digital world, real-time data is a necessity to stay relevant for today’s businesses. Companies who wish to remain competitive must be able to quickly respond to customer demand and adjust to market changes. Supplying business leaders with real-time information for informed decision-making can be a challenge with information spread among disparate systems. A robust ETL process complemented by change data capture techniques is the key to providing the most accurate and up-to-date information. Here, we’ll discuss change data capture and why it is important to your integration efforts.

Table Of Contents

What is Change Data Capture and Why is It Important

Traditionally, the most up-to-date information lives, not in the warehouse, but in the operational systems, employees use each day. Changes or updates made to the operational systems must somehow be reflected in the warehouse. That is where Change Data Capture (CDC) comes in. CDC is the process of providing near-real-time updates of information from operational systems to the warehouse. 

What Tables Use CDC?

When thinking of CDC, it is probably better to view it in the context of its purpose, rather than what tables use it. CDC is a strategy that helps make the ETL process more efficient by ingesting only the changes made since the last ETL process rather than the entire data set. Thus, it can be configured in the ETL process to capture the changes you deem necessary for your integration. 

How to Implement CDC

There are multiple methods for performing CDC each of which has specific advantages and disadvantages. 

Database Dump

The simplest solution involves simply exporting the database and importing it to the data warehouse. This approach is only effective with small datasets that don't change frequently. As the size of the database grows, the export process becomes less efficient which warrants a more robust approach.

Timestamp

The timestamp technique depends on a timestamp field in the source table to determine when to perform an update. One key downside to this method is that it uses the query layer for extracting information, which puts an additional load on the source. Another downside is that it requires recurring polling of the table which can affect latency. Perhaps the biggest downside to this method is that it cannot capture deletions unless the app uses soft deletions.

Diff

The Diff method involves creating a snapshot of the complete extract in the target staging area. The next time information needs to be loaded, a second version of the snapshot is compared to the first snapshot to spot changes. Similar to the timestamp method, this method uses the query layer for extracting data which puts an additional load on the source. Also, the diff method is a compute-heavy operation that could affect the performance of the source. Lastly, the diff method requires significant storage for caching snapshots. 

Log-Based

Most enterprise databases store all transaction information in a log for system recovery. Log-based CDC analyzes these logs to determine when changes have occurred and pushes these updates to the warehouse in real-time. Log-based change capture is an asynchronous process so there is no additional load on the database to identify changes. A major downside to this approach is that log files are often vendor-specific, which makes it difficult to migrate to other database vendors.

Trigger-Based

Triggers are functions used to capture changes based on database events. These functions typically run on specific events such as “BEFORE UPDATE” or “AFTER INSERT.” Triggers run on the database while changes are being made, which ultimately affects the performance of the original operation. Triggers also write changes to a separate set of “trigger tables” which puts an additional load on the system. This approach can be resource-intensive as the code must retrieve changes from the trigger tables. This approach also requires constant polling of the event table. 

Integrate.io’s ETL tool and extensive suite of integrations come with log-based and trigger-based CDC so you can focus on ingesting the most relevant updates. 

How Integrate.io Can Help

Integrate.io’s diverse and feature-rich ETL tool includes over 100 pre-built integrations and CDC. The simple drag-and-drop interface is the perfect no-code/low-code tool to empower everyone in the company to build valuable data pipelines. Sign up for a pilot to see how Integrate.io can enable CDC in your integrations.