Getting the most up-to-date information so your organization can continue to make smart, data-driven decisions is crucial. But there's one big problem: How do you know when your information is out-of-date and your data integration workflow needs to be refreshed?

Instead of addressing this question, many businesses still rely on batch processing, which runs data integration jobs at regular intervals to capture any changes that have happened since the last job. But what if you could get notifications of real-time data set changes so that you only have to update the changed data?

That's where change data capture (CDC) comes in. So what is change data capture, exactly? What are the different types of CDC, and how does CDC integrate with ETL? We have all the answers below.

undefined

Enjoying This Article?

Receive great content weekly with the Xplenty Newsletter!

Octopus

Table of Contents

What is Change Data Capture?

Change data capture comprises the processes and techniques that detect the changes made to a source table or source database, usually in real-time. The changed rows or entries then move via data replication to a target location (e.g. a data warehouse from a provider such as AWS, Microsoft Azure, Oracle, or Snowflake). A company can then use this updated data for business intelligence (BI) and data analytics workflows.

Types of Change Data Capture: Log-Based CDC vs. Trigger-Based CDC

There are two main ways of performing change data capture: log-based CDC and trigger-based CDC.

In log-based CDC, the CDC solution looks at a database's transaction log. This metadata file records all database changes for easier recovery in the event of a database crash. However, you can also use this changelog file to perform CDC; this CDC solution reads the file to discover the changes made to the source system and then performs data replication of the changes to the target datastore.

The pros of log-based CDC include:

  • High reliability: Because database transaction logs aim to help databases recover from failure, their information is highly accurate and reliable.
  • Low latency: The transaction log is separate from the database itself, which makes it easy to monitor changes in real-time.

Integrate Your Data Today!

Try Xplenty free for 14 days. No credit card required.

Octopus

The cons of log-based CDC include:

  • Proprietary formats: Each relational or nonrelational database (e.g. Microsoft SQL Server, MySQL, PostgreSQL, etc.) may have its own proprietary log file format, which makes it harder to build a robust, generic solution.
  • High complexity: Databases with frequent changes will have very large transaction logs, which makes it harder to perform CDC.

In trigger-based CDC, the CDC solution uses database triggers, which are functions that run when another event occurs (e.g. when data loads into the database). For example, MySQL has a BEFORE UPDATE trigger that automatically kicks in before a table update. Database triggers decrease the overhead for extracting changes when doing CDC, but they also add overhead to the source system because they need to run every time the database updates.

However, there is more to change data capture than just whether it's log-based CDC or trigger-based CDC. For example, you can scan the metadata timestamp columns in a database, such as DATE_MODIFIED and LAST_UPDATED, to see which rows have gone through changes since the last CDC run.

You may write a custom CDC script to perform this check, but it may be brittle and stop working if the underlying database schema changes. In addition, this method might not capture any deleted rows.

Another option is to use the diff terminal command to compare the current and previous states of the database. This method can detect deleted rows but is highly resource-intensive because it scans the entire data volume.

Change Data Capture and ETL

The most significant benefit of change data capture is that it can save you a great deal of unnecessary work. Rather than replicating the entire database, it only replicates those records that have changed since the last update. One crucial use case of CDC, therefore, is for ETL.

The ETL process extracts information from one or more data sources (e.g. files, SQL and NoSQL databases, websites, SaaS applications, etc.). It then cleans and transforms it as necessary. Finally, the ETL process loads the data into a target data warehouse or data lake.

Change data capture can save you a lot of time and effort during ETL, and it can also enable real-time ETL so that you always have access to the latest data-driven insights.

How Xplenty Can Help with Change Data Capture

Enjoying This Article?

Receive great content weekly with the Xplenty Newsletter!

Octopus

Need a CDC and ETL solution for your enterprise data? Xplenty is here to help. Xplenty is a user-friendly ETL and data integration platform. With a simple drag-and-drop interface and more than 100 pre-built connectors, the platform helps organizations of all sizes build powerful ETL pipelines.

Using log-based or trigger-based CDC alongside Xplenty, you can detect the database records that need to be updated, and then migrate them into your cloud data warehouse. Want to learn more about why Xplenty is the best ETL and data integration tool on the market? Get in touch with our team of data experts today to talk about your business needs and objectives, or to start your 14-day pilot of the Xplenty platform.