Data synchronization is the process of maintaining consistency between discrete data sets. It can be unidirectional, with one master version of the data whose changes flow out to the other copies, or multi-directional, where a change to any version is propagated to all the others.
Why is Data Synchronization Useful?
Data replication is the process of copying data to another location, resulting in two or more copies of the data.
These versions are not structurally dependent on each other. This means that if one version of the data is updated, it will become inconsistent with all other versions.
To ensure consistency across all versions, there needs to be a system for updating all other instances of the data. This is known as data synchronization, and it is used in several situations.
Synchronization is essential for maintaining consistency between data sources. When one source is updated, the updates are mirrored on all other sources.
For example, a customer address might exist in a number of different places in an organization's database: the CRM, the billing system, the order fulfillment system, and the customer's e-commerce account. If the customer logs into the e-commerce system and changes their address, there needs to be a data synchronization process that changes the address in all other systems.
Data synchronization is an essential element of cloud computing and distributed systems, where data can exist in multiple places. Synchronization is essential to ensure that users always have access to the most recent version of data, as well as guaranteeing that their updates will always be saved.
A common example is a cloud drive such as Dropbox or OneDrive. Users of these services can create a document on one device, save it in their cloud drive, and open it on a different device. Each time they commit changes to the document, these are stored on the cloud server. This server then forces an update on all connected devices, replacing any older versions with the latest copy.
Storage and Analysis
Data replication is often performed so that data can be safely stored in a repository such as a data warehouse. This use case might not require real-time synchronization. However, the data must be relatively recent, which calls for a regular synchronization routine. This routine is often powered by an ETL pipeline.
For example, data might be warehoused for backup purposes. In a disaster recovery scenario, the business will need an up-to-date snapshot of its production data. If it keeps its live data and its backups synced regularly, it won't experience substantial data loss.
Synchronization can include major changes, such as amendments to the structure of a relational database. Depending on how the process is implemented, it may be possible to add tables, drop tables, and rename columns.
This matters because data structures can change quite suddenly. For instance, when GDPR took effect, many organizations began asking users about their cookie preferences. These preferences had to be stored, which usually meant a new database column, if not an entirely new table. Structural changes of this kind have to be cascaded across the network to all instances of the database.
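As a rough illustration of cascading a structural change, the sketch below adds a column to several copies of the same table. It uses Python's built-in `sqlite3` module with in-memory databases standing in for the networked instances. The table name `users`, the column name `cookie_consent`, and the helper functions are all hypothetical, chosen only for this example.

```python
import sqlite3


def make_instance():
    """Create one stand-in database instance with the original schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    return conn


def column_names(conn, table):
    """List the column names of `table` using SQLite's table_info pragma."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# Hypothetical structural change: a new column for cookie preferences.
MIGRATION = "ALTER TABLE users ADD COLUMN cookie_consent INTEGER DEFAULT 0"

instances = [make_instance() for _ in range(3)]

for conn in instances:
    # Skip instances that already have the column, so the step can be re-run.
    if "cookie_consent" not in column_names(conn, "users"):
        conn.execute(MIGRATION)

schemas = [column_names(conn, "users") for conn in instances]
print(schemas)
```

Checking for the column before altering the table keeps the step safe to repeat, which is useful if some instances were unreachable on the first pass.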
How Does Data Synchronization Work?
Data synchronization can be done in any number of ways, from manual updates to Python scripts triggered by database changes, to a fully automated data pipeline using ETL.
In all instances, data synchronization follows these steps:
Update Event is Triggered
This can happen in a number of ways. A flag might be set within the table, for example, or a script might regularly check the last modified date of a file. In all cases, the data synchronization process detects that there has been a change to one instance of the data.
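The file-based check mentioned above might look something like the following sketch, which compares a file's current modification time against the last value the sync process recorded. The function name `detect_change` and the file name are assumptions made for illustration. The demo sets the timestamp explicitly so that it does not depend on the file system's clock resolution.

```python
import os
import tempfile


def detect_change(path, last_seen_mtime_ns):
    """Return (changed, current_mtime_ns) for the file at `path`."""
    current = os.stat(path).st_mtime_ns
    return current != last_seen_mtime_ns, current


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "customers.csv")  # hypothetical file name
    with open(path, "w") as f:
        f.write("id,address\n1,Old Street\n")

    # The first check records the baseline timestamp.
    _, baseline = detect_change(path, None)
    unchanged_result, _ = detect_change(path, baseline)

    # Simulate a later edit: rewrite the file and move its modification
    # time forward by ten seconds.
    with open(path, "w") as f:
        f.write("id,address\n1,New Street\n")
    os.utime(path, ns=(baseline, baseline + 10_000_000_000))
    changed_result, _ = detect_change(path, baseline)

print(unchanged_result, changed_result)
```

A scheduled job could call `detect_change` at each interval and start the rest of the synchronization process only when it reports a change.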
Changes are Identified and Extracted
Synchronization doesn't mean copying the full data set every time. The synchronization process only needs to identify what has changed and extract those changes. This is done by comparing versions, by checking changelogs, and by looking for flags that indicate a new value.
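One way to picture the version comparison is the sketch below, which assumes that every record carries an integer version number and that the replica remembers the last version it received for each key. Both assumptions, and the name `extract_changes`, are illustrative rather than part of any particular tool.

```python
def extract_changes(source, replica_versions):
    """Compare per-record versions and return only what needs to move.

    source:           key -> {"version": int, "data": ...}
    replica_versions: key -> int (last version the replica received)
    """
    # New or updated records: missing from the replica, or newer at the source.
    upserts = {
        key: row
        for key, row in source.items()
        if replica_versions.get(key, -1) < row["version"]
    }
    # Records the replica still holds but the source no longer has.
    deletes = [key for key in replica_versions if key not in source]
    return upserts, deletes


source = {
    1: {"version": 2, "data": "1 New Street"},  # updated since last sync
    2: {"version": 1, "data": "5 Elm Road"},    # unchanged
    3: {"version": 1, "data": "9 Oak Lane"},    # new record
}
replica_versions = {1: 1, 2: 1, 4: 1}           # record 4 was deleted at source

upserts, deletes = extract_changes(source, replica_versions)
print(sorted(upserts), deletes)
```

Only records 1 and 3 are extracted for transfer, and record 4 is flagged for deletion. Record 2 is left alone because its version matches.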
Pass Changes to Other Sources
There are two ways of scheduling the movement of data:
- Asynchronous: Changes are transmitted according to a set schedule, such as once a day or once an hour. This is resource-efficient but means that discrepancies may arise between updates.
- Synchronous: When a change occurs, it forces the synchronization process to run. This is more resource-intensive, but it allows for real-time updates of data.
The data transfer itself might be performed by a web request or a file transfer. When using an ETL platform, updates are processed automatically in the background without manual intervention.
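The difference between the two scheduling modes can be sketched with a toy in-memory source that either pushes each write immediately or queues it until the next scheduled run. The `Source` class and its `immediate` flag are hypothetical, and a plain dictionary stands in for the remote copy.

```python
class Source:
    """Toy data source that forwards writes to a replica dictionary."""

    def __init__(self, replica, immediate):
        self.data = {}
        self.replica = replica
        self.immediate = immediate  # True: push on every change
        self.pending = {}           # changes waiting for the next scheduled run

    def write(self, key, value):
        self.data[key] = value
        if self.immediate:
            self.replica[key] = value   # change forces the sync to run now
        else:
            self.pending[key] = value   # change waits for the schedule

    def run_scheduled_sync(self):
        """Called by a scheduler, for example once an hour."""
        self.replica.update(self.pending)
        self.pending.clear()


realtime_replica, scheduled_replica = {}, {}
realtime = Source(realtime_replica, immediate=True)
scheduled = Source(scheduled_replica, immediate=False)

realtime.write("address", "1 New Street")
scheduled.write("address", "1 New Street")
stale_before_run = dict(scheduled_replica)  # still empty at this point

scheduled.run_scheduled_sync()
print(realtime_replica, stale_before_run, scheduled_replica)
```

The scheduled replica stays out of date until `run_scheduled_sync` is called, which is the discrepancy window described in the first bullet.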
Parse Incoming Changes
The two instances of data might not share the same structure. One instance might have a different table layout or be integrated with other data. A data warehouse, for example, will combine data from multiple disparate sources. Incoming data must therefore pass through a transformation layer, which will include cleansing and harmonization.
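A transformation layer can be as small as a field mapping plus a few cleansing rules. The sketch below assumes a source system and a warehouse that use different column names. The mapping in `FIELD_MAP` and the cleansing rules are invented for this example.

```python
# Hypothetical mapping from source column names to warehouse column names.
FIELD_MAP = {
    "cust_id": "customer_id",
    "addr": "address",
    "zip": "postal_code",
}


def transform(record):
    """Rename fields and apply simple cleansing to one incoming record."""
    out = {}
    for source_name, target_name in FIELD_MAP.items():
        value = record.get(source_name)
        if isinstance(value, str):
            # Cleansing: trim the ends and collapse repeated whitespace.
            value = " ".join(value.split())
        out[target_name] = value

    # Harmonization: store postal codes in upper case everywhere.
    if isinstance(out["postal_code"], str):
        out["postal_code"] = out["postal_code"].upper()
    return out


incoming = {"cust_id": 7, "addr": "  12  High   St ", "zip": "ab1 2cd"}
result = transform(incoming)
print(result)
```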
Apply Changes to Existing Data
Incoming changes are written to the other data source. There are a number of ways to apply these changes, such as:
- Transactional: Changes are applied one by one, in the same order in which they originally occurred. This ensures that every instance of the data has the same local change history.
- Snapshot: Changes are applied in aggregate. This ensures that all data is identical in the end, but only the original version has the full change history.
- Merge: If changes occur on both sides, and neither version is marked as being definitive, then the changes are merged. This means that both instances of the data are updated to reflect all changes.
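The three approaches above can be compared in a minimal sketch. Note that the merge function resolves conflicts with a last-writer-wins rule based on timestamps. That rule is an assumption chosen for this example, since the right merge policy depends on the system. All function names are hypothetical.

```python
def apply_transactional(target, history, changes):
    """Replay each change in its original order, keeping a local history."""
    for key, value in changes:
        target[key] = value
        history.append((key, value))


def apply_snapshot(target, changes):
    """Apply only the end result; intermediate values are not recorded."""
    final_state = dict(changes)  # later entries overwrite earlier ones
    target.update(final_state)


def merge(side_a, side_b):
    """Combine two sides of key -> (timestamp, value), newest write wins."""
    merged = dict(side_a)
    for key, (timestamp, value) in side_b.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


changes = [("address", "A"), ("address", "B"), ("phone", "1")]

replayed, history = {}, []
apply_transactional(replayed, history, changes)

snapshotted = {}
apply_snapshot(snapshotted, changes)

merged = merge(
    {"address": (1, "A"), "phone": (5, "1")},
    {"address": (2, "B"), "email": (3, "e")},
)
print(replayed == snapshotted, len(history), merged)
```

Both `replayed` and `snapshotted` end in the same state, but only the transactional path records all three changes in its history.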
The goal of this process is to ensure that each instance of data is correctly updated without any loss.
Confirm Successful Update
Finally, the updated system confirms that the update succeeded. This can be done in a number of ways. For example, if the update is made through an API, the API returns a response confirming the update.
This confirmation tells the update process that the update is complete. If no such message is received, then the process will attempt the update again, or else it will return an error message.
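The confirm-or-retry behavior might be sketched as follows. The `send` callable stands in for whatever transport is used, and the `"ok"` response, the `ConnectionError` failure mode, and the helper names are assumptions for illustration. A production process would likely also wait between attempts.

```python
def push_with_retry(send, payload, max_attempts=3):
    """Send `payload` until the target confirms, or give up with an error.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = send(payload)
        except ConnectionError:
            response = None  # a dropped connection counts as no confirmation
        if response == "ok":
            return attempt
    raise RuntimeError(f"no confirmation after {max_attempts} attempts")


def make_flaky_target(failures):
    """Build a fake target that fails `failures` times before succeeding."""
    state = {"remaining": failures, "stored": None}

    def send(payload):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("simulated outage")
        state["stored"] = payload
        return "ok"

    return send, state


send, state = make_flaky_target(failures=2)
attempts_used = push_with_retry(send, {"address": "1 New Street"})
print(attempts_used, state["stored"])
```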