Data transfer is the process of copying data from one location to another. The transferred data may be transformed in transit, or arrive at its destination as-is.
When the transfer process results in two copies of the data, this is known as data replication. When the original data source is retired after the transfer, the process is called data migration.
How Is a Data Transfer Performed?
A data transfer involves at least two steps. First, data is obtained from the original source, which is called extraction. After that, the data is written to the target destination, a process known as loading. These steps can be performed manually or automatically.
Manual Data Transfer
For one-off jobs, data owners may choose to do a manual data transfer. The process for doing so depends on the nature of both the source and destination. Some options include:
- API call: Many systems have a set of APIs that allow data retrieval. Data is usually exported as a file, such as a JSON, XML, or CSV file.
- Manual export: Some legacy systems might only allow data export through a built-in export function. The output will typically be a semi-structured file, such as CSV.
- Coding: In some instances, there might be a need to write a small application to pull data from a data source. This application will often be written in Python or R.
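A small extraction script of the kind described above often amounts to converting an API's JSON response into a CSV export file. The sketch below shows that step in Python; the payload and field names are hypothetical stand-ins for a real API response.

```python
import csv
import io
import json

def records_to_csv(json_payload: str) -> str:
    """Convert a JSON array of flat records (the shape many export
    APIs return) into CSV text ready for loading elsewhere."""
    records = json.loads(json_payload)
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Hypothetical payload standing in for an API response.
payload = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]'
csv_text = records_to_csv(payload)
print(csv_text)
```

In a real job, the payload would come from an HTTP request to the source system's API, and the CSV text would be written to a file for loading.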
The output file is transferred to a location where it is accessible by the destination database. If the export file is to leave the organization’s security perimeter, this transfer must be done in a way that complies with security best practices.
Manual transfers can be automated to an extent using batch files and cron jobs, but true automation generally requires an ETL (Extract, Transform, Load) platform.
Automated Data Transfer
A data pipeline is a software process that automatically transfers data from source to destination. ETL platforms are often used to implement data pipelines.
The data pipeline is integrated with the data sources, often using the ETL platform’s built-in library of integrations. Extracted data is passed through a transformation layer, ensuring that the transported data is compatible with the destination structure. Transformation can also remove invalid data from the transfer.
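A transformation layer of this kind typically validates each record and normalizes values to fit the destination schema. This is a minimal sketch, assuming a hypothetical schema with `id` and `email` fields; real ETL platforms provide this as configurable transformation steps.

```python
def transform(records, required_fields=("id", "email")):
    """Keep only records that carry a non-empty value for every
    required field, and normalize string values so they match the
    destination's expected format."""
    clean = []
    for record in records:
        # Drop invalid records from the transfer.
        if all(record.get(field) for field in required_fields):
            clean.append({
                key: value.strip().lower() if isinstance(value, str) else value
                for key, value in record.items()
            })
    return clean

rows = [
    {"id": 1, "email": " Ada@Example.com "},
    {"id": 2, "email": ""},  # invalid: dropped by the transform
]
cleaned = transform(rows)
print(cleaned)
```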
Finally, data is loaded to the destination. This can be done in two ways.
- Asynchronous transfer: Data transfer happens on a regular schedule. Usually, the transfer job is set to run at night or whenever the network is at its least busy. This is the most resource-efficient approach, but it means that data is not always in sync between source and destination.
- Synchronous transfer: Data is transferred whenever the source is updated. The two databases are synced in real-time, which means that the destination always holds timely data. This method can be more resource-intensive.
An ETL-driven data pipeline may have a mix of synchronous and asynchronous transfers, with different schedules for each source.
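The difference between the two modes can be sketched as follows: synchronous sources push each update straight through to the destination, while asynchronous sources accumulate updates until a scheduled flush. The class and method names here are illustrative, not a real ETL platform's API.

```python
class Pipeline:
    """Toy pipeline mixing transfer modes for different sources."""

    def __init__(self):
        self.destination = []  # stands in for the target database
        self.batch = []        # updates waiting for the next scheduled run

    def on_update(self, record, synchronous=False):
        if synchronous:
            self.destination.append(record)  # real-time load
        else:
            self.batch.append(record)        # deferred until the schedule fires

    def scheduled_flush(self):
        """Runs on a schedule, e.g. a nightly job."""
        self.destination.extend(self.batch)
        self.batch.clear()

pipe = Pipeline()
pipe.on_update({"id": 1}, synchronous=True)  # arrives immediately
pipe.on_update({"id": 2})                    # queued for the next run
print(len(pipe.destination))  # → 1
pipe.scheduled_flush()
print(len(pipe.destination))  # → 2
```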
What Are the Most Important Considerations in Data Transfer?
Any data transfer comes with a certain degree of risk. There is the risk of data loss, the risk of data corruption, and potential exposure to third parties.
When planning a data transfer, the organization must consider the following:
Security

Data is at its most vulnerable when it is in transit between locations, especially if it is traveling outside of the organization’s security perimeter. The file could be intercepted by a third party, who could extract sensitive information from the export.

In a manual transfer, the export file should always be stored in a secure location, such as a cloud storage facility. Automated transfers, such as those performed by an ETL platform, keep data within the pipeline rather than leaving intermediate export files behind, which reduces its exposure during transit.
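Alongside secure storage, it is common to verify that an export file arrives unmodified by comparing checksums computed at the source and the destination. A minimal sketch using Python's standard library (note that a checksum detects corruption or tampering after the fact; it does not prevent interception):

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest of an export file's contents, computed before
    sending and again after receiving to confirm integrity."""
    return hashlib.sha256(data).hexdigest()

export = b"id,name\n1,Ada\n"   # contents of a hypothetical export file
sent = checksum(export)        # recorded at the source
received = checksum(export)    # recomputed at the destination
print(sent == received)  # → True when the file arrived intact
```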
Availability

Data must be available to all users and processes when required. This means that the destination must be updated according to a schedule that suits business needs. The source must also remain available while in use.
When planning a data transfer, the data team must consider the user requirements at both source and destination. Asynchronous transfers generally have the least impact on the performance and availability of source data. However, if the users at the destination need real-time data, then synchronous transfer might be used instead.
Reliability

Any kind of regular data transfer must follow the schedule reliably. If the data is being used in production, a schedule disruption might cause a system failure. If the data is being archived, a disruption might result in data loss at the destination repository.

For this reason, automated data transfers are generally preferred for regular transactions. A data pipeline powered by ETL will run in the background according to schedule and send a report if issues arise. Manual transfers are more likely to go wrong and cause data loss.
Cost

Every data transfer incurs a cost in terms of resources. When using a cloud service such as AWS, there's also a financial cost for data transfers between services. Best practice is to reduce this cost by transferring data in the most efficient way possible.

Automation can help make data transfers more efficient, and the right mix of synchronous and asynchronous jobs can help maximize resources further. The challenge is to find the most efficient solution while also maintaining security, reliability, and availability.
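One common efficiency technique is incremental (delta) transfer: each scheduled run moves only the rows that changed since the previous run, rather than the whole table. A minimal sketch, assuming the source table carries an `updated_at` column (a hypothetical name):

```python
def incremental_extract(rows, last_run):
    """Select only the rows modified since the previous transfer,
    so each run moves (and pays for) the minimum amount of data."""
    return [row for row in rows if row["updated_at"] > last_run]

table = [
    {"id": 1, "updated_at": 10},  # unchanged since the last run
    {"id": 2, "updated_at": 55},  # modified: must be transferred
]
delta = incremental_extract(table, last_run=20)
print(delta)  # only the modified row moves
```

After each run, the pipeline would record the new high-water mark to use as `last_run` next time.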
Latency

Latency can be an unpredictable factor in database architecture, as data transfer speeds can vary according to factors such as network conditions. The impact of latency can be mitigated with careful design and attention to infrastructure issues, such as low bandwidth.

Latency can be an even bigger issue when working with Big Data. It's important to use a data architecture that minimizes the transfer distance and reduces the number of network hops required, so that data can move as quickly as possible.
Redundancy

Data transfer may create two persistent copies of data. In some cases, this might be a requirement – for example, when archiving production data, or when sharing data between systems that aren’t otherwise integrated. However, this can be inefficient if there is no requirement for a second copy of the data.

This is an issue of good data governance. The project stakeholders should have a clear understanding of the data requirements on both sides of the pipeline. If the destination doesn’t need a full copy of the source data, then only a partial transfer is required. If one version of the data becomes obsolete, it should be immediately deleted.
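A partial transfer often means projecting each record down to just the fields the destination needs before loading, so the rest never leaves the source. A small sketch; the field names are hypothetical examples.

```python
def partial_transfer(records, needed=("id", "total")):
    """Keep only the fields the destination requires, instead of
    copying whole rows across systems."""
    return [{key: row[key] for key in needed if key in row}
            for row in records]

orders = [{"id": 7, "total": 99.5, "customer_ssn": "000-00-0000"}]
projected = partial_transfer(orders)
print(projected)  # the sensitive column never leaves the source
```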
Compliance

Transferring data can have compliance implications, especially when transferring personal information. This kind of data is covered by laws such as CCPA and GDPR, which govern how data can be processed and transferred. You may not be able to transfer personal data outside of your network or across international borders.
Transfers might sometimes involve an intermediate stage that can have compliance implications. For example, if you transfer EU data via an ETL platform based outside of Europe, you might be breaching GDPR. Make sure your provider is compliant with all relevant laws. Xplenty operates in accordance with GDPR, CCPA, and most privacy laws that may impact U.S.-based businesses.