What is Data Lineage?

Data lineage is a type of metadata that records the full history of the data it describes. It identifies where that data originated and details every operation performed on it since it was created.

Why is Data Lineage Important?

Data rarely stays in one place, especially in a modern enterprise environment. Data can be copied from one platform to another, merged with other data sources, subjected to data cleansing programs, and processed via ETL.

Even within a single system, data can be altered by queries. The end result is that a piece of data may have gone through multiple transformations before arriving at its ultimate destination, and any of these transformations could impact the validity of the data.

Data lineage works as a kind of changelog for this data, recording every operation that has taken place. This can be useful for:

  • Auditing: During an audit, data lineage will clarify where data came from and how it has been transformed since it was created.
  • Compliance: Some organizations may be required to store data lineage metadata to meet compliance obligations, such as those arising from GDPR.
  • Quality control: If there is a loss of quality or an error during ETL, data lineage will help to pinpoint where the problem occurred.
  • Activity monitoring: Data lineage documents the points at which data is being queried or amended. This can help to identify system dependencies or flag up unauthorized activity.
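
As a rough illustration of the changelog idea and the quality-control use case above, the sketch below logs one lineage entry per transformation and then searches the log for the first step that lost rows. The names LineageStep, run_pipeline, and first_lossy_step, along with the sample records, are hypothetical and chosen only for this example. Real lineage tools capture far more detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class LineageStep:
    """One changelog entry: the operation that ran and the row counts around it."""
    operation: str
    rows_in: int
    rows_out: int


def run_pipeline(
    rows: List[Row], steps: List[Tuple[str, Callable[[List[Row]], List[Row]]]]
) -> Tuple[List[Row], List[LineageStep]]:
    """Apply each (name, function) step in order, logging one entry per step."""
    log: List[LineageStep] = []
    for name, func in steps:
        rows_in = len(rows)
        rows = func(rows)
        log.append(LineageStep(name, rows_in, len(rows)))
    return rows, log


def first_lossy_step(log: List[LineageStep]) -> Optional[LineageStep]:
    """Return the earliest step that output fewer rows than it received, if any."""
    for step in log:
        if step.rows_out < step.rows_in:
            return step
    return None


# Hypothetical sample data and transformations (assumptions for illustration).
raw = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]
steps = [
    ("copy_from_source", lambda rs: list(rs)),
    ("drop_missing_email", lambda rs: [r for r in rs if r["email"]]),
    ("uppercase_email", lambda rs: [{**r, "email": str(r["email"]).upper()} for r in rs]),
]

cleaned, lineage_log = run_pipeline(raw, steps)
culprit = first_lossy_step(lineage_log)
if culprit is not None:
    print(f"Rows were first lost in step: {culprit.operation}")
```

Row counts are only one possible signal. The same log could carry checksums, schema snapshots, or error counts, depending on what a quality review needs to pinpoint.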

Data lineage is often an essential element of data governance strategies in large organizations.

What is the Process for Data Lineage?

Basic lineage data can be gathered manually and stored in a spreadsheet or document. For more substantial enterprise data sources, lineage data is usually captured automatically, either by features built into platforms such as SAS and Informatica or by dedicated tools such as Octopai.

In all instances, data lineage has to include some crucial details:

  • What is the nature of the data? In particular, data lineage must capture the privacy level of the data, so that it is easy to tell sensitive information (such as customer or employee data) from non-sensitive information (such as product information).
  • When was the data created or amended? Each transaction should be timestamped in a single, consistent format so that transaction dates can be easily compared and analyzed.
  • Who has performed operations on the data? The source of the action must be identified, although this might not refer to a person. It could also refer to a system process or an API call. 
  • Why is this data being stored? This question is especially important in the context of regulations such as GDPR, under which personal data may be collected and kept only for specified, legitimate purposes.
  • How is this data being used? This should outline all applications that have a dependency on this data, as well as any reports that include this data in their results.
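
The five questions above map naturally onto the fields of a single lineage record. The following is a minimal sketch of one possible layout, not a standard schema: the LineageRecord class, the Sensitivity levels, the make_record helper, and all sample values are assumptions made for illustration. Timestamps are normalized to ISO 8601 in UTC as one way of keeping them in a consistent format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional, Tuple


class Sensitivity(Enum):
    """What: the privacy level of the data (hypothetical two-level scheme)."""
    SENSITIVE = "sensitive"
    NON_SENSITIVE = "non-sensitive"


@dataclass(frozen=True)
class LineageRecord:
    dataset: str
    sensitivity: Sensitivity          # What is the nature of the data?
    timestamp: str                    # When: ISO 8601 string, always in UTC
    actor: str                        # Who: a person, system process, or API client
    operation: str                    # The action that was performed
    purpose: str                      # Why the data is being stored
    consumers: Tuple[str, ...] = ()   # How: dependent applications and reports


def make_record(
    dataset: str,
    sensitivity: Sensitivity,
    actor: str,
    operation: str,
    purpose: str,
    consumers: Tuple[str, ...] = (),
    when: Optional[datetime] = None,
) -> LineageRecord:
    """Build a record, converting the timestamp to UTC so all entries compare cleanly."""
    when = when if when is not None else datetime.now(timezone.utc)
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    stamp = when.astimezone(timezone.utc).isoformat()
    return LineageRecord(dataset, sensitivity, stamp, actor, operation, purpose, consumers)


# Hypothetical example: a nightly job amends a customer table at 10:30 local time (UTC+1).
local_time = datetime(2024, 1, 15, 10, 30, tzinfo=timezone(timedelta(hours=1)))
record = make_record(
    dataset="crm.customers",
    sensitivity=Sensitivity.SENSITIVE,
    actor="system:nightly-etl",
    operation="merge",
    purpose="order fulfilment",
    consumers=("billing-app", "monthly-sales-report"),
    when=local_time,
)
print(record.timestamp)
```

Freezing the dataclass makes each record immutable once written, which suits an append-only history. Rejecting naive timestamps avoids silently mixing time zones in the log.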