A data repository is a structure consisting of one or more databases, containing data for the purpose of analysis.
Data repositories are used in business to provide a centralized source of information. Such a repository might be used by business units to run reports or be used by analytics teams to study performance. Data repositories are also popular in academia, where they provide a reliable corpus of information to scientists and researchers.
A data repository may also be referred to as a data library or a data archive.
What is the Structure of a Data Repository?
A data repository can have any structure that is suitable for the relevant business requirements.
Common structures include:
- Relational databases: A single relational database can act as a repository if required. A live database, one that is being regularly updated, is not suitable for direct use as a repository. Instead, an organization may replicate the live database and analyze the copy, which shows the data as it appears in production.
- Data warehouse: A warehouse is a repository that unifies data from multiple sources. Often, this data will pass through an Extract, Transform, Load (ETL) layer that integrates and harmonizes the data. This makes it easier to analyze and run reports on the contents of the repository.
- Data mart: A data mart is essentially a smaller data warehouse. Data marts are driven by a specific business purpose, so the data in this repository is only that information relevant to a particular department. For example, a marketing data mart might only contain marketing data.
- Data lake: Data lakes are structures for vast quantities of non-integrated data from multiple sources. They're commonly associated with Big Data and can hold structured, semi-structured, and unstructured data. Analyzing a lake-style repository typically calls for different tools than a warehouse does, such as Hadoop and its MapReduce processing model.
- Metadata repository: This is any kind of repository that exclusively holds metadata, which is data that refers to other data. Metadata can be used to analyze broad trends or to keep track of the location of other datasets.
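- Data cubes: A data cube organizes values along several dimensions, such as time, region, and product. When time is one of those dimensions, the cube effectively holds the data as it stood at multiple points in time. This structure is used to track variations in data and capture sequences of events.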
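To illustrate how a cube-like structure lets analysts track variations over time, the sketch below aggregates a handful of records along chosen dimensions. It is a minimal illustration only: the `FACTS` rows, the dimension names, and the `roll_up` helper are all hypothetical and not part of any particular product.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, region, product, sales_amount)
FACTS = [
    ("2024-01", "EU", "widget", 100),
    ("2024-01", "US", "widget", 150),
    ("2024-02", "EU", "widget", 120),
    ("2024-02", "EU", "gadget", 80),
    ("2024-02", "US", "gadget", 60),
]

# Names of the dimension columns, in the same order as in each fact row
DIMENSIONS = ("month", "region", "product")


def roll_up(facts, keep):
    """Sum the sales measure over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


# Totals per month, then per (month, region)
by_month = roll_up(FACTS, ["month"])
by_month_region = roll_up(FACTS, ["month", "region"])
```

Keeping only the time dimension shows how the total changes from one period to the next, while adding a second dimension breaks each period down further.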
How are Data Repositories Built?
Organizations can choose to build a data repository in any way that suits their needs. In general, the process follows these steps.
1. Define the data repository requirements
Each repository exists to meet certain requirements, such as a specific business objective. If the goal is more detailed performance analytics, then the analytics team will need a repository of performance data. If the goal is better financial reporting, the repository must hold all financial data. The repository should hold all data required by the end-user to achieve their objective.
2. Look for suitable existing repositories
In some instances, a suitable repository may already exist within the organization. A wide-ranging analytics project may use the company's data lake, while specific departments may have their own existing data marts. If these repositories can fulfill the objectives, there is no need to build a new structure. Otherwise, the data team will start building a new repository.
3. Identify relevant data sources
Structures like data marts and data warehouses draw from multiple, disparate sources. The team building the repository will start by identifying all data sources and mapping each source's schema.
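One way to keep track of the schema mapping is a simple lookup from each source column to a target column, plus a check for columns nobody has mapped yet. The sketch below assumes two made-up sources (`crm` and `billing`); every source name, column name, and the `unmapped_columns` helper are hypothetical.

```python
# Hypothetical source schemas: source name -> {column name: type}
SOURCE_SCHEMAS = {
    "crm": {"cust_id": "int", "full_name": "str", "signup": "date"},
    "billing": {"customer": "int", "amount_cents": "int", "paid_on": "date"},
}

# Hypothetical mapping from (source, column) to a target repository column
COLUMN_MAP = {
    ("crm", "cust_id"): "customer_id",
    ("crm", "full_name"): "customer_name",
    ("crm", "signup"): "signup_date",
    ("billing", "customer"): "customer_id",
    ("billing", "amount_cents"): "amount_cents",
}


def unmapped_columns(schemas, column_map):
    """Return the (source, column) pairs that have no target column yet."""
    return sorted(
        (source, column)
        for source, columns in schemas.items()
        for column in columns
        if (source, column) not in column_map
    )


# Columns the team still needs to map or deliberately exclude
missing = unmapped_columns(SOURCE_SCHEMAS, COLUMN_MAP)
```

Here `missing` flags the one billing column that has not been mapped, which the team would either map or document as out of scope.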
4. Create a target repository
The data team will consider the structure most suited to both the project's goals and the nature of the data sources. They will usually pick one of the structures listed above. This may involve a physical implementation, such as deploying a new database or purchasing some additional cloud hosting.
5. Apply a transformation schema
Extract, Transform, Load (ETL) is a widely used method for importing disparate sources into a repository such as a data mart or data warehouse. ETL extracts data from each source, transforms it to fit a single target schema, and loads it into the repository. The result is clean, consistent data that is easy to analyze. For large structures like data lakes, the data is instead loaded without being transformed.
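A minimal sketch of the ETL idea follows, using Python's built-in `sqlite3` module with an in-memory database as a stand-in for the target repository. The two sources, their record shapes, and the `customer` table are assumptions made up for illustration; a real pipeline would read from actual systems and handle errors.

```python
import sqlite3

# Hypothetical extracts from two sources with different shapes
CRM_ROWS = [
    {"cust_id": 1, "full_name": " Ada Lovelace "},
    {"cust_id": 2, "full_name": "alan turing"},
]
SHOP_ROWS = [("3", "GRACE HOPPER")]


def extract():
    """Yield (source_name, raw_record) pairs from every source."""
    for record in CRM_ROWS:
        yield "crm", record
    for record in SHOP_ROWS:
        yield "shop", record


def transform(source, record):
    """Map a raw record onto the target schema (customer_id, name, source)."""
    if source == "crm":
        customer_id, name = record["cust_id"], record["full_name"]
    else:
        customer_id, name = record
    return int(customer_id), name.strip().title(), source


def load(conn, rows):
    """Create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer ("
        "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, source TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(source, record) for source, record in extract()])
loaded = conn.execute(
    "SELECT customer_id, name, source FROM customer ORDER BY customer_id"
).fetchall()
```

After the load, every row has the same column types and name formatting regardless of which source it came from.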
6. Audit repository data
The repository must provide data that is:
- Relevant to the objectives of the repository
Before deployment, the data team will perform a quality audit to ensure that data meets the required standard. If the examination fails, they will review the ETL process and make changes where required.
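The exact audit criteria depend on the repository's requirements. As a rough sketch, the check below looks for two common problems, missing required values and duplicate keys. These two checks, the sample rows, and the `audit` helper are illustrative assumptions and not a complete quality standard.

```python
def audit(rows, required_fields, key_field):
    """Return a list of readable problems; an empty list means the audit passed."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Flag required fields that are absent or empty
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
        # Flag key values that already appeared in an earlier row
        key = row.get(key_field)
        if key in seen_keys:
            problems.append(f"row {index}: duplicate {key_field}={key}")
        seen_keys.add(key)
    return problems


# Hypothetical sample pulled from the repository before deployment
SAMPLE = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": ""},
    {"customer_id": 1, "name": "Ada L."},
]

report = audit(SAMPLE, required_fields=["customer_id", "name"], key_field="customer_id")
```

A non-empty `report` would send the team back to review the ETL process before the repository is released.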
7. Test security measures
Repositories introduce a certain degree of risk as they represent a single point of failure. If an unauthorized person gains access to the repository, that person may have access to all of the organization's data. That's why security is a vital consideration at every step of this process. The data team will usually perform a final security audit before deploying the data repository.
8. Make available to business users
Once it's up and running, the repository is delivered to the end business users. They will test the repository's performance according to their requirements, and they'll provide feedback if anything needs to change. Project sign-off usually occurs when business users have confirmed that the repository is working for them.
9. Monitor and maintain
Data repositories are not live databases in the production sense. However, they are constantly refreshed by data pipelines, and the data contained in the repository must always be timely. The data team will keep monitoring the repository over the course of its lifetime, and they will resolve any security or performance issues as they arise.
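One simple monitoring check is whether each table has been refreshed recently enough. The sketch below assumes the team records a last-refresh timestamp per table; the table names, the 24-hour threshold, and the `stale_tables` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_refreshed, max_age, now=None):
    """Return the names of tables whose last refresh is older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, refreshed_at in last_refreshed.items()
        if now - refreshed_at > max_age
    )


# Fixed reference time so the example is reproducible
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical refresh log: table name -> time of last successful pipeline run
last_refreshed = {
    "sales": now - timedelta(hours=2),
    "inventory": now - timedelta(hours=30),
}

stale = stale_tables(last_refreshed, max_age=timedelta(hours=24), now=now)
```

Any table listed in `stale` would prompt the data team to investigate the pipeline that feeds it.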