What is Data Warehousing?

What is a Data Warehouse?

A data warehouse is a repository containing standardized data from multiple sources. Data is transformed before ingestion into the warehouse, which means that warehouse data is cleansed and ready for relevant business purposes.

Why are Data Warehouses Used?

There are several applications for data warehouses, such as:

  • Analytics and Business Intelligence: A data warehouse offers a consolidated, standardized view of all relevant business data. This can provide a more accurate insight into the state of the business than individual analytics reports run on disparate data sources.
  • Consolidation: Some business functions work better when there is a single, unified data source. Financial reporting and resource planning systems, for example, may need access to data from multiple sources. 
  • Storage: Data warehouses store a copy of production data in a secure location. This can be important for compliance, reporting, and business continuity planning. 
  • System integration: Data warehouses can be bidirectional, outputting data to authorized systems. This can play a role in system integration, where data is transferred from one system to a warehouse, and then retrieved by a different system when required. 

How Does a Data Warehouse Operate?

The typical data warehouse structure consists of two elements: a data warehousing environment and a data pipeline.

Data Warehousing Environment

Data warehouses are not Big Data structures, strictly speaking. However, they can hold enormous quantities of structured data, and this calls for the right environment. 

This can be done with on-premise systems. Some of the most popular options for local hosting are:

For most enterprises, this kind of storage is not economically viable with on-premise servers. Most data warehouses are hosted on a cloud service, which offers a cost-effective and scalable alternative. The most popular cloud storage options include:

In each instance, the important thing is to have the right environment to host significant quantities of data. The environment hosts a large-scale relational database that can scale up without performance issues. The environment must also be compatible with a data pipeline. 

Data Pipeline

data pipeline is a software process that runs through three key stages: Extract, Transform, Load.

  • Extract: Data is obtained from sources. This data will be in a variety of different formats and may contain invalid or duplicated entries. 
  • Transform: Source data is hosted in a staging database. It then passes through several transformations, including harmonization, normalization, and cleansing. Through the use of data mapping, the data is converted so that it fits in the warehouse’s schema.
  • Load: Data is copied from the staging database into the warehousing environment. Because it has been transformed, it will sit neatly in the warehouse schema, regardless of its format at source. 

With an ETL platform such as Xplenty, this process is automated end-to-end. ETL platforms integrate with both the source databases and the data warehouse. Information is pushed from one place to another on a regular schedule, without the need for manual intervention. Administrators can configure the transformation schema as required. 

This process is known as a data pipeline because data flows smoothly from one location to another. 

What is Stored in a Data Warehouse?

A data warehouse is a relational database, usually quite large in scale, hosted in an environment that can efficiently process queries. 

This means that the data warehouse can only be used to store structured data. To clarify the different data types: 

  • Structured data: Information stored in a relational database table. This includes data from most production systems, such as an Enterprise Resource Planning or Customer Relationship Management system.  
  • Semi-structured data: These items have a logical structure other than a relational database table. A CSV file is one example of semi-structured data – it is a text file, but formatted in a way that is easy to import into a database. Semi-structured data can be warehoused if it is correctly pre-processed into structured data. 
  • Unstructured data: Any other form of data, including text documents, images, and audio files. These can not be warehoused without being converted to structured data.
  • Metadata: Metadata might hold information that points to other data, including unstructured data. Because the metadata itself is usually structured or semi-structured, it can easily be warehoused. 

So, for example, a gallery of photos is not normally stored in a data warehouse. But a table of metadata containing the filenames, format, and description for each picture can be warehoused, and thereby used in analytics. 

What’s the Difference Between a Data Warehouse, a Data Mart, and a Data Lake?

Data warehouses are a commonly used form of data repository. They are sometimes confused with other types of repositories, such as data lakes and data marts. But there are some crucial differences between them. 

Data Lake is a Big Data repository. Unlike a data warehouse, a data lake can hold both structured and unstructured data. This is because data does not pass through a transformation stage before ingestion into the lake. This makes the extract-load process much faster and less resource-intensive.

Instead, data is loaded as-is into the lake. Business users trying to navigate the data for analytics purposes must use tools like MapReduce to find what they need. A data lake can’t support production systems and is not ideal for highly volatile data. 

A Data Mart is essentially a smaller data warehouse. While the warehouse might store all of an organization’s information, a data mart only stores information relating to a specific department, project, or objective. 

Data marts can be virtual, which is a specially configured view of the main data warehouse. They can also exist separately on their own server, with their own data pipelines. Some marts might be hybrids of both, with some data drawn from the warehouse, and other department-specific data supplied by an ETL process.