The data lakehouse is an emerging data repository structure that combines the benefits of the data warehouse and the data lake. It allows BI users and data scientists to work from the same sources, and it makes it easier for organizations to implement data governance policies.
What are the Features of a Data Lakehouse?
Until recently, data architects have relied on two main types of data repository:
- Data warehouse: This repository holds structured data from relational databases. Input passes through a transformation layer that integrates and cleanses the data before loading it to a destination. Data in the warehouse fits within a defined schema.
- Data lake: These structures hold any data, including unstructured data like images and documents. Data lakes are big, fast, and cheap. The data does not need to fit any particular schema, and the lake does not try to apply a schema. Instead, data owners use the schema-on-read approach, which applies transformations when a person or process requests the data.
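The schema-on-read approach described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular lake engine's API: raw JSON lines are stored exactly as produced, and types and defaults are applied only at read time.

```python
import json

# Raw records land in the lake as-is; nothing validates them on write.
raw_lines = [
    '{"id": 1, "amount": "19.99", "region": "EU"}',
    '{"id": 2, "amount": "5.00"}',  # a missing field is fine at write time
]

def read_with_schema(lines):
    """Schema-on-read: apply types and defaults only when data is requested."""
    for line in lines:
        record = json.loads(line)
        yield {
            "id": int(record["id"]),
            "amount": float(record["amount"]),
            "region": record.get("region", "UNKNOWN"),  # default fills the gap
        }

rows = list(read_with_schema(raw_lines))
```

Note that the write path stays cheap and permissive; all the interpretive work is deferred to the reader, which is exactly the trade-off the lake makes.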
Many organizations now have these two structures in tandem, with a big data lake and multiple data warehouses, often with data duplicated between the two.
The data lakehouse attempts to create greater efficiency by creating data warehouses on data lake technology. Storage is fast and cheap, but the lakehouse approach improves data quality and eliminates redundancy. ETL plays a role in the lakehouse structure, providing a pipeline between the unsorted lake layer and the integrated warehouse layer.
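The ETL pipeline between the unsorted lake layer and the integrated warehouse layer can be sketched as follows. The in-memory `lake_layer` and `warehouse_layer` names are hypothetical stand-ins for the two storage tiers; the point is the cleanse-and-deduplicate transform in between.

```python
# Hypothetical stand-in for the raw, unsorted lake layer.
lake_layer = [
    {"customer": " Alice ", "order_id": 100},
    {"customer": "alice", "order_id": 100},   # duplicate once cleansed
    {"customer": "Bob", "order_id": 101},
]

def transform(records):
    """Cleanse and deduplicate raw lake records before loading them."""
    seen, clean = set(), []
    for rec in records:
        normalized = {
            "customer": rec["customer"].strip().title(),  # cleanse names
            "order_id": rec["order_id"],
        }
        key = (normalized["customer"], normalized["order_id"])
        if key not in seen:  # drop exact duplicates after normalization
            seen.add(key)
            clean.append(normalized)
    return clean

# Load step: the integrated warehouse layer holds only clean, unique rows.
warehouse_layer = transform(lake_layer)
```

In a real lakehouse this transform would run inside the platform's own pipeline tooling, but the shape is the same: raw records in, integrated records out, with one copy of each fact.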
Databricks introduced this concept in a paper that outlined the following features:
- Transaction support: Lakehouses can handle multiple data pipelines. This means that they support concurrent read and write transactions without compromising data integrity.
- Schemas: Warehouses apply a schema to all data; lakes do not. The lakehouse can apply schemas where appropriate, bringing a greater volume of data under schema governance.
- BI and analytics support: Both teams work with a single data repository. The information contained in the lakehouse has passed through a cleansing and integration process, which speeds up analytics. It also holds more data, updated more recently, than a warehouse, which improves the quality of BI.
- Extended data types: Warehouses can only store structured data. The lakehouse structure provides access to a much broader range of data, including files, video, audio, and system logs.
- End-to-end streaming: Lakehouses support streaming analytics, which facilitates real-time reporting. This is increasingly a must-have for many enterprises.
- Processing/storage decoupling: The data lake structure uses clusters, which run on low-cost hardware. This approach offers very cheap decentralized storage. To further improve efficiency, the lakehouse model decouples processing from storage. This means that the lakehouse might store data in one cluster, but execute queries on a different cluster. It will always aim to maximize available resources.
- Openness: The Databricks version of the lakehouse uses the open standard Parquet. This storage format has a public API that developers can easily access via Python or R.
The other main implementation of the lakehouse principle is Microsoft Azure's Synapse Analytics. The field is still maturing, so further implementations may emerge over time.
What Problems Does a Data Lakehouse Solve?
Data warehouses and data lakes are both extremely popular. The two exist side-by-side in many enterprises without any serious problems. However, there are areas that are candidates for improvement, such as:
- Data duplication: If an organization has both a data lake and several data warehouses, this will create redundancies. At best, this is inefficient. At worst, it may lead to data inconsistencies. A data lakehouse unifies everything, deduplicating data and creating a single version of truth for the organization.
- High storage costs: Warehouses and lakes both help to reduce storage costs. Warehouses do so by reducing redundancy and integrating disparate sources. Lakes do it by using Big Data file systems like Hadoop and Spark to store data on cheap hardware. The cheapest possible way to store data is to combine these techniques, which is the lakehouse structure's goal.
- Silo between BI and analytics: Business analysts use integrated data sources like a warehouse or data mart. Data scientists work with lakes, using analytics techniques to navigate the unsorted data. The two teams rarely have cause to interact, yet their work often overlaps or even contradicts. With a data lakehouse, both teams work from the same repository.
- Data stagnation: Stagnation is a major problem in data lakes, which can quickly become data swamps if left untended. Businesses often dump their data into a lake without properly cataloging it, making it hard to know if the data has expired. The lakehouse structure brings greater organization to Big Data and helps to identify data that is surplus to requirements.
- Risk of future incompatibility: Data analytics is still an evolving field, with new tools and techniques appearing every year. Some of these might only be compatible with data lakes, while others might only work with warehouses. The flexible lakehouse structure means enterprises can prepare for the future either way.
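The stagnation problem above comes down to cataloging: if each dataset carries a last-refresh timestamp, stale data is trivial to flag. A minimal sketch, with a hypothetical in-memory catalog standing in for a real metadata store:

```python
from datetime import datetime, timedelta

# Hypothetical catalog mapping each dataset to its last-refresh timestamp.
catalog = {
    "sales_2019": datetime(2019, 12, 31),
    "sales_current": datetime(2024, 6, 1),
}

def stale_datasets(catalog, now, max_age_days=365):
    """Return datasets not refreshed within the retention window."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, last in catalog.items() if last < cutoff)

stale = stale_datasets(catalog, now=datetime(2024, 7, 1))
```

A lakehouse platform would track this metadata automatically as part of its table format, but the principle is the same: without timestamps in a catalog, there is no way to tell fresh data from a data swamp.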
What are the Issues with Data Lakehouses?
Data experts have pointed out some flaws in the lakehouse approach. Most notably:
- Monolithic structure: The all-in-one approach of a lakehouse has some benefits, but it also introduces some issues. Monolithic structures can be inflexible, hard to maintain, and may end up serving all users poorly. Architects and designers usually prefer a more modular approach that they can configure for different use cases.
- Not a substantial improvement over current structures: There's still some doubt about whether lakehouses will really offer much additional value. Critics argue that a lake-warehouse structure, combined with the right automated tools, can deliver similar efficiency.
- Tech isn't there yet: The ultimate vision involves a lot of machine learning and artificial intelligence. These technologies will need to mature further before lakehouses reach their proposed capabilities.