Data Lakehouse

What it is, key features, and benefits. This guide provides a comprehensive perspective on data lakehouses, their essential building blocks, high-level architecture, and key considerations for building your own open data lakehouse.

Figure: Data flows from structured, semi-structured, and unstructured sources through the lakehouse and out to BI, streaming analytics, data science, and machine learning.

What is a Data Lakehouse?

A data lakehouse is a data management architecture that combines key capabilities of data lakes and data warehouses into a unified platform. It brings together the benefits of a data lake, such as low-cost storage and broad data access, with the benefits of a data warehouse, such as data structure, performance, and management features. Lakehouses are increasingly built on open data formats and open table formats such as Apache Iceberg, Hudi, and Delta Tables to provide flexibility and interoperability.

Learn more about Qlik Open Lakehouse on Apache Iceberg

What is Apache Iceberg?

Apache Iceberg is an open table format designed to manage large-scale data lakehouses and enable high-performance analytics on open data formats. It allows collections of files to be treated as logical tables, making it well suited for lakehouse architectures.

With Iceberg, users can store data in cloud object stores and process and query it with multiple engines, offering flexibility and interoperability across platforms.

Iceberg supports key features such as ACID compliance, dynamic partitioning, time travel, and schema evolution, helping to ensure high performance and data integrity.
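To make these features concrete, here is a minimal sketch of creating, evolving, and time-traveling an Iceberg table with Spark SQL. It assumes a recent Spark 3.x with the Apache Iceberg runtime on the classpath and an Iceberg catalog already configured under the name "lakehouse"; the catalog, namespace, and table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is on the classpath and a catalog
# named "lakehouse" is configured (see the architecture section below).
spark = SparkSession.builder.appName("iceberg-features").getOrCreate()

# Create an Iceberg table with hidden partitioning on the event date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Each write is an ACID transaction that produces a new table snapshot.
spark.sql("INSERT INTO lakehouse.sales.orders "
          "VALUES (1, 19.99, current_timestamp())")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMN currency STRING")

# Time travel: read the table as of its earliest snapshot.
first = spark.sql(
    "SELECT snapshot_id FROM lakehouse.sales.orders.snapshots "
    "ORDER BY committed_at LIMIT 1"
).collect()[0].snapshot_id
spark.sql(
    f"SELECT * FROM lakehouse.sales.orders VERSION AS OF {first}"
).show()
```

Because each snapshot is just metadata plus immutable data files in the object store, other Iceberg-compatible engines can query the same table and the same history.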

Additionally, Apache Iceberg is backed by a strong open-source community, making it a reliable, versatile, and open solution for modern data management needs.

Learn more about Apache Iceberg here  


Data Lakehouse Features and Benefits

The lakehouse data platform ensures that data analysts and AI engineers can use the most recent and broadest data sets for business intelligence, analytics, generative AI, and machine learning. Having a single system to manage simplifies the enterprise data infrastructure and allows analysts and data scientists to work more efficiently.

Here we present the key features of data lakehouses and the benefits they bring to your organization.  

Feature: Single repository for many applications
Lakehouses allow you to access data for AI and machine learning, data science, SQL, and advanced analytics directly on a single repository of clean, integrated source data.
Benefit: Because you only have to maintain one data repository, you improve operational efficiency and can support business intelligence, reporting, AI and ML, and other workloads on the same quality data.

Feature: Support for diverse data types
Data lakehouses let you process all types of data, including structured, semi-structured, and unstructured data.
Benefit: You can ingest, store, process, refine, and analyze a broad range of data types and applications, such as IoT data, text, images, audio, video, system logs, and relational data.

Feature: Open and standardized formats
Lakehouses use open, standardized file formats such as Parquet, Avro, or ORC, and also support open table formats such as Apache Iceberg, Hudi, or Delta Tables.
Benefit: Open formats enable broad, flexible, and efficient data consumption across diverse query and processing engines and programming languages such as Python and R; many also support SQL. (A short Parquet example follows this table.)

Feature: Separation of storage and processing
Unlike data warehouses, lakehouses truly decouple storage and compute, letting you use separate engines and resources for storing and processing data.
Benefit: Utilizing open table formats such as Apache Iceberg, open lakehouses allow you to store data in inexpensive cloud-based object stores while using a variety of Iceberg-compatible engines to process and query all of your data. You can scale to larger datasets and more concurrent users, and because compute clusters run on inexpensive hardware, you save money.

Feature: Support for end-to-end streaming
Data lakehouses handle real-time streaming data as well as historical or batch data in a single framework.
Benefit: Organizations can use the same underlying infrastructure for both batch and streaming use cases. For example, Qlik Open Lakehouse provides a high-throughput ingestion option to bring in data from both batch and streaming sources.

Feature: Concurrent read and write transactions
Data lakehouses can handle multiple data pipelines.
Benefit: Multiple users can concurrently read and write ACID-compliant transactions without compromising data integrity.

Feature: Governance mechanisms
Lakehouses can support strong governance and auditing capabilities.
Benefit: Having a single control point lets you better control publishing, sharing, and user access to data.
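To make the "open and standardized formats" point concrete, here is a minimal sketch (not Qlik-specific) of one engine writing a Parquet file and a different reader consuming the same file. It assumes the pyarrow and pandas packages are installed; the file name and columns are illustrative.

```python
# One engine writes an open-format Parquet file...
import pyarrow as pa
import pyarrow.parquet as pq

readings = pa.table({
    "device_id": [1, 2, 3],
    "temperature_c": [21.4, 22.1, 19.8],
})
pq.write_table(readings, "sensor_readings.parquet")

# ...and a different reader consumes the very same file, because the
# format is open and standardized rather than engine-specific.
import pandas as pd

df = pd.read_parquet("sensor_readings.parquet")
print(df.head())
```

The same idea extends to cloud object stores: any Parquet-aware engine can read the files directly, without an export or conversion step.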

The Iceberg Data Lakehouse Stack

Choosing the Right Building Blocks

Data Lakehouse vs Data Warehouse vs Data Lake

Historically, you’ve had two primary options for a big data repository: data lake or data warehouse. To support analytics, AI, data science and machine learning, it’s likely that you’ve had to maintain both of these options simultaneously and link the systems together. This often leads to data duplication, security challenges and additional infrastructure expense. Data lakehouses can help overcome these issues.

Figure: The evolution of data storage, from the data warehouse (1980s) to the data lake (2011) to the lakehouse (2020), moving from structured data and BI toward unstructured data, real-time analytics, and ML.

Overview
Data Warehouse: Data warehouses ingest and hold highly structured, unified data to support specific business intelligence and analytics needs. The data has been transformed to fit a defined schema.
Data Lake: Data lakes ingest and hold raw data in a wide variety of formats to directly support data science, AI, and machine learning. Massive volumes of structured and unstructured data, such as ERP transactions and call logs, can be stored cost-effectively. Data teams can build data pipelines and schema-on-read transformations to make data stored in a data lake available to BI and analytics tools.
Data Lakehouse: The data lakehouse combines the best of data warehouses and data lakes; it can eliminate data redundancies and improve data quality while lowering costs. Open table formats enable data to be stored cost-efficiently in cloud object stores while being queried and processed by multiple engines.

Data Format
Data Warehouse: Closed, proprietary format
Data Lake: Open format
Data Lakehouse: Open format

Type of Data
Data Warehouse: Structured data, with limited support for semi-structured data
Data Lake: All types: structured, semi-structured, textual, and unstructured (raw) data
Data Lakehouse: All types: structured, semi-structured, textual, and unstructured (raw) data

Data Access
Data Warehouse: SQL only; no direct access to files
Data Lake: Open APIs for direct access with SQL, R, Python, and other languages
Data Lakehouse: SQL, along with API extensions to access tables and data

Reliability
Data Warehouse: High quality; reliable data with ACID transactions
Data Lake: Low quality; becomes a data swamp without data catalogs and the right governance
Data Lakehouse: High quality; reliable data with ACID transactions

Governance and Security
Data Warehouse: Fine-grained security and governance at the row/column level for tables
Data Lake: Fine-grained security and governance at the row/column level for tables
Data Lakehouse: Fine-grained security and governance at the row/column level for tables

Scalability
Data Warehouse: Scaling becomes exponentially more expensive
Data Lake: Scales to hold any amount of data at low cost, regardless of type
Data Lakehouse: Scales to hold any amount of data at low cost, regardless of type

Streaming
Data Warehouse: Partial; limited scale
Data Lake: Yes
Data Lakehouse: Yes

Query Engine Lock-In
Data Warehouse: Yes
Data Lake: No
Data Lakehouse: No

Resources
Data Warehouse: Learn more about data warehouses; Learn more about cloud data warehouses; Dive into data warehouse automation
Data Lake: Learn more about data lakes; Take a deeper look at data lake vs data warehouse; Cloud data lake comparison guide
Data Lakehouse: Learn more about lakehouses; Guide to Iceberg lakehouses; Learn more about Qlik Open Lakehouse

Data Lakehouse Architecture

A data lakehouse typically consists of six key layers, as depicted below: the ingestion layer, the storage layer, the physical data (file format) layer, the table format/metadata layer, the catalog layer, and the query/processing layer.

Components of a Lakehouse Architecture

Figure: Streaming and batch sources feed into a Delta Lake, where data is refined through successive stages and consumed through integrations for analytics, machine learning, and storage.

The section below dives into the details of each of these layers so you can understand the lakehouse architecture better.

  1. Ingestion Layer: Offers capabilities to ingest data from various sources into the lakehouse, including batch and real-time data pipelines using change data capture (CDC) or streaming. This layer should make it easy to ingest and load high volumes of data into the lakehouse in real time (a streaming-ingestion sketch appears after this list).

  2. Storage Layer: Stores all types of data (structured, semi-structured, unstructured) in a single unified platform, often using cloud-based object stores like AWS S3, Azure Blob Storage, or Google Cloud Storage. Data can be stored in raw, transformed, or clean, business-ready buckets, with the necessary transformation and cleansing applied along the way.

  3. Physical Data Layer: Open file formats define how a lakehouse writes and reads data. They focus on efficient storage and compression, and they significantly impact speed and performance. Specifically, they define how the raw bytes representing records and columns are organized and encoded on disk or in a distributed file system such as Amazon S3. Some of the more common open file formats for lakehouses include Apache Parquet, Apache Avro, and ORC.

  4. Table Format/Metadata Layer: The differentiating factor between a data lake and a lakehouse is the table format, or table metadata layer. It provides an abstraction on top of the physical data layer to facilitate organizing, querying, and updating data. Common open table formats include Apache Iceberg, Apache Hudi, and Delta Tables. They store information about which objects are part of a table, enabling SQL engines to see a collection of files as a table with rows and columns that can be queried and updated transactionally.

  5. Catalog Layer: A catalog is a central registry within the lakehouse framework that tracks and manages the metadata of the tables underneath. It acts as a source of truth for the current state of a table, including its schema, partitions, and data locations, allowing different compute engines to access and manipulate lakehouse tables consistently. Examples include the AWS Glue catalog, Snowflake Open Catalog, Polaris, Unity Catalog, Hive Catalog, Project Nessie, and REST catalogs.

  6. Query/Compute Layer: Provides the processing power to analyze and query data stored in the storage layer. It may use distributed processing engines like Apache Spark, Presto, or Hive, or other cloud data engines, to handle large datasets efficiently. This layer enables users to access and analyze data from the lakehouse using diverse tools and applications such as query engines, BI dashboards, data science platforms, and SQL clients (the configuration sketch after this list shows how these layers fit together in practice).
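Several of these layers map naturally onto engine configuration. Below is a hedged sketch of a SparkSession wired up for an Iceberg lakehouse; the bucket, catalog name, and table are hypothetical placeholders, and it assumes the Iceberg Spark runtime jar is available. A Glue, REST, Polaris, or Nessie catalog would plug in through the same spark.sql.catalog.* settings.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-layers")
    # Table format / metadata layer: enable Iceberg's SQL extensions.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Catalog layer: register an Iceberg catalog (a simple Hadoop catalog here).
    .config("spark.sql.catalog.lakehouse",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    # Storage layer: table data and metadata live under an object-store path.
    .config("spark.sql.catalog.lakehouse.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Physical data layer: Iceberg writes Parquet files (by default) under the
# warehouse path. Query/compute layer: any Iceberg-aware engine can now
# query the same (hypothetical) table.
spark.sql("SELECT count(*) AS events FROM lakehouse.raw.events").show()
```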
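To illustrate the ingestion layer, here is a minimal Spark Structured Streaming sketch that lands a stream into a raw Iceberg table. It uses Spark's built-in "rate" source as a stand-in for a real CDC or message-queue feed; the catalog, table, and checkpoint paths are hypothetical, and the target table is assumed to exist in the configured catalog.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-ingest").getOrCreate()

# Stand-in streaming source; in practice this would be CDC output or a
# message queue such as Kafka.
events = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()  # columns: timestamp, value
)

# Append micro-batches to a raw (landing-zone) Iceberg table.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/raw_events")
    .trigger(processingTime="1 minute")
    .toTable("lakehouse.raw.events")
)
query.awaitTermination()
```

The same table can then be refined into transformed and business-ready zones in the storage layer and queried through the compute layer.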

Cloud Data Lake Comparison Guide

Get an unbiased, side-by-side look at all the major cloud data lake vendors, including AWS, Azure, Google, Cloudera, Databricks, and Snowflake.

Learn more about data integration with Qlik