Introduction to Iceberg Lakehouses


What it is, key features, and benefits. This guide provides a comprehensive perspective on lakehouses: their essential building blocks, high-level architecture, and key considerations for building your own open data lakehouse.
A data lakehouse is a data management architecture that combines key capabilities of data lakes and data warehouses into a unified platform. It brings the benefits of a data lake, such as low-cost storage and broad data access, and the benefits of a data warehouse, such as data structure, performance, and management features. Lakehouses are increasingly built on open data formats and open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake to provide flexibility and interoperability.
Learn more about Qlik Open Lakehouse on Apache Iceberg
Apache Iceberg is an open table format designed to manage large-scale data lakehouses and enable high-performance analytics on open data formats. It allows files to be treated as logical table entities, making it well-suited for lakehouse architectures.
With Iceberg, users can store data in cloud object stores and process/query it utilizing multiple different engines, offering flexibility and interoperability across platforms.
Iceberg supports key features such as ACID transactions, hidden partitioning, time travel, and schema evolution, ensuring high performance and data integrity.
Additionally, Apache Iceberg fosters a strong open-source community, making it a reliable, versatile and open solution for modern data management needs.
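To make these features concrete, here is a minimal PySpark sketch. The catalog name, warehouse path, table, and data are illustrative assumptions, not prescribed by Iceberg: it creates a hidden-partitioned Iceberg table, commits a transactional write, evolves the schema, and reads an earlier snapshot.

```python
# Minimal PySpark sketch of core Iceberg features. The catalog name ("local"),
# warehouse path, and table are illustrative assumptions; the session must be
# launched with the Iceberg Spark runtime on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: queries filter on event_ts; readers never need to
# know how the table is physically partitioned.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# ACID write: each INSERT commits a new table snapshot atomically.
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello')")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMN source STRING")

# Time travel: read the table as of its first snapshot.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at LIMIT 1"
).first()[0]
spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {first_snapshot}").show()
```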
Learn more about Apache Iceberg here
The lakehouse data platform ensures that data analysts and AI engineers can utilize the most recent and broadest data sets for business intelligence, analytics, generative AI, and machine learning. And having one system to manage simplifies the enterprise data infrastructure and allows analysts and data scientists to work more efficiently.
Here we present the key features of data lakehouses and the benefits they bring to your organization.
FEATURE | BENEFIT |
---|---|
Single repository for many applications | This improves operational efficiency and supports multiple use cases on quality data for business intelligence, reporting, AI and ML and other workloads since you only have to maintain one data repository. |
Support for diverse data types | This allows you to ingest, store, process, refine and analyze a broad range of data types and applications, such as IoT data, text, images, audio, video, system logs and relational data. |
Open & standardized formats | Open formats facilitate broad, flexible and efficient data consumption across diverse sets of query and processing engines and programming languages such as Python and R. Many also support SQL. |
Separation of storage & processing | Utilizing open table formats such as Apache Iceberg, open lakehouses allow you to store data in inexpensive cloud-based object stores while using a variety of Iceberg-compatible engines to process and query all of your data. |
Scalability | You can scale to larger datasets and support more concurrent users. Plus, these clusters run on inexpensive hardware, which saves you money. |
Support for end-to-end streaming | Enables organizations to utilize the same underlying infrastructure for both batch and streaming use cases. For example, Qlik Open Lakehouse provides a high-throughput ingestion option to bring in data from both batch and streaming sources. |
Concurrent read & write transactions | Multiple users can concurrently read and write ACID-compliant transactions without compromising data integrity. |
Governance mechanisms | Having a single control point lets you better control publishing, sharing and user access to data. |
Historically, you’ve had two primary options for a big data repository: data lake or data warehouse. To support analytics, AI, data science and machine learning, it’s likely that you’ve had to maintain both of these options simultaneously and link the systems together. This often leads to data duplication, security challenges and additional infrastructure expense. Data lakehouses can help overcome these issues.
Attributes | Data Warehouse | Data Lake | Data Lakehouse |
---|---|---|---|
Overview | Data warehouses ingest and hold highly structured and unified data to support specific business intelligence and analytics needs. The data has been transformed to fit a defined schema. | Data lakes ingest and hold raw data in a wide variety of formats to directly support data science, AI and machine learning. Massive volumes of structured and unstructured data like ERP transactions and call logs can be stored cost-effectively. Data teams can build data pipelines and schema-on-read transformations to make data stored in a data lake available for BI and analytics tools. | Data lakehouses combine the best of data warehouses and data lakes and can eliminate data redundancies, improving data quality while lowering costs. Open table formats enable data to be stored cost-efficiently in cloud object stores while remaining queryable and processable by multiple engines. |
Data Format | Closed proprietary format | Open format | Open format |
Type of Data | Structured data, with limited support for semi-structured | All types: structured, semi-structured, textual, and unstructured (raw) data | All types: structured, semi-structured, textual, and unstructured (raw) data |
Data Access | SQL only; no direct access to files | Open APIs for direct access with SQL, R, Python and other languages | SQL, along with API extensions to access tables and data |
Reliability | High quality - reliable data with ACID transactions | Low quality - becomes a data swamp without data catalogs and the right governance | High quality - reliable data with ACID transactions |
Governance and Security | Fine-grained security and governance at the row/column level for tables | Limited; security and governance must typically be applied at the file or object level | Fine-grained security and governance at the row/column level for tables |
Scalability | Scaling becomes exponentially more expensive | Scales to hold any amount of data at low cost, regardless of type | Scales to hold any amount of data at low cost, regardless of type |
Streaming | Partial; limited scale | Yes | Yes |
Query Engine Lock-In | Yes | No | No |
Resources | Learn more about data warehouses | Learn more about data lakes | Learn more about lakehouses |
A data lakehouse typically consists of six key layers: the ingestion layer, storage layer, physical data layer, table format/metadata layer, catalog layer, and query/compute layer.
The sections below dive into each of these layers to explain the lakehouse architecture in more detail.
Ingestion Layer: Offers capabilities to ingest data from various sources into the lakehouse, including batch and real-time data pipelines using change data capture (CDC) or streaming. It should make it easy to ingest and load high volumes of data into the lakehouse in real time with just a few clicks.
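As a rough illustration of a real-time pipeline, the sketch below uses Spark Structured Streaming to read change events from a Kafka topic and append them to an Iceberg table; the broker, topic, schema, and table names are assumptions made up for the example.

```python
# Hedged sketch: streaming ingestion into an Iceberg table with Spark
# Structured Streaming. Broker, topic, schema, and table names are placeholder
# assumptions; requires the Kafka and Iceberg Spark packages on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

schema = StructType([
    StructField("order_id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read CDC events from a Kafka topic as they arrive.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders_cdc")
    .load()
)
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

# Append each micro-batch to the Iceberg table; the checkpoint makes the
# pipeline restartable without duplicating commits.
(
    parsed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders_cdc")
    .toTable("local.db.orders")
    .awaitTermination()
)
```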
Storage Layer: Stores all types of data (structured, semi-structured, unstructured) in a single unified platform, often using cloud-based object stores like Amazon S3, Azure Blob Storage, or Google Cloud Storage. Data can be stored in raw, transformed, or clean, business-ready buckets after the necessary transformation and cleansing.
Physical Data Layer: Open file formats define how a lakehouse writes and reads data. They focus on efficient storage and compression of data and significantly impact speed and performance. They define how the raw bytes representing records and columns are organized and encoded on disk or in a distributed object store such as Amazon S3. Some of the more common open file formats for lakehouses include Apache Parquet, Apache Avro, and ORC.
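For a feel of what this layer does, here is a small PyArrow sketch (the file path and column names are invented for the example) that writes a compressed Parquet file and reads back just one column, which is what makes columnar formats efficient for analytical scans.

```python
# Illustrative sketch: writing and reading a columnar Parquet file with
# PyArrow. The file path and column names are made up for this example.
import pyarrow as pa
import pyarrow.parquet as pq

readings = pa.table({
    "device_id": [1, 2, 3],
    "temperature": [21.5, 22.1, 19.8],
})

# Columnar layout plus compression is what makes Parquet/ORC efficient
# for analytical workloads.
pq.write_table(readings, "readings.parquet", compression="zstd")

# Column pruning: read back only the column a query actually needs.
print(pq.read_table("readings.parquet", columns=["temperature"]))
```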
Table Format/Metadata Layer: The differentiating factor between a data lake and a lakehouse is the table format, or table metadata layer. It provides an abstraction on top of the physical data layer to facilitate organizing, querying, and updating data. Common open table formats include Apache Iceberg, Apache Hudi, and Delta Lake. They store information about which objects are part of a table, enabling SQL engines to see a collection of files as a table with rows and columns that can be queried and updated transactionally.
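One way to see this abstraction at work is through Iceberg's metadata tables. The sketch below reuses the illustrative Spark session and table from the earlier example (so the catalog and table names remain assumptions) to list the snapshots and data files behind the logical table.

```python
# Hedged sketch: inspecting the metadata layer that turns a pile of files
# into a table. Assumes the Iceberg-configured Spark session and the
# illustrative local.db.events table from the earlier example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each snapshot records a transactional commit against the table.
spark.sql("SELECT snapshot_id, operation, committed_at FROM local.db.events.snapshots").show()

# The files metadata table lists the physical data files behind the logical table.
spark.sql("SELECT file_path, record_count FROM local.db.events.files").show(truncate=False)
```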
Catalog Layer: A catalog is a central registry within the lakehouse framework that tracks and manages the metadata of the tables underneath it. It acts as the source of truth for where to find the current state of a table, including its schema, partitions, and data locations, allowing different compute engines to access and manipulate lakehouse tables consistently. Examples include the AWS Glue Data Catalog, Snowflake Open Catalog, Apache Polaris, Unity Catalog, the Hive Metastore, Project Nessie, and REST catalogs.
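As a small illustration, the PyIceberg sketch below connects to a REST catalog and loads a table; the endpoint, warehouse location, and table identifier are placeholders, and real deployments will also need authentication configured for their catalog service.

```python
# Hedged sketch: resolving a table through a REST catalog with PyIceberg.
# The URI, warehouse location, and table identifier are placeholders;
# authentication options depend on the catalog service you use.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

# The catalog is the source of truth for the table's current metadata.
table = catalog.load_table("analytics.events")
print(table.schema())
print(table.current_snapshot())
```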
Query/Compute Layer: Provides the processing power to analyze and query data stored in the storage layer. It may utilize distributed processing engines like Apache Spark, Presto, or Hive, or other cloud data engines, to handle large datasets efficiently. This layer enables users to access and analyze data from the lakehouse using diverse tools and applications such as query engines, BI dashboards, data science platforms, and SQL clients.
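Because the table format and catalog are open, more than one engine can read the same data. As a rough example, the sketch below points DuckDB's iceberg extension at an Iceberg table's metadata file; the S3 path is a placeholder, and object-store credentials would need to be configured separately.

```python
# Hedged sketch: querying the same Iceberg data from a second engine.
# The metadata path is a placeholder; object-store credentials must be
# configured separately (e.g. via DuckDB's httpfs settings).
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")

# iceberg_scan reads the table's current metadata and plans the file scan.
rows = con.execute("""
    SELECT count(*) AS row_count
    FROM iceberg_scan('s3://my-bucket/warehouse/db/events/metadata/v2.metadata.json')
""").fetchall()
print(rows)
```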
Get an unbiased, side-by-side look at all the major cloud data lake vendors, including AWS, Azure, Google, Cloudera, Databricks, and Snowflake.