Introduction to Iceberg Lakehouses


What it is, key features, and benefits. This guide provides a comprehensive perspective on lakehouses: their essential building blocks, high-level architecture, and key considerations for building your own open data lakehouse.
A data lakehouse is a data management architecture that combines key capabilities of data lakes and data warehouses into a unified platform. It brings the benefits of a data lake, such as low-cost storage and broad data access, and the benefits of a data warehouse, such as data structure, performance, and management features. Lakehouses are increasingly built on open data formats and open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake to provide flexibility and interoperability.
Learn more about Qlik Open Lakehouse on Apache Iceberg
Apache Iceberg is an open table format designed to manage large-scale data lakehouses and enable high-performance analytics on open data formats. It allows files to be treated as logical table entities, making it well-suited for lakehouse architectures.
With Iceberg, users can store data in cloud object stores and process/query it utilizing multiple different engines, offering flexibility and interoperability across platforms.
Iceberg supports key features such as ACID transactions, hidden partitioning, time travel, and schema evolution, ensuring high performance and data integrity.
Additionally, Apache Iceberg fosters a strong open-source community, making it a reliable, versatile and open solution for modern data management needs.
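To make these features concrete, here is a minimal PySpark sketch. The catalog name, warehouse path, table, and data are illustrative assumptions, not prescribed by Iceberg: it creates a hidden-partitioned Iceberg table, commits a transactional write, evolves the schema, and reads an earlier snapshot.

```python
# Minimal PySpark sketch of core Iceberg features. The catalog name ("local"),
# warehouse path, and table are illustrative assumptions; the session must be
# launched with the Iceberg Spark runtime on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: queries filter on event_ts; readers never need to
# know how the table is physically partitioned.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# ACID write: each INSERT commits a new table snapshot atomically.
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello')")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMN source STRING")

# Time travel: read the table as of its first snapshot.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at LIMIT 1"
).first()[0]
spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {first_snapshot}").show()
```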
Learn more about Apache Iceberg here
The lakehouse data platform ensures that data analysts and AI engineers can utilize the most recent and broadest data sets for business intelligence, analytics, generative AI, and machine learning. And having one system to manage simplifies the enterprise data infrastructure and allows analysts and data scientists to work more efficiently.
Here we present the key features of data lakehouses and the benefits they bring to your organization.
FEATURE | BENEFIT |
---|---|
Single repository for many applications | This improves operational efficiency and supports multiple use cases on quality data for business intelligence, reporting, AI and ML and other workloads since you only have to maintain one data repository. |
Support for diverse data types | This allows you to ingest, store, process, refine and analyze a broad range of data types and applications, such as IoT data, text, images, audio, video, system logs and relational data. |
Open & standardized formats | Open formats facilitate broad, flexible and efficient data consumption across diverse sets of query and processing engines and programming languages such as Python and R. Many also support SQL. |
Separation of storage & processing | Utilizing open table formats such as Apache Iceberg, open lakehouses allow you to store data in inexpensive cloud-based object stores while using a variety of Iceberg-compatible engines to process and query all of your data. |
Scalability | You can scale to larger datasets and support more concurrent users. Plus, these clusters run on inexpensive hardware, which saves you money. |
Support for end-to-end streaming | Enables organizations to utilize the same underlying infrastructure for both batch and streaming use cases. For example, Qlik Open Lakehouse provides a high-throughput ingestion option to bring in data from both batch and streaming sources. |
Concurrent read & write transactions | Multiple users can concurrently read and write ACID-compliant transactions without compromising data integrity. |
Governance mechanisms | Having a single control point lets you better control publishing, sharing and user access to data. |
Historically, you’ve had two primary options for a big data repository: data lake or data warehouse. To support analytics, AI, data science and machine learning, it’s likely that you’ve had to maintain both of these options simultaneously and link the systems together. This often leads to data duplication, security challenges and additional infrastructure expense. Data lakehouses can help overcome these issues.
Attributes | Data Warehouse | Data Lake | Data Lakehouse |
---|---|---|---|
Overview | Data warehouses ingest and hold highly structured and unified data to support specific business intelligence and analytics needs. The data has been transformed to fit a defined schema. | Data lakes ingest and hold raw data in a wide variety of formats to directly support data science, AI and machine learning. Massive volumes of structured and unstructured data like ERP transactions and call logs can be stored cost-effectively. Data teams can build data pipelines and schema-on-read transformations to make data stored in a data lake available for BI and analytics tools. | Data lakehouses combine the best of data warehouses and data lakes and can eliminate data redundancies, improving data quality while lowering costs. Open table formats enable data to be stored cost-efficiently in cloud object stores while remaining queryable and processable by multiple engines. |
Data Format | Closed proprietary format | Open format | Open format |
Type of Data | Structured data, with limited support for semi-structured | All types: structured, semi-structured, textual, and unstructured (raw) data | All types: structured, semi-structured, textual, and unstructured (raw) data |
Data Access | SQL only; no direct access to files | Open APIs for direct access with SQL, R, Python and other languages | SQL, along with API extensions to access tables and data |
Reliability | High quality - reliable data with ACID transactions | Low quality - becomes a data swamp without data catalogs and the right governance | High quality - reliable data with ACID transactions |
Governance and Security | Fine-grained security and governance at the row/column level for tables | Limited; security and governance must typically be applied at the file or object level | Fine-grained security and governance at the row/column level for tables |
Scalability | Scaling becomes exponentially more expensive | Scales to hold any amount of data at low cost, regardless of type | Scales to hold any amount of data at low cost, regardless of type |
Streaming | Partial; limited scale | Yes | Yes |
Query Engine Lock-In | Yes | No | No |
Resources | Learn more about data warehouses | Learn more about data lakes | Learn more about lakehouses |
A data lakehouse typically consists of six key layers: the ingestion layer, storage layer, physical data layer, table format/metadata layer, catalog layer, and query/compute layer.
The sections below dive into each of these layers to explain the lakehouse architecture in more detail.
Ingestion Layer: Offers capabilities to ingest data from various sources into the lakehouse, including batch and real-time data pipelines using change data capture (CDC) or streaming. It should make it easy to ingest and load high volumes of data into the lakehouse in real time with just a few clicks.
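As a rough illustration of a real-time pipeline, the sketch below uses Spark Structured Streaming to read change events from a Kafka topic and append them to an Iceberg table; the broker, topic, schema, and table names are assumptions made up for the example.

```python
# Hedged sketch: streaming ingestion into an Iceberg table with Spark
# Structured Streaming. Broker, topic, schema, and table names are placeholder
# assumptions; requires the Kafka and Iceberg Spark packages on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

schema = StructType([
    StructField("order_id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read CDC events from a Kafka topic as they arrive.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders_cdc")
    .load()
)
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

# Append each micro-batch to the Iceberg table; the checkpoint makes the
# pipeline restartable without duplicating commits.
(
    parsed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders_cdc")
    .toTable("local.db.orders")
    .awaitTermination()
)
```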
Storage Layer: Stores all types of data (structured, semi-structured, unstructured) in a single unified platform, often using cloud-based object stores like Amazon S3, Azure Blob Storage, or Google Cloud Storage. Data can be stored in raw, transformed, or clean, business-ready buckets after the necessary transformation and cleansing.
Physical Data Layer: Open file formats define how a lakehouse writes and reads data. They focus on efficient storage and compression of data and significantly impact speed and performance. They define how the raw bytes representing records and columns are organized and encoded on disk or in a distributed object store such as Amazon S3. Some of the more common open file formats for lakehouses include Apache Parquet, Apache Avro, and ORC.
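For a feel of what this layer does, here is a small PyArrow sketch (the file path and column names are invented for the example) that writes a compressed Parquet file and reads back just one column, which is what makes columnar formats efficient for analytical scans.

```python
# Illustrative sketch: writing and reading a columnar Parquet file with
# PyArrow. The file path and column names are made up for this example.
import pyarrow as pa
import pyarrow.parquet as pq

readings = pa.table({
    "device_id": [1, 2, 3],
    "temperature": [21.5, 22.1, 19.8],
})

# Columnar layout plus compression is what makes Parquet/ORC efficient
# for analytical workloads.
pq.write_table(readings, "readings.parquet", compression="zstd")

# Column pruning: read back only the column a query actually needs.
print(pq.read_table("readings.parquet", columns=["temperature"]))
```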
Table Format/Metadata Layer: The differentiating factor between a data lake and a lakehouse is the table format, or table metadata layer. It provides an abstraction on top of the physical data layer to facilitate organizing, querying, and updating data. Common open table formats include Apache Iceberg, Apache Hudi, and Delta Lake. They store information about which objects are part of a table, enabling SQL engines to see a collection of files as a table with rows and columns that can be queried and updated transactionally.
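One way to see this abstraction at work is through Iceberg's metadata tables. The sketch below reuses the illustrative Spark session and table from the earlier example (so the catalog and table names remain assumptions) to list the snapshots and data files behind the logical table.

```python
# Hedged sketch: inspecting the metadata layer that turns a pile of files
# into a table. Assumes the Iceberg-configured Spark session and the
# illustrative local.db.events table from the earlier example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each snapshot records a transactional commit against the table.
spark.sql("SELECT snapshot_id, operation, committed_at FROM local.db.events.snapshots").show()

# The files metadata table lists the physical data files behind the logical table.
spark.sql("SELECT file_path, record_count FROM local.db.events.files").show(truncate=False)
```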
Catalog Layer: A catalog is a central registry within the lakehouse framework that tracks and manages the metadata of the tables underneath it. It acts as the source of truth for where to find the current state of a table, including its schema, partitions, and data locations, allowing different compute engines to access and manipulate lakehouse tables consistently. Examples include the AWS Glue Data Catalog, Snowflake Open Catalog, Apache Polaris, Unity Catalog, the Hive Metastore, Project Nessie, and REST catalogs.
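As a small illustration, the PyIceberg sketch below connects to a REST catalog and loads a table; the endpoint, warehouse location, and table identifier are placeholders, and real deployments will also need authentication configured for their catalog service.

```python
# Hedged sketch: resolving a table through a REST catalog with PyIceberg.
# The URI, warehouse location, and table identifier are placeholders;
# authentication options depend on the catalog service you use.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

# The catalog is the source of truth for the table's current metadata.
table = catalog.load_table("analytics.events")
print(table.schema())
print(table.current_snapshot())
```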
Query/Compute Layer: Provides the processing power to analyze and query data stored in the storage layer. It may utilize distributed processing engines like Apache Spark, Presto, or Hive, or other cloud data engines, to handle large datasets efficiently. This layer enables users to access and analyze data from the lakehouse using diverse tools and applications such as query engines, BI dashboards, data science platforms, and SQL clients.
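Because the table format and catalog are open, more than one engine can read the same data. As a rough example, the sketch below points DuckDB's iceberg extension at an Iceberg table's metadata file; the S3 path is a placeholder, and object-store credentials would need to be configured separately.

```python
# Hedged sketch: querying the same Iceberg data from a second engine.
# The metadata path is a placeholder; object-store credentials must be
# configured separately (e.g. via DuckDB's httpfs settings).
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")

# iceberg_scan reads the table's current metadata and plans the file scan.
rows = con.execute("""
    SELECT count(*) AS row_count
    FROM iceberg_scan('s3://my-bucket/warehouse/db/events/metadata/v2.metadata.json')
""").fetchall()
print(rows)
```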
Get an unbiased, side-by-side look at all the major cloud data lake vendors, including AWS, Azure, Google, Cloudera, Databricks, and Snowflake.