Data Integration

Apache Iceberg: The Basics

What it is, why it matters and key benefits


Vijay Raja



So what exactly is Iceberg and why does it matter? Let’s dive in.

What is a table format? How is it different from a file format?

File formats like Apache Parquet define how data is serialized and stored within individual files, focusing on storage efficiency and read performance. In contrast, table formats like Apache Iceberg provide a management layer on top of file formats. They define how a logical table is mapped across many physical data files.

You can think of a table format as providing entity-like semantics similar to tables in a database, but applied to files in a cost-effective cloud object storage. These table formats track schema, partitioning, and file-level metadata to optimize access and management of the underlying data files. Critically though, the table formats themselves are not query engines. Rather, query engines leverage table formats to provide more optimized and feature-rich access to the data.
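
To make the idea concrete, here is a minimal, hypothetical sketch in Python of what a table format tracks on top of raw files. Real Iceberg metadata is far richer (manifest files, snapshots, partition specs), so treat this purely as an illustration of the concept, not Iceberg's actual layout.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a "table" is just metadata describing a set of
# physical data files, plus schema and partitioning information.

@dataclass
class DataFile:
    path: str           # location of a Parquet/ORC file in object storage
    partition: dict     # partition values, e.g. {"day": "2024-01-01"}
    record_count: int

@dataclass
class TableMetadata:
    schema: dict                         # column name -> type
    partition_spec: list                 # columns used for partitioning
    files: list = field(default_factory=list)

    def add_file(self, f: DataFile) -> None:
        self.files.append(f)

    def total_records(self) -> int:
        # Answered from metadata alone, without opening any data file.
        return sum(f.record_count for f in self.files)

table = TableMetadata(
    schema={"event_id": "long", "day": "date"},
    partition_spec=["day"],
)
table.add_file(DataFile("s3://bucket/t/day=2024-01-01/a.parquet",
                        {"day": "2024-01-01"}, 1000))
table.add_file(DataFile("s3://bucket/t/day=2024-01-02/b.parquet",
                        {"day": "2024-01-02"}, 500))
print(table.total_records())  # 1500
```

Note how a question like "how many rows are in this table?" is answered from metadata alone; this is the key shift a table format enables over working with bare files.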

What is Apache Iceberg?

Apache Iceberg is an open source, high-performance table format that is designed for large analytical tables and datasets. Iceberg is most commonly used to implement a lakehouse architecture. Lakehouses combine the key features of data warehouses, like ACID transactions and SQL queries, with the cost effectiveness, flexibility and scale of data lakes. Iceberg provides the metadata layer to enable warehouse-like semantics on top of data lake storage.

As an Apache project, Iceberg is 100% open source and not dependent on any individual tools or data lake engines. It was created by engineers at Netflix and Apple, and it is deployed in production by the largest technology companies, proven at scale on some of the world's largest workloads and environments.

Learn more about Apache Iceberg and lakehouses from our webinar: Iceberg Ahead: The Future of Open Lakehouses.

What are the key features of Apache Iceberg?

Iceberg provides a table abstraction layer over data files stored in the lake, which allows users to interact with the data using familiar SQL commands, making it easier to query and manage data in lakehouses.

Iceberg also supports ACID transactions, which ensure data consistency and reliability. With ACID transactions, multiple users can concurrently read and write data to the same table without conflicts or data corruption. This is particularly important in data lake or lakehouse environments where data is constantly being updated and accessed by multiple users and applications.
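
Iceberg implements these transactions with optimistic concurrency. A hypothetical, greatly simplified sketch of that model: each writer records the snapshot it started from, and a commit only succeeds if the table has not moved forward in the meantime. (The class and method names here are illustrative, not Iceberg's API.)

```python
# Hypothetical sketch of optimistic concurrency: a commit succeeds only if
# the table is still at the snapshot the writer started from; otherwise the
# writer must refresh its view and retry.

class Table:
    def __init__(self):
        self.snapshot_id = 0

    def commit(self, expected_snapshot: int, new_snapshot: int) -> bool:
        if self.snapshot_id != expected_snapshot:
            return False            # conflict: another writer committed first
        self.snapshot_id = new_snapshot
        return True

t = Table()
base = t.snapshot_id                # two writers both start from snapshot 0
assert t.commit(base, 1)            # writer A's commit succeeds
assert not t.commit(base, 2)        # writer B detects the conflict and must retry
print(t.snapshot_id)  # 1
```

The important property is that neither writer ever sees or produces a half-applied table state: readers always see a complete snapshot.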

Iceberg supports schema evolution. As data evolves, Iceberg allows users to easily add, drop, or rename columns in a table without the need to rewrite the entire dataset. This flexibility enables organizations to adapt to changing data requirements and avoid costly data migrations.
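
The trick that makes this cheap is that Iceberg tracks columns by stable IDs rather than by name. A hypothetical sketch, with made-up column names, of why a rename or addition is a metadata-only change:

```python
# Hypothetical sketch of schema evolution: columns are keyed by stable IDs,
# so renaming or adding a column updates metadata only, never the data files.

schema_v1 = {1: ("user_name", "string"), 2: ("amount", "double")}

# Rename column 1 and add a new column 3: purely a metadata change.
schema_v2 = dict(schema_v1)
schema_v2[1] = ("customer_name", schema_v1[1][1])
schema_v2[3] = ("currency", "string")

# A data file written under schema v1 still stores values keyed by column ID.
old_file_row = {1: "alice", 2: 9.99}

def read_row(raw_row: dict, schema: dict) -> dict:
    """Project a stored row onto the current schema by column ID."""
    return {name: raw_row.get(col_id)        # columns added later read as None
            for col_id, (name, _type) in schema.items()}

print(read_row(old_file_row, schema_v2))
# {'customer_name': 'alice', 'amount': 9.99, 'currency': None}
```

The old file is read correctly under the new schema without ever being rewritten, which is what lets tables evolve in place.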

Iceberg also provides powerful optimization features that improve query performance for lakehouses. One such optimization is data compaction, which involves merging small data files into larger ones, reducing the number of files that need to be scanned during query execution. Iceberg also supports partition pruning, which allows queries to skip irrelevant partitions, significantly reducing the amount of data that needs to be processed.
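
To illustrate the compaction idea, here is a hypothetical sketch (not Iceberg's actual algorithm) that greedily groups small files toward a target size, shrinking the number of files a query must open:

```python
# Hypothetical sketch of compaction: small files are grouped into larger
# ones up to a target size, reducing per-query file-open overhead.

def compact(file_sizes_mb, target_mb=128):
    """Greedily pack small files into groups of roughly target_mb each."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)          # close the current output file
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

small_files = [5, 10, 15, 20, 30, 40, 60]
groups = compact(small_files)
print(len(small_files), "files ->", len(groups), "files")  # 7 files -> 2 files
```

Fewer, larger files mean fewer metadata entries to track and fewer object-store requests per scan, which is where most of the query speedup comes from.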

Iceberg enables time travel, meaning users can query historical versions of a table. This is particularly useful for auditing, debugging, and reproducing results. With time travel, users can easily compare data changes over time and track the lineage of their data.
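
A hypothetical sketch of the mechanism behind time travel: every commit produces an immutable snapshot, and a query can be pinned to the latest snapshot at or before a given timestamp. (The timestamps and states here are invented for illustration.)

```python
# Hypothetical sketch of time travel: commits append immutable snapshots,
# and an "as of" query resolves to the last snapshot at or before its
# timestamp.

snapshots = [
    # (commit_time, table_state)
    (100, {"rows": 10}),
    (200, {"rows": 25}),
    (300, {"rows": 40}),
]

def as_of(ts):
    """Return the table state visible at timestamp ts, or None if the
    table did not yet exist."""
    state = None
    for commit_time, table_state in snapshots:
        if commit_time <= ts:
            state = table_state
    return state

print(as_of(250))  # {'rows': 25} -- the table as it looked before commit 300
```

Rolling a table back is then just repointing the current-snapshot reference at an older entry, with no data movement.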

By bringing these warehouse-like capabilities to lakehouses, Iceberg makes them more usable and adaptable, enabling organizations to run complex queries, perform real-time updates, and support diverse workloads, including batch processing, streaming, and machine learning.


Apache Iceberg vs Alternatives

Iceberg is not the only table format for lakehouses. Alternatives include Delta Lake, open sourced by Databricks, and Apache Hudi, originally developed at Uber. Many query engines have also implemented proprietary table formats. However, Iceberg's open-source approach and rapidly growing ecosystem have made it a leading standard.

How Iceberg Tables Work

Apache Iceberg tables are logical tables that reference columnar data stored in cloud object stores like Amazon S3, along with associated metadata. The underlying data is stored in columnar formats such as Parquet or ORC, organized according to a partitioning scheme defined in the table metadata. Iceberg uses a sophisticated metadata layer to track the files in a table, including schemas, partitions, and other properties.

Iceberg tracks data in a table in two levels. First, a central metadata store tracks the table schema and partitioning. Second, Iceberg tracks every data file in a table, along with file-level stats and partition information. This detailed metadata powers Iceberg’s advanced features.
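
These two levels can be sketched as follows. This is a hypothetical simplification: in real Iceberg, a table metadata file points at a snapshot, which points at manifest lists and manifest files, which in turn list data files with their stats.

```python
# Hypothetical sketch of Iceberg's two metadata levels: a table-level record
# (schema, partitioning, current snapshot) and a per-file listing with stats.

table_metadata = {
    "schema": ["id:long", "day:date"],
    "partition_spec": ["day"],
    "current_snapshot": "snap-2",
}

manifests = {
    "snap-2": [
        {"path": "a.parquet", "partition": {"day": "2024-01-01"},
         "stats": {"id": (1, 500)}},       # per-column (min, max)
        {"path": "b.parquet", "partition": {"day": "2024-01-02"},
         "stats": {"id": (501, 900)}},
    ],
}

def list_files():
    """Resolve the table's current data files via its metadata pointer."""
    return manifests[table_metadata["current_snapshot"]]

print([f["path"] for f in list_files()])  # ['a.parquet', 'b.parquet']
```

Because the file listing lives in metadata rather than being discovered by scanning directories, planning a query never requires an expensive object-store listing.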

Benefits of the Iceberg Table Format

  • Open source and open standards promote flexibility and interoperability while avoiding vendor lock-in. Iceberg is now supported by an expanding ecosystem of tools such as Snowflake, Databricks, Apache Spark, Flink, Trino, Presto, Hive, and others.

  • ACID transactions provide atomic, isolated table updates and deletes. Iceberg uses an optimistic concurrency model to implement transactions across concurrent engines, allowing multiple writers and readers to safely work on the same table without conflicts.

  • Support for schema evolution makes it easy to change the table schema while maintaining compatibility with existing data. Iceberg handles schema mapping and stores full schema history.

  • Hidden partitioning automatically maps data to partition values based on the table configuration. Partition values don’t need to be stored in the data files themselves. This simplifies data layout and enables easy partition evolution.

  • Time travel allows querying historical table snapshots and rolling back tables. Iceberg’s metadata store tracks every version of a table, allowing recreation of the table at any point in time.

  • Partition pruning and file-level stats dramatically reduce the amount of data scanned per query. Iceberg tracks min/max stats per column for each file, allowing filtering of unnecessary partitions and files.
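
The last benefit above, file-level stats pruning, can be sketched with a hypothetical example: for a filter like `id = 700`, any file whose min/max range provably excludes the value is skipped without ever being opened.

```python
# Hypothetical sketch of file pruning with min/max column stats: files whose
# stats prove a filter value cannot be present are skipped entirely.

files = [
    {"path": "a.parquet", "min_id": 1,   "max_id": 500},
    {"path": "b.parquet", "min_id": 501, "max_id": 900},
    {"path": "c.parquet", "min_id": 901, "max_id": 1200},
]

def prune(files, value):
    """Keep only files whose [min, max] range could contain value."""
    return [f for f in files if f["min_id"] <= value <= f["max_id"]]

print([f["path"] for f in prune(files, 700)])  # ['b.parquet']
```

Here a point lookup touches one file out of three; on a real table with thousands of files, the same idea routinely eliminates most of the scan.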

Building and Managing Iceberg Lakehouses

While Iceberg provides a powerful foundation, managing and optimizing Iceberg tables at scale still requires significant operational overhead. This is where Qlik (together with our recent acquisition of Upsolver) comes in. Qlik Talend Cloud integrated with the Upsolver engine offers a fully managed end-to-end platform that automates the ingestion, transformation, governance and optimization of Iceberg tables.

With Qlik, you can easily ingest real-time and batch data from hundreds of diverse sources and land it directly in auto-optimized Iceberg tables. Qlik automatically handles low-latency ingestion, table creation, schema updates, partitioning, compaction, and more with just a few clicks. It also provides a visual, no-code interface for building, managing, and transforming your data within Iceberg tables, without any infrastructure to manage.

In addition, Qlik offers Adaptive Iceberg Optimizer which can analyze and optimize Iceberg tables to provide up to 5x better query performance and 50% lower storage footprint and costs, compared to self-managed Iceberg implementations. Our adaptive optimizer compacts small files, optimizes file sizes, does dynamic partitioning and cleans up stale metadata to keep query performance high and costs low.

By combining the power of Iceberg with Qlik Talend Cloud’s real-time ingestion, Iceberg automation and management capabilities, organizations can implement a highly optimized lakehouse architecture with minimal operational overhead. This allows data teams to focus on deriving insights rather than managing infrastructure.

Learn More

Join us for Qlik Connect 2025, May 13-15, for engaging sessions and in-depth insights on how to build with Apache Iceberg.

Here are some of the key sessions you don’t want to miss.

  1. Iceberg Ahead: Build an Open Lakehouse with Qlik Talend Cloud and Apache Iceberg

  2. Revolutionizing Iceberg Ingestion: Qlik Talend® and Upsolver in Action

  3. Workshop on Apache Iceberg: Unleash the Power of the Open Lakehouse Table Format

  4. Unleash the Power of Apache Iceberg: Building Your AI and Analytics platform with Qlik Cloud® on AWS
