Apache Iceberg has rapidly become the leading open table format for modern data lakehouses, enabling organizations to combine the cost-effectiveness of data lakes with the reliability and performance of data warehouses. However, successfully implementing and managing Iceberg tables at enterprise scale requires understanding both the technical fundamentals and the operational complexities involved. This guide walks through the essentials of working with Iceberg tables, from creation to optimization.
What is an Iceberg table?
An Apache Iceberg table is a logical table that references columnar data stored in a cloud object store like Amazon S3, alongside the relevant metadata. Underlying data files are stored in a columnar format such as Parquet or ORC and organized according to a partitioning scheme defined in the table metadata. The metadata layer tracks the files in a table along with schemas, partitions, and other table properties. This metadata is stored in manifest files, which contain a list of data files along with each file's partition data. Manifest files are in turn tracked by a manifest list file, which is referenced by a metadata file that maintains the table's state across multiple versions, or snapshots.
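To make this layered structure concrete, Iceberg exposes its metadata as queryable metadata tables. The sketch below is illustrative only: it assumes a Spark session already configured with an Iceberg catalog (named demo here) and an existing table demo.db.events.

```python
# Minimal sketch: inspecting Iceberg's metadata layer with PySpark.
# Assumes an Iceberg-enabled Spark session with a catalog named "demo"
# and an existing table demo.db.events (illustrative names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-metadata").getOrCreate()

# Each snapshot is a committed version of the table's state.
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots").show()

# Manifest files list data files along with their partition data.
spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()

# Data files currently referenced by the table, with per-file partition values.
spark.sql("SELECT file_path, partition, record_count FROM demo.db.events.files").show()
```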
To learn more, read our previous articles:
Working with Iceberg Tables: Create, Read, and Update

Creating Iceberg tables from raw data can be done using various tools and frameworks, such as Apache Spark, Apache Flink, or Qlik Open Lakehouse. These tools allow you to read data from various sources, apply the necessary transformations, and write the processed data into Iceberg tables. Modern platforms offer both zero-ETL approaches that create Iceberg tables directly from raw data sources like S3, Kafka, or databases, handling schema evolution and data partitioning automatically, and managed transformation capabilities when needed.
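As a minimal sketch of the Spark path (the catalog name demo, the database, and the S3 path below are illustrative, and the Spark session is assumed to be configured with an Iceberg catalog):

```python
# Minimal sketch: landing raw files in a new Iceberg table with PySpark.
# The catalog "demo", database "db", and S3 path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-iceberg-table").getOrCreate()

# Read raw source data (CSV here; could equally be JSON, JDBC, Kafka, etc.).
raw = spark.read.option("header", "true").csv("s3://my-bucket/raw/orders/")

# Apply light transformations before landing the data.
orders = raw.withColumnRenamed("order_ts", "order_timestamp")

# Write the result as an Iceberg table managed by the configured catalog.
orders.writeTo("demo.db.orders").using("iceberg").createOrReplace()
```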
Integrating Iceberg tables into your existing data lake or lakehouse workflows is straightforward. Iceberg tables can be queried using various engines like Snowflake, Databricks, Trino, Presto, Apache Spark, Flink, Dremio, and Amazon Athena. This broad ecosystem support allows you to leverage Iceberg's performance optimizations and schema evolution capabilities while using the query engines best suited for each workload. You can create data pipelines that read from Iceberg tables, perform transformations, and write results back to Iceberg or other destinations, depending on your use case.
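For example, a pipeline step that reads an Iceberg table, aggregates, and writes the result back as another Iceberg table might look like this sketch (illustrative table and column names, Iceberg-enabled Spark session assumed):

```python
# Minimal sketch: read from Iceberg, transform, write back to Iceberg.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iceberg-pipeline").getOrCreate()

# Read an existing Iceberg table through the configured catalog.
orders = spark.table("demo.db.orders")

# Transform: daily revenue per customer.
daily = (
    orders
    .groupBy("customer_id", F.to_date("order_timestamp").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Write the derived dataset as another Iceberg table.
daily.writeTo("demo.db.daily_revenue").using("iceberg").createOrReplace()
```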
Updating and modifying Iceberg tables is supported through SQL engines like Spark SQL and platforms like Qlik Open Lakehouse, which allow you to perform row-level updates and deletes efficiently, thanks to Iceberg's merge-on-read capabilities. You can also evolve the schema of an Iceberg table over time by adding, removing, or renaming columns, without the need to rewrite the entire table. Iceberg handles schema evolution by maintaining a history of schema changes in its metadata layer, making it possible to query across different schema versions seamlessly.
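The sketch below shows what row-level changes look like with Spark SQL, assuming the Iceberg SQL extensions are enabled; the table names are illustrative, and the merge-on-read property shown applies to Iceberg format version 2 tables:

```python
# Minimal sketch: row-level changes on an Iceberg table via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-dml").getOrCreate()

# Optional: prefer merge-on-read so updates/deletes write delete files
# instead of rewriting whole data files (requires format version 2).
spark.sql("ALTER TABLE demo.db.orders SET TBLPROPERTIES ('write.merge.mode'='merge-on-read')")

# Upsert changes from a staging table of CDC-style records.
spark.sql("""
    MERGE INTO demo.db.orders AS t
    USING demo.db.order_updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Row-level UPDATE and DELETE statements are also supported directly.
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE order_id = 42")
spark.sql("DELETE FROM demo.db.orders WHERE status = 'cancelled'")
```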
Best Practices for Working with Iceberg Tables
Partitioning: Choosing the right partitioning scheme is crucial for query performance. Iceberg supports hidden partitioning, which derives partition values from column transforms defined in the table configuration, so you don't have to reference separate partition columns in your queries. This enables more flexible partition evolution than the explicit partitioning of Hive-style tables. When designing your partitioning scheme, consider the most common query patterns and partition by columns frequently used in filters to minimize the amount of data scanned.
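For example, the sketch below (illustrative names) partitions a table by a daily transform of a timestamp and a bucket of a user ID; queries filter on the original columns and Iceberg prunes partitions automatically:

```python
# Minimal sketch: hidden partitioning via column transforms.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-partitioning").getOrCreate()

spark.sql("""
    CREATE TABLE demo.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        user_id  STRING,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts), bucket(16, user_id))
""")

# No partition column in the filter; pruning happens through the days() transform.
spark.sql(
    "SELECT count(*) FROM demo.db.events WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'"
).show()
```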
Compaction and optimization: Over time, Iceberg tables can accumulate many small files, especially when ingesting streaming data or handling frequent updates. This can negatively impact query performance. Regularly compact your Iceberg tables to merge small files into larger ones, improving scan performance and reducing storage costs. Additionally, consider sorting your data within each partition based on frequently queried columns to enable efficient range scans. The challenge is determining the optimal strategy for each table—cost-based optimization approaches that analyze each table's characteristics often deliver better results than one-size-fits-all manual approaches.
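Compaction can be scripted with Iceberg's Spark maintenance procedures, as in this sketch (catalog and table names are illustrative, and the options shown are examples rather than recommendations):

```python
# Minimal sketch: compacting small files and sorting within partitions
# using Iceberg's rewrite_data_files procedure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

# Merge small files into larger ones.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        options => map('min-input-files', '5')
    )
""")

# Rewrite with a sort order so range scans on event_ts read fewer files.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'sort',
        sort_order => 'event_ts'
    )
""")
```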
Data retention: Managing data retention is essential to keep storage costs under control and comply with data governance policies. Iceberg provides snapshot expiration and time travel capabilities, allowing you to easily delete old snapshots while maintaining a configurable history of table versions. Implement a retention policy that aligns with your business requirements and automate the process of expiring old snapshots. Don't forget to regularly clean up orphaned files—data files no longer referenced by any snapshot.
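Both tasks can be automated with Iceberg's Spark procedures; in the sketch below the retention values and table names are examples only:

```python
# Minimal sketch: expiring old snapshots and removing orphan files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-retention").getOrCreate()

# Expire snapshots committed before the given timestamp,
# always retaining at least the 10 most recent snapshots.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")

# Remove data files that are no longer referenced by any snapshot.
spark.sql("CALL demo.system.remove_orphan_files(table => 'db.events')")
```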
Schema evolution and query engine support: Iceberg enables schema evolution, allowing you to add, remove, or rename columns without the need to rewrite the entire table. However, ensure that your query engines support Iceberg's schema evolution features. Most modern engines like Snowflake, Databricks, Spark, and Athena fully support these capabilities, but test your schema evolution workflows with the engines you plan to use and keep compatibility in mind when designing your data architecture.
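Schema changes in Iceberg are metadata-only operations, so existing data files are not rewritten. The sketch below shows typical statements with Spark SQL (Iceberg SQL extensions enabled; column names are illustrative):

```python
# Minimal sketch: evolving an Iceberg table's schema with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

spark.sql("ALTER TABLE demo.db.events ADD COLUMN device_type STRING")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN payload TO event_payload")
spark.sql("ALTER TABLE demo.db.events ALTER COLUMN event_id COMMENT 'surrogate key'")
spark.sql("ALTER TABLE demo.db.events DROP COLUMN device_type")
```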
The Scale Challenge
While Iceberg provides powerful capabilities, working with high volumes of data and managing tables at enterprise scale introduces operational complexity. Ingesting data from hundreds of diverse sources, including operational databases, cloud SaaS applications, and legacy systems, directly into query-ready Iceberg tables in near real time is challenging in itself; doing so while achieving optimal performance becomes a significant data engineering effort. Continuous maintenance is required to keep up with best practices around compaction, partitioning adjustments, and metadata cleanup.
For organizations managing hundreds or thousands of tables, manually managing the ingestion pipelines while tuning each table becomes impractical. Different tables have different characteristics and require different optimization strategies. This operational burden demands specialized expertise and ongoing engineering resources, diverting teams from building features and generating business value.
To solve these and other challenges, Qlik has introduced Open Lakehouse as an automated solution for working with Iceberg tables.
Simplifying Iceberg Management with Qlik Open Lakehouse
Qlik Open Lakehouse is a fully managed capability within Qlik Talend Cloud that simplifies building and scaling Apache Iceberg-based lakehouses. With just a few clicks, you can ingest both batch and real-time data from hundreds of sources, including operational databases, SaaS applications, and streaming platforms, directly into Iceberg tables. The platform automatically handles schema mapping, type conflict resolution, and CDC processing for row-level changes. Ingestion into Iceberg does not require a data warehouse, and the platform runs on cost-effective Amazon EC2 Spot Instances (with built-in failover mechanisms), delivering up to 50-90% lower cost and compute consumption for data ingestion.
See the Qlik Open Lakehouse demo here:

At its core is the Qlik Adaptive Iceberg Optimizer, which continuously monitors and optimizes tables based on each table's unique characteristics. Rather than requiring manual configuration, it analyzes factors like data profile, update frequency, and query patterns to determine the most impactful optimizations automatically, delivering up to 5x better query performance compared to unoptimized Iceberg tables, with zero manual effort.
The platform integrates with leading Iceberg catalogs such as AWS Glue and supports querying through a broad range of engines, including Snowflake, Amazon Athena, Amazon SageMaker Studio, Trino, Presto, and more. Customers can also use Qlik Cloud Analytics with Amazon Athena to effortlessly query data in the lakehouse. Additional catalog integrations, including Snowflake Open Catalog, Apache Polaris (incubating), and Databricks Unity Catalog, are coming soon. For organizations with existing data warehouse investments, data warehouse mirroring automatically makes Iceberg tables queryable within platforms like Snowflake without duplicating data.
As part of Qlik Talend Cloud, it provides a unified solution for the entire data lifecycle—ingestion, transformation, optimization, quality validation, and governance—eliminating the need to cobble together multiple point solutions.
To learn more, read Qlik Open Lakehouse: Now Generally Available