Apache Iceberg

Dremio 19.0+ supports the Apache Iceberg open table format. Iceberg is an open-source standard for defining structured tables in the data lake. It enables multiple applications, such as Dremio, to work on the same data consistently and to track dataset state with transactional consistency as changes are made.

For requirements, instructions on enabling table format support, and more, see Using Iceberg Tables with Dremio.

Understanding Iceberg Tables

The Iceberg table format offers capabilities similar to those of SQL tables in traditional databases. Unlike those tables, Iceberg is fully open and accessible, which allows multiple engines (e.g., Dremio and Spark) to operate on the same dataset.

Iceberg tracks point-in-time snapshots of a table through metadata files (such as manifests), which record every change made to the table. Each snapshot provides a complete description of the table’s schema, partition, and file information. Iceberg also organizes snapshot metadata in a hierarchical structure, which lets Dremio apply changes to tables quickly without redefining all dataset files and keeps performance high at data lake scale.
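The snapshot behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not Iceberg's actual file format or API: the `Snapshot` and `Table` classes, the column names, and the manifest file names are all hypothetical. It shows that a commit adds a new snapshot that reuses the manifests of the previous one, while earlier snapshots stay unchanged.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """A point-in-time description of a table (hypothetical, simplified)."""
    snapshot_id: int
    schema: Tuple[str, ...]      # column names only, for brevity
    manifests: Tuple[str, ...]   # manifest files that make up this snapshot


@dataclass
class Table:
    schema: Tuple[str, ...]
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Optional[Snapshot]:
        """Return the most recent snapshot, or None for an empty table."""
        return self.snapshots[-1] if self.snapshots else None

    def append(self, new_manifest: str) -> Snapshot:
        """Commit new data: reuse existing manifests and add one new manifest."""
        prev = self.current()
        manifests = (prev.manifests if prev else ()) + (new_manifest,)
        snap = Snapshot(len(self.snapshots) + 1, self.schema, manifests)
        self.snapshots.append(snap)
        return snap


# Hypothetical table with two commits.
table = Table(schema=("id", "name"))
first = table.append("manifest-a.avro")
second = table.append("manifest-b.avro")

# The newer snapshot reuses manifest-a and adds manifest-b; the older
# snapshot still describes the table exactly as it was after the first commit.
print(first.manifests)
print(second.manifests)
```

Because each snapshot is immutable and complete, a reader holding an older snapshot is unaffected by later commits.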

Visualizing Iceberg Components

Iceberg table architecture consists of three layers:

  1. The Iceberg catalog. Services use the catalog to find the current metadata pointer for a table, which identifies where to read or write that table’s data. The catalog holds a reference for each table that points to the table’s current metadata file.
  2. The metadata layer. This layer consists of three components: metadata file, manifest list, and manifest file. The metadata file includes information about a table’s schema, partition information, snapshots, and the current snapshot. The manifest list contains a list of manifest files, along with information about the manifest files that make up a snapshot. Manifest files track data files, along with other details and statistics about each file.
  3. The data layer. This layer holds the data files themselves. Each manifest file tracks a subset of data files and records details about each one, such as partition membership, record count, and lower and upper bounds of columns.
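As a rough illustration of how the three layers relate, the following Python sketch walks from the catalog to the data files of a table's current snapshot. It assumes a much-simplified in-memory model: the class names, the `plan_scan` function, the `sales` table, and every path are hypothetical and do not reflect Iceberg's real metadata format or any Dremio API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataFileEntry:
    """What a manifest records about one data file."""
    path: str
    partition: str
    record_count: int


@dataclass
class ManifestFile:
    entries: List[DataFileEntry]


@dataclass
class ManifestList:
    manifests: List[ManifestFile]


@dataclass
class MetadataFile:
    schema: List[str]
    snapshots: Dict[int, ManifestList]   # snapshot id -> manifest list
    current_snapshot_id: int


# Catalog layer: table name -> location of the current metadata file.
catalog: Dict[str, str] = {"sales": "metadata/v2.metadata.json"}

# Stand-in for object storage: metadata location -> parsed metadata file.
storage: Dict[str, MetadataFile] = {
    "metadata/v2.metadata.json": MetadataFile(
        schema=["id", "amount", "day"],
        snapshots={
            1: ManifestList([
                ManifestFile([DataFileEntry("data/f1.parquet", "day=1", 100)]),
            ]),
            2: ManifestList([
                ManifestFile([DataFileEntry("data/f1.parquet", "day=1", 100)]),
                ManifestFile([DataFileEntry("data/f2.parquet", "day=2", 250)]),
            ]),
        },
        current_snapshot_id=2,
    )
}


def plan_scan(table_name: str) -> List[str]:
    """Resolve catalog -> metadata file -> manifest list -> manifests -> data files."""
    metadata = storage[catalog[table_name]]
    manifest_list = metadata.snapshots[metadata.current_snapshot_id]
    return [
        entry.path
        for manifest in manifest_list.manifests
        for entry in manifest.entries
    ]


print(plan_scan("sales"))
```

In this model, committing a change only requires writing new metadata and then updating the single catalog entry to point at it.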

Benefits of Iceberg Tables

Iceberg tables offer the following benefits over other formats traditionally used in the data lake:

  • Schema evolution. Supports add, drop, update, or rename column commands with no side effects or inconsistency.
  • Optimized processes. Uses advanced filtering wherever logically possible, which prevents user mistakes that would inadvertently slow queries.
  • Partition evolution. Allows partition layouts in a table to be modified, for example when data volume or query patterns change, without rewriting the entire table.
  • Time travel. Allows users to query any previous version of the table to examine and compare data or to reproduce results using previous queries.
  • Version rollback. Corrects any discovered problems quickly by resetting tables to a known good state.
  • Increased performance. Ensures data files are intelligently filtered for accelerated processing via advanced partition pruning and column-level statistics.
  • Transactional consistency. Helps users avoid partial or uncommitted changes by tracking atomic transactions with ACID properties.
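To make the role of column-level statistics concrete, here is a minimal Python sketch of file pruning. It assumes each data file has a recorded lower and upper bound for one integer column; the `FileStats` class, the `prune` function, and the file names are hypothetical and only illustrate the idea, not how Dremio or Iceberg implement it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Per-file statistics for one column, as a manifest might record them."""
    path: str
    lower: int   # smallest value of the column in this file
    upper: int   # largest value of the column in this file


def prune(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [lower, upper] range overlaps the filter [lo, hi]."""
    return [f.path for f in files if f.upper >= lo and f.lower <= hi]


files = [
    FileStats("a.parquet", 1, 100),
    FileStats("b.parquet", 101, 200),
    FileStats("c.parquet", 201, 300),
]

# A filter such as "value BETWEEN 150 AND 220" cannot match any row in
# a.parquet, so that file is skipped without being opened.
selected = prune(files, 150, 220)
print(selected)  # ['b.parquet', 'c.parquet']
```

The decision uses only metadata, so files that cannot contain matching rows are never read.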

Additional Resources

For more information about Apache Iceberg, view the following resources: