Dremio 19.0+ supports the Apache Iceberg open table format. Iceberg is an open-source standard for defining structured tables in the data lake. It enables multiple applications, such as Dremio, to work on the same data consistently and to track dataset state with transactional consistency as changes are made.
Understanding Iceberg Tables
The Iceberg table format has similar capabilities and functionality to SQL tables in traditional databases. Unlike tables in those databases, Iceberg is fully open and accessible, which allows multiple engines (e.g., Dremio, Spark) to operate on the same dataset.
Using metadata files (i.e., manifests), Iceberg tracks point-in-time snapshots by maintaining all deltas as a table. Each snapshot provides a complete description of the table’s schema, partition, and file information. Iceberg also organizes snapshot metadata in a hierarchical structure. This enables Dremio to apply changes to tables quickly without redefining all dataset files, which helps maintain performance at data lake scale.
Visualizing Iceberg Components
Iceberg table architecture consists of three layers:
- The Iceberg catalog. The catalog is where services go to find the location of the current metadata pointer, which helps identify where to read or write data for a given table. It holds a reference, or pointer, for each table that identifies that table’s current metadata file.
- The metadata layer. This layer consists of three components: metadata file, manifest list, and manifest file. The metadata file includes information about a table’s schema, partition information, snapshots, and the current snapshot. The manifest list contains a list of manifest files, along with information about the manifest files that make up a snapshot. Manifest files track data files in addition to other details and statistics about each file.
- The data layer. This layer holds the table’s data files. Each manifest file tracks a subset of these data files and records details such as partition membership, record count, and the lower and upper bounds of columns.
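The three layers above can be pictured as nested records. The following is only a minimal sketch in Python, not Iceberg's actual file layout or any Dremio API: the class names, field names, and the `files_for_snapshot` helper are all hypothetical and chosen purely for illustration. It shows how a reader might go from the catalog pointer to the metadata file, then through a snapshot's manifest list to the data files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    # Data layer: one physical file plus details a manifest records about it.
    path: str
    partition: str
    record_count: int


@dataclass
class Manifest:
    # Metadata layer: a manifest file tracks a subset of the data files.
    data_files: List[DataFile]


@dataclass
class Snapshot:
    snapshot_id: int
    # The manifest list: the manifests that make up this snapshot.
    manifest_list: List[Manifest]


@dataclass
class MetadataFile:
    schema: Dict[str, str]
    partition_columns: List[str]
    snapshots: List[Snapshot]
    current_snapshot_id: int


@dataclass
class Catalog:
    # Catalog layer: table name -> current metadata file (the "pointer").
    pointers: Dict[str, MetadataFile] = field(default_factory=dict)


def files_for_snapshot(
    catalog: Catalog, table: str, snapshot_id: Optional[int] = None
) -> List[str]:
    """Walk catalog -> metadata file -> manifest list -> manifests -> files."""
    metadata = catalog.pointers[table]
    wanted = metadata.current_snapshot_id if snapshot_id is None else snapshot_id
    for snapshot in metadata.snapshots:
        if snapshot.snapshot_id == wanted:
            return [
                data_file.path
                for manifest in snapshot.manifest_list
                for data_file in manifest.data_files
            ]
    raise KeyError(f"snapshot {wanted} not found for table {table!r}")


# Hypothetical table with two snapshots; the second reuses the first manifest.
m1 = Manifest([DataFile("data/a.parquet", "2021-01", 100)])
m2 = Manifest([DataFile("data/b.parquet", "2021-02", 50)])
metadata = MetadataFile(
    schema={"id": "int", "day": "date"},
    partition_columns=["day"],
    snapshots=[Snapshot(1, [m1]), Snapshot(2, [m1, m2])],
    current_snapshot_id=2,
)
catalog = Catalog({"sales": metadata})

current_files = files_for_snapshot(catalog, "sales")
older_files = files_for_snapshot(catalog, "sales", snapshot_id=1)
```

In this toy model, snapshot 2 reuses the manifest from snapshot 1 and adds one new manifest, which mirrors the idea that a new snapshot does not need to redefine every existing file.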
Benefits of Iceberg Tables
Iceberg tables offer the following benefits over other formats traditionally used in the data lake:
- Schema evolution. Supports add, drop, update, or rename column commands with no side effects or inconsistency.
- Optimized processes. Prevents user mistakes that can inadvertently slow queries by utilizing advanced filtering where logically possible.
- Partition evolution. Allows partition layouts in a table to be modified, for example when data volume or query patterns change, without rewriting the entire table.
- Time travel. Allows users to query any previous version of the table to examine and compare data or reproduce results using previous queries.
- Version rollback. Corrects any discovered problems quickly by resetting tables to a known good state.
- Increased performance. Ensures data files are intelligently filtered for accelerated processing via advanced partition pruning and column-level statistics.
- Transactional consistency. Helps users avoid partial or uncommitted changes by tracking atomic transactions with ACID properties.
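To make the "increased performance" item more concrete, here is a rough sketch of how column-level statistics can be used to skip files. It is a simplified illustration under stated assumptions, not Dremio's or Iceberg's implementation: the `FileStats` class and `prune_files` function are hypothetical, and the sketch handles only a single integer column with a closed range filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    path: str
    lower: int  # lower bound of the filtered column within this file
    upper: int  # upper bound of the filtered column within this file


def prune_files(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [lower, upper] range overlaps the filter [lo, hi]."""
    return [f.path for f in files if f.upper >= lo and f.lower <= hi]


# Hypothetical per-file statistics, as a manifest might record them.
files = [
    FileStats("data/part-0.parquet", 1, 100),
    FileStats("data/part-1.parquet", 101, 200),
    FileStats("data/part-2.parquet", 201, 300),
]

# A filter such as "value BETWEEN 150 AND 210" cannot match part-0.
candidates = prune_files(files, 150, 210)
```

Files whose bounds fall entirely outside the filter range never need to be opened, which is the intuition behind filtering on column-level statistics.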
Preview access to Apache Iceberg is available with the following versions of Dremio:
- Dremio v18.0 - ADLS sources
- Dremio v19.0 - ADLS, Hive, S3, and HDFS sources
- Dremio v20.0 - Glue sources
To use this new functionality, enable the following support keys with your Dremio installation:
- dremio.iceberg.enabled - Enables the overall Apache Iceberg functionality.
- dremio.execution.support_unlimited_splits - Enables data splits numbering above 60,000.
Promoting the Iceberg Table
During table promotion, Dremio automatically identifies newly created tables that use the Apache Iceberg format and refers to the most recent Iceberg snapshot to define the table’s schema, including column names, column types, and partitions.
Capturing Iceberg Snapshots
After table promotion, Dremio automatically updates the table’s schema and metadata information to reflect new snapshots as the Iceberg table is updated.
Currently, new Iceberg snapshots are identified within 60 seconds of being written. To have a recently written snapshot identified immediately, run a manual metadata refresh against the table using the following SQL command:
ALTER TABLE <table_name> REFRESH METADATA
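When the refresh is issued from a script, the statement above has to be assembled for a specific table. The sketch below only builds the SQL text and does not connect to Dremio. The helper names are hypothetical, and quoting each path part with double quotes is an assumption about identifier syntax that should be checked against your Dremio version.

```python
from typing import Sequence


def quote_identifier(part: str) -> str:
    # Assumption: identifiers are wrapped in double quotes, and an embedded
    # double quote is escaped by doubling it.
    return '"' + part.replace('"', '""') + '"'


def refresh_metadata_sql(table_path: Sequence[str]) -> str:
    """Build the ALTER TABLE ... REFRESH METADATA statement for a table path."""
    if not table_path:
        raise ValueError("table_path must contain at least one name")
    qualified = ".".join(quote_identifier(part) for part in table_path)
    return f"ALTER TABLE {qualified} REFRESH METADATA"


# Hypothetical source, folder, and table names.
statement = refresh_metadata_sql(["my_source", "sales", "orders"])
```

The resulting string can then be submitted through whichever SQL client or interface you already use with Dremio.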
Alternatively, the 60-second interval can be adjusted by setting the following support key to the desired time interval. However, for sources using the HDFS Iceberg catalog, we recommend against setting this value below 60 seconds.
Understanding Iceberg Catalogs
The Apache Iceberg table format utilizes an Iceberg catalog service to track snapshots and ensure transactional consistency between tools. Dremio sources use the following catalogs by default:
|Dremio Source Type|Default Iceberg Catalog|
|Hive|Hive Iceberg Catalog|
|AWS S3 / ADLS / GCS / Glue / HDFS|Hadoop/Glue Iceberg Catalog|
The current implementation of Apache Iceberg with Dremio does not support the following functionalities:
- Iceberg tables containing file formats other than Parquet
- Tables written using Spark 2
- Partition transforms (e.g.,
- Partition evolution
For more information about Apache Iceberg, view the following resources: