On this page

    Apache Iceberg

    Apache Iceberg is an open table format designed for gigantic, petabyte-scale tables and is rapidly becoming an industry standard for managing data in data lakes. A table format helps you manage, organize, and track all of the files that make up a table. Iceberg was created to solve challenges with traditional file formatted tables in data lakes including data and schema evolution and consistent concurrent writes in parallel. Additionally, Iceberg introduces new capabilities that enable multiple applications, such as Dremio, to work together on the same data in a transactionally consistent manner and defines additional information on the state of datasets as they evolve and change over time.

    Dremio integrated support for version 2 Iceberg tables to leverage powerful SQL database-like functionality through industry standard methods for data lakes. Additionally, Dremio supports the time travel and metadata capabilities available in Iceberg tables.

    Note:

    Apache Iceberg is GA in Dremio starting with v21.0. If you are using previous versions of Dremio, preview access to Apache Iceberg was implemented in the following versions:

    • Dremio v18.0 - ADLS sources

    • Dremio v19.0+ - ADLS, Hive, S3, and HDFS sources

    • Dremio v20.0 - ADLS, Hive, S3, HDFS, and Glue sources

    Note:
    For the syntax of the SQL commands that Dremio supports for Iceberg tables, see SELECT.

    For the syntax of the SQL commands that Dremio supports for Iceberg tables, see SQL Commands for Apache Iceberg Tables.

    For a deeper dive into Apache Iceberg, see:

    How Dremio Uses Iceberg Tables

    The Iceberg table format has similar capabilities and functionality to SQL tables in traditional databases. Unlike such datasets, Iceberg functions in a fully-open and accessible manner that allows multiple engines (such as Dremio, Spark, etc.) to operate on the same dataset.

    Via metadata files (that is, manifests), Iceberg tracks point-in-time snapshots by maintaining all deltas as a table. Each snapshot provides a complete description of the table’s schema, partition, and file information. Additionally, Iceberg intelligently organizes snapshot metadata in a hierarchical structure. This enables Dremio to employ fast and efficient changes to tables without redefining all dataset files, thus ensuring optimal performance when working at data lake scale.

    The Iceberg table’s architecture consist of three layers:

    1. The Iceberg catalog: The catalog is where services go to find the location of the current metadata pointer, which helps identify where to read or write data for a given table. Here is where references or pointers exist for each table that identify each table’s current metadata file.
    2. The metadata layer: This layer consists of three components: metadata file, manifest list, and manifest file. The metadata file includes information about a table’s schema, partition information, snapshots, and the current snapshot. The manifest list contains a list of manifest files, along with information about the manifest files that make up a snapshot. Manifest files track data files in addition to other details and statistics about each file.
    3. The data layer: Each manifest file tracks a subset of data files, which contain details about partition membership, record count, and lower- and upper-bounds of columns.

    Benefits of Iceberg Tables

    Iceberg tables offer the following benefits over other formats traditionally used in the data lake, including:

    • Schema evolution: Supports add, drop, update, or rename column commands with no side effects or inconsistency.
    • Partition evolution: Facilitates the modification of partition layouts in a table, such as data volume or query pattern changes without needing to rewrite the entire table.
    • Time travel: Allows users to query any previous versions of the table to examine and compare data or reproduce results using previous queries.
    • Increased performance: Ensures data files are intelligently filtered for accelerated processing via advanced partition pruning and column-level statistics.
    • Transactional consistency: Helps users avoid partial or uncommitted changes by tracking atomic transactions with atomicity, consistency, isolation, and durability (ACID) properties.
    • Version rollback: Corrects any discovered problems quickly by resetting tables to a known good state.
    • Version rollback: Corrects any discovered problems quickly by resetting tables to a known good state.
    • Table optimization: Optimize query performance to maximize the speed and efficiency with which data is retrieved.
    • Table optimization: Optimize query performance to maximize the speed and efficiency with which data is retrieved.

    Promoting the Iceberg Table

    During table promotion, Dremio automatically identifies newly-created tables using the Apache Iceberg format and refers to the most recent Iceberg snapshot to define the table’s schema, including column names, column types, and partitions.

    Iceberg Catalogs in Dremio

    The Apache Iceberg table format uses an Iceberg catalog service to track snapshots and ensure transactional consistency between tools. For more information about how Iceberg catalogs and tables work together, see Iceberg Catalog.

    note:

    Currently, Dremio does not support the Amazon DynamoDB nor JDBC catalogs. For additional information on limitations of Apache Iceberg as implemented in Dremio, see Limitations.

    Once you create a table using an Iceberg catalog, you cannot access that table from a different Iceberg catalog. Essentially, the catalog is the source of truth for the current metadata pointer for the table. For example, if you create a table using the AWS Glue catalog with Amazon S3 as a backing store for the metadata or data files, then you cannot use a Hadoop catalog pointing to that S3 location to read/write to the table. You would always need to use the Glue catalog to access the table.

    Sources that you connect in Dremio use the following catalogs by default, which can support both read and write commands into the Iceberg tables:

    Dremio Source Type Default Iceberg Catalog
    AWS Glue AWS Glue Iceberg catalog
    Amazon S3 / ADLS Hadoop Iceberg catalog
    Dremio Source Type Default Iceberg Catalog
    Hive Hive Iceberg catalog
    AWS Glue AWS Glue Iceberg catalog
    Amazon S3 / ADLS / GCS / HDFS Hadoop Iceberg catalog

    Iceberg Partitioning in Dremio

    The Apache Iceberg table format uses partitioning as a way to make queries faster by grouping similar rows together when writing. Iceberg can partition timestamps by year, month, day, and hour granularity. For more information about how Iceberg handles partitioning, see Partitioning in the Apache Iceberg docs.

    For specific partition transforms that Dremio supports, see Create Table.

    Dremio software supports Iceberg partitioning as of v21.0+, including:

    For specific partition transforms that Dremio supports, see Creating Apache Iceberg Tables > Syntax.

    Supported Iceberg Metadata Functions

    Iceberg includes helpful system table references which provides an easy method for users to access Iceberg specific information on a table, including:

    • A table’s history
    • The snapshots for a table
    • The data files for a table
    • The manifest files for a table

    For information about querying metadata in Iceberg tables, see SELECT.
    For information about querying metadata in Iceberg tables, see Querying Apache Iceberg Tables > Metadata Queries.

    Limitations

    The following are limitations with Apache Iceberg as implemented in Dremio:

    • Only Parquet file formats are currently supported. Other formats (such as ORC and Avro) are not supported at this time.
    • Amazon DynamoDB and JDBC catalogs are currently not supported.
    • Unable to use DynamoDB as a lock manager with the Hadoop catalog on Amazon S3.
    • Cannot create Iceberg tables using the within-partition sort capability. Additionally, this sort is not maintained in an Iceberg table when writing DML statements to it.
    • Iceberg tables written with merge-on-read equality or positional delete are not supported.
    • Incremental reflections are not supported. For information about reflections, see Accelerating Queries with Reflections.
    • Iceberg tables written with merge-on-read equality deletes are not supported.
    • Incremental reflections are not supported. For information about reflections, see Data Reflections.