    Optimizing Tables

    Dremio integrates support for Apache Iceberg tables, an open-source data lakehouse table format designed for petabyte-scale tables. Iceberg tables bring the reliability and simplicity of SQL tables to the data lakehouse. Iceberg tables also function in a fully open and accessible manner that allows multiple engines (such as Dremio, Spark, etc.) to operate on the same dataset. However, as these tables are constantly being written to and edited, the number of data files accumulates, and many of the accumulated files can be small. Such an accumulation of small files degrades query performance: queries gradually become less efficient because of the increased processing time required to open and search through these files.

    Dremio enables you to optimize query performance to maximize the speed and efficiency with which data is retrieved. You optimize performance by running SQL statements that optimize Iceberg tables. These SQL statements can be run for all Iceberg tables and catalogs in Dremio.

    How Dremio Optimizes a Table

    Dremio enables you to rewrite data files (via compaction) using the optimize functionality. Optimization combines small files into files of an optimal size and splits large files, reducing metadata overhead and runtime file-open costs. This functionality also enables you to repartition the data when a table’s partition has changed.

    You may need to optimize Iceberg tables in Dremio when file sizes are sub-optimal or the partition scheme has evolved. The optimize functionality supports the bin_pack clause, enabling users to bin pack files in the partitions they actively write to.
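
    As a sketch, a bin-pack optimization might look like the following. The table name is illustrative; see the Optimize Table reference for the exact syntax supported in your Dremio version.

    ```sql
    -- Rewrite the data files of an Iceberg table using bin packing.
    -- The catalog/table path below is a hypothetical example.
    OPTIMIZE TABLE myCatalog.sales.orders
      REWRITE DATA USING BIN_PACK;
    ```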

    Sub-optimal File Sizes

    For Iceberg tables that are constantly being updated (adding, editing, and deleting data), these frequent writes can create data files of widely varying sizes, and inconsistent file sizes can degrade query performance. The optimal file size in Dremio is 256 MB. The optimize functionality combines smaller files and splits larger ones toward the 256 MB target (see the following graphic), helping to reduce metadata overhead and the costs of opening and reading files.
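
    For example, a statement along these lines compacts files toward the 256 MB default. The option name shown is an assumption; verify it against the Optimize Table reference before use.

    ```sql
    -- Compact data files toward the 256 MB target size.
    -- TARGET_FILE_SIZE_MB is assumed here; check the Optimize Table
    -- reference for the option names your Dremio version supports.
    OPTIMIZE TABLE myCatalog.sales.orders
      REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB = 256);
    ```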

    Optimizing file sizes in Dremio.

    Evolving Partition Scheme

    To make your queries more efficient, data can be partitioned based on the values of a table’s columns. If the columns used in a partition evolve over time, query performance can suffer when queries are not aligned with the table’s current partition layout. Optimizing a table identifies and rewrites data files that do not follow the most recent partition specification.
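
    If only part of the table needs rewriting, a partition filter can limit the work. The FOR PARTITIONS clause and the predicate below are a sketch under that assumption; confirm the clause in the Optimize Table reference.

    ```sql
    -- Rewrite only the data files in selected partitions so they
    -- follow the latest partition specification. The FOR PARTITIONS
    -- clause and predicate are illustrative assumptions.
    OPTIMIZE TABLE myCatalog.sales.orders
      FOR PARTITIONS order_date >= '2024-01-01'
      REWRITE DATA USING BIN_PACK;
    ```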

    For guidance on using the optimize functionality in your queries, see Optimize Table.

    Routing Optimization Jobs

    You can route jobs that run OPTIMIZE TABLE to specific engines by using a routing rule that uses the query_label() condition; for more information, see Engine Routing. You can likewise route these jobs to specific queues by using a routing rule that uses the query_label() condition; for more information, see Workload Management.
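
    A routing-rule condition for these jobs could be written as follows. The label value 'OPTIMIZATION' is an assumption; check the Engine Routing and Workload Management documentation for the exact label Dremio assigns to OPTIMIZE TABLE jobs.

    ```sql
    -- Routing-rule condition matching optimization jobs.
    -- The label value 'OPTIMIZATION' is an assumed example.
    query_label() = 'OPTIMIZATION'
    ```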

    Limitations

    • You can run only one optimize query at a time on the selected Iceberg table.
    • The optimize functionality does not support the following capabilities yet:
      • Iceberg tables with delete files.
      • Sort ordering.
      • Rewriting manifest files.