Automatic Optimization
As Apache Iceberg tables are written to and updated, data and metadata files accumulate, which can affect query performance. For example, small files produced by data ingestion jobs slow queries because the query engine must read more files.
To optimize performance, Dremio automates table maintenance in the Open Catalog. This process compacts small files into larger ones, partitions data based on the values of a table's columns, rewrites manifest files, removes position delete files, and clusters tables—improving query speed while reducing storage costs.
Automatic optimization runs on a dedicated engine configured by Dremio, ensuring peak performance without impacting project query workloads.
When Dremio optimizes a table, it evaluates file sizes, partition layout, and metadata organization to reduce I/O and metadata overhead. Optimization consists of five main operations: clustering, data file compaction, partition evolution, manifest file rewriting, and removal of position delete files.
Clustering
Iceberg clustering sorts individual records in data files based on the clustering columns provided in the CREATE TABLE or ALTER TABLE statement.
To cluster a table, you must first define the clustering keys; automatic optimization then uses those keys to sort the table's data files. For details, see Cluster Tables.
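As a hedged illustration, the statements below define clustering keys that automatic optimization can then use. The table and column names are made up, and the CLUSTER BY clause follows the pattern described in Cluster Tables; treat the exact syntax shown here as an assumption and refer to that page for the authoritative form.

```sql
-- Illustrative sketch (assumed syntax; see Cluster Tables for the exact form).
-- Define clustering keys at table creation time.
CREATE TABLE demo.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_ts    TIMESTAMP
)
CLUSTER BY (customer_id, order_ts);

-- Or add clustering keys to an existing table; automatic optimization then
-- sorts the table's data files on these keys.
ALTER TABLE demo.orders CLUSTER BY (customer_id, order_ts);
```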
Data File Compaction
Iceberg tables that are constantly being updated can have data files of various sizes. As a result, query performance can be negatively affected by sub-optimal file sizes. The optimal file size in Dremio is 256 MB.
Dremio logically combines smaller files and splits larger ones toward the 256 MB target, helping to reduce metadata overhead and the cost of opening and reading many files.
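For illustration, the command below manually triggers the same kind of bin-pack compaction toward the 256 MB target. The table name is made up and the option shown is an assumption based on Dremio's OPTIMIZE TABLE command; automatic optimization performs the equivalent work for you.

```sql
-- Illustrative sketch: compact small files and split large ones toward the
-- 256 MB target. Table name and option value are assumptions.
OPTIMIZE TABLE demo.orders
  REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB = 256);
```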
Partition Evolution
To improve read or write performance, data is partitioned based on the values of a table's columns. If a table's partition columns change over time, query performance can suffer when queries are not aligned with the current partition specification. Dremio detects files written under older partition specifications and rewrites them to match the current one (see the sketch after the following list). This operation is useful:
- When certain partitions are queried more often or are more important than others, and it is not necessary to optimize the entire table.
- When certain partitions are more active and are constantly being updated; optimization should run only when activity on those partitions is low or paused.
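As a rough sketch of this scenario, the statements below evolve a table's partition spec and then optimize only the partitions of interest. The table, column, transforms, and the FOR PARTITIONS predicate are illustrative assumptions, not a prescribed workflow.

```sql
-- Illustrative sketch: evolve the partition spec from daily to monthly.
-- Files written under the old spec remain until they are rewritten.
ALTER TABLE demo.orders DROP PARTITION FIELD day(order_ts);
ALTER TABLE demo.orders ADD PARTITION FIELD month(order_ts);

-- Optionally limit the rewrite to the partitions that matter most
-- (the predicate below is an assumption for illustration).
OPTIMIZE TABLE demo.orders
  REWRITE DATA USING BIN_PACK
  FOR PARTITIONS order_ts >= DATE '2024-01-01';
```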
Manifest File Rewriting
Iceberg uses metadata files (manifests) to track point-in-time snapshots by recording every change (delta) made to a table. This metadata layer functions as an index over the table's data: the manifest files it contains speed up query planning and allow unnecessary data files to be pruned. For Iceberg tables that are constantly being updated (for example, through streaming ingestion or frequent DML operations), the number of suboptimally sized manifest files can grow over time, and the clustering of metadata entries within those files can degrade. As a result, suboptimal manifests can increase the time it takes to plan and execute a query.
Dremio rewrites these manifest files based on size criteria. The target size for a manifest file comes from the Iceberg table's manifest target-size property (commit.manifest.target-size-bytes); if the property is not set, Dremio defaults to 8 MB. Relative to the target size, Dremio considers the range from 0.75x to 1.8x, inclusive, to be optimal: manifest files larger than 1.8x the target are split, while files smaller than 0.75x the target are compacted.
This operation results in the optimization of the metadata, helping to reduce query planning time.
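A minimal sketch of the equivalent manual steps, assuming Dremio's OPTIMIZE TABLE ... REWRITE MANIFESTS command and support for setting the standard Iceberg table property; the table name and property value are illustrative, and automatic optimization applies the same size criteria without any of these statements.

```sql
-- Illustrative sketch: set the manifest target size (8 MB shown, the default)
-- and rewrite manifest files that fall outside the 0.75x-1.8x range.
ALTER TABLE demo.orders
  SET TBLPROPERTIES ('commit.manifest.target-size-bytes' = '8388608');

OPTIMIZE TABLE demo.orders REWRITE MANIFESTS;
```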
Position Delete Files
Iceberg v2 added delete files, which encode the rows that have been deleted in existing data files. This lets you delete or replace individual rows in immutable data files without rewriting those files. Position delete files identify deleted rows by file path and row position in one or more data files, as shown in the following example.
| file_path | pos |
|---|---|
| file:/Users/test.user/Downloads/gen_tables/orders_with_deletes/data/2021/2021-00.parquet | 6 |
| file:/Users/test.user/Downloads/gen_tables/orders_with_deletes/data/2021/2021-00.parquet | 16 |
Dremio can optimize Iceberg tables that contain position delete files. This is beneficial because when data files are read, their associated delete files must be held in memory, and a single data file can be linked to several delete files, which increases read time.
When tables are optimized in Dremio, the position delete files are removed and the data files linked to them are rewritten, as shown in the sketch after the following list. Data files are rewritten if any of the following conditions are met:
- The file size is not within the optimum range.
- The partition's specification is not current.
- The data file has an attached delete file.
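As a hedged end-to-end illustration (the table name and predicate are made up): a row-level DELETE on an Iceberg v2 table produces position delete files, and a later optimize removes them and rewrites the affected data files.

```sql
-- Illustrative sketch: a row-level DELETE on a v2 table writes position
-- delete files instead of rewriting the underlying data files.
DELETE FROM demo.orders WHERE order_id = 1234;

-- Optimization later drops the position delete files and rewrites the data
-- files they point to (automatic optimization does this on its own schedule).
OPTIMIZE TABLE demo.orders;
```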
Related Topics
- Apache Iceberg – Learn more about the Apache Iceberg table format.
- Load Data Into Tables – Load data from CSV, JSON, or Parquet files into existing Iceberg tables.