OPTIMIZE TABLE
Rewrites data and manifest files to provide peak performance. For more information, see Optimize Tables.
To run OPTIMIZE TABLE, you must meet one of the following requirements:
- Membership in the
ADMINrole - Grants of
SELECTandUPDATEprivileges on the table - Table ownership
OPTIMIZE TABLE <table_name>
[ REWRITE DATA [ USING BIN_PACK ]
[ ( { TARGET_FILE_SIZE_MB | MIN_FILE_SIZE_MB | MAX_FILE_SIZE_MB | MIN_INPUT_FILES } = <value> [, ... ] ) ]
[ FOR PARTITIONS <predicate> ] ]
[ REWRITE_MANIFESTS ]
Parameters
<table_name> String
The path of the Iceberg table that you want to optimize.
REWRITE DATA [ USING BIN_PACK ] Optional
For partitioned tables, this parameter compacts smaller files or splits larger ones to the optimal file size using the bin-pack strategy. This parameter is used by default to optimize a table, but you can optionally include it in your query to specify additional options to control the file sizes in your selected table. Additionally, since optimization uses only the bin-pack strategy, including the USING BIN_PACK clause is optional. Dremio also optimizes Iceberg v2 tables containing position delete files.
[ ( { TARGET_FILE_SIZE_MB | MIN_FILE_SIZE_MB | MAX_FILE_SIZE_MB | MIN_INPUT_FILES } = <value> [, ... ] ) ] Number Optional
The options available with REWRITE DATA USING BIN_PACK:
| Property Name | Default Value | Description |
|---|---|---|
TARGET_FILE_SIZE_MB | write.target-file-size-bytes target table property, or 512 MB as defined by Apache Iceberg if the table property is not defined. | Controls the target size of the files that are generated. |
MIN_FILE_SIZE_MB | 0.24x TARGET_FILE_SIZE_MB (123 MB at the default TARGET_FILE_SIZE_MB) | Files smaller than MIN_FILE_SIZE_MB qualify for compaction. |
MAX_FILE_SIZE_MB | 1.8x TARGET_FILE_SIZE_MB (922 MB at the default TARGET_FILE_SIZE_MB) | Files larger than MAX_FILE_SIZE_MB qualify for compaction. |
MIN_INPUT_FILES | 5 | The minimum number of qualified files for a compaction operation. When a table is partitioned, MIN_INPUT_FILES specifies the minimum number of qualified files that must be present per partition. |
The command analyzes file groups or partitions in parallel across available engine executors to identify optimization candidates and then performs optimization when at least MIN_INPUT_FILES candidates are present.
FOR PARTITIONS <predicate> Optional
Selects specific partitions for optimizing data files when using REWRITE DATA. Consider using this filter when:
- Selected partitions are queried more often or are of higher importance.
- Selected partitions are more active and are constantly being updated.
- You want to avoid optimizing the entire table.
The predicate can be any combination of logical and mathematical conditions.
column_1=1: Optimizes all data files in the partition wherecolumn_1equals1.column_2 like 'a%': Optimizes all data files in the partition wherecolumn_2values start with the lettera.date_column >= '2024-01-01': Optimizes thedate_columnpartition with dates on or after January 1, 2024.
If the specified column is not partitioned, Dremio will display an error message. If a relevant partition cannot be found, Dremio will not optimize any files. When no partition filters are specified, all data files will be candidates for optimization. The command does not include sorting within partitions.
REWRITE_MANIFESTS Optional
Optimizes the size of Iceberg manifest files when optimizing clustered or partitioned data. When this clause is used, the command must not contain the REWRITE DATA clause.
Dremio's manifest file sizing:
- Default target size:
commit.manifest.target-size-bytesor 8 MB if the table property is not present. - Optimal range: 0.75x to 1.8x the target size
- Files smaller than the minimum range: Will be compacted together
- Files larger than the maximum range: Will be split into smaller files
Optimizing Clustered Data Tables
When working with clustered data tables, you may need to run multiple OPTIMIZE TABLE operations to achieve optimal data layout.
- Start with
OPTIMIZE TABLEto initiate reclustering of all records. - Use
SELECT * FROM TABLE(clustering_information('table_name'))to check the resulting clustering depth. - Since Dremio optimization works incrementally to reduce the current clustering depth toward the target (default: 3), you may need to run
OPTIMIZE TABLEmultiple times, checking theclustering_information()function after each run to monitor progress. - Continue this process until the clustering depth falls below your target threshold or until successive runs show minimal improvement.
Data Table Parameters
To optimize the logical data layout of the table, OPTIMIZE TABLE uses the dremio.clustering table properties from the target table. The default values can be overridden by support keys of the same names.
Configuring Compaction
To optimize physical file management and compaction, OPTIMIZE TABLE also uses the dremio.iceberg.auto.clustering.compact.files.enabled support key to control file compaction. When true (the default), small files are combined with their closest cluster, and large files are rewritten into appropriate sizes while maintaining proper data organization according to the clustering keys.
Examples
Rewrite data files and manifests to Dremio's optimal sizeOPTIMIZE TABLE demo.example_table
OPTIMIZE TABLE demo.example_table
REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB=512, MIN_INPUT_FILES=10)
OPTIMIZE TABLE demo.example_table
REWRITE DATA USING BIN_PACK (MIN_INPUT_FILES=10)
FOR PARTITIONS state='NEBRASKA'
OPTIMIZE TABLE demo.example_table
REWRITE_MANIFESTS