OPTIMIZE TABLE
Rewrites data and manifest files to provide peak performance.
For more information, see Optimization.
To run OPTIMIZE TABLE, you must meet one of the following requirements:
- Membership in the ADMIN role
- Grants of SELECT and WRITE privileges on the table
- Table ownership
OPTIMIZE TABLE <table_name>
[ FOR PARTITIONS <predicate> ]
FOR PARTITIONS is available but planned for deprecation. Use automatic table maintenance for ongoing optimization instead.
Parameters
<table_name> String
The path of the Iceberg table that you want to optimize.
FOR PARTITIONS <predicate> Optional
Selects specific partitions for optimization. Consider using this filter when:
- Selected partitions are queried more often or are of higher importance.
- Selected partitions are more active and are constantly being updated.
- You want to avoid optimizing the entire table.
The <predicate> can be any combination of logical and mathematical conditions. However, the supported predicate syntax differs depending on whether the table uses identity or non-identity partitions.
Identity Partitions
For tables using identity partitions (partitioned directly on a column, such as PARTITION BY name), you can use equality conditions and expressions:
column_1=1: Optimizes all data files in the partition wherecolumn_1equals 1.column_2 LIKE 'a%': Optimizes all data files in the partition wherecolumn_2values start with the letter "a".date_column >= '2024-01-01': Optimizes thedate_columnpartition with dates on or after January 1, 2024.
Non-Identity Partitions
For tables using non-identity partitions (partitioned using a transformation on a column, such as PARTITION BY (MONTH(ts))), only equality conditions on the source column are supported. Expressions that reference the partition transformation directly are not allowed and will result in an error.
Supported equality conditions:
| Partition Definition | Supported Predicate | Description |
|---|---|---|
PARTITION BY (MONTH(ts)) | ts >= TIMESTAMP '2025-06-01' | Filters based on the source timestamp column |
PARTITION BY (MONTH(ts)) | ts = TIMESTAMP '2025-06-15' | Exact match on source column |
PARTITION BY (TRUNCATE(1, name)) | name = 'Alice' | Equality on the source column |
PARTITION BY (YEAR(date_col)) | date_col > '2024-01-01' | Range condition on source column |
Unsupported expressions (will error):
| Partition Definition | Unsupported Predicate | Why It Fails |
|---|---|---|
PARTITION BY (MONTH(ts)) | MONTH(ts) = 6 | Cannot use transformation function in predicate |
PARTITION BY (TRUNCATE(1, name)) | name LIKE 'A%' | LIKE expressions not supported for non-identity partitions |
PARTITION BY (YEAR(date_col)) | YEAR(date_col) = 2020 | Cannot reference partition transformation directly |
PARTITION BY (TRUNCATE(10, amount)) | amount / 10 = 5 | Arithmetic expressions not supported |
If the specified column is not partitioned, Dremio will display an error message. If a relevant partition cannot be found, Dremio will not optimize any files. When no partition filters are specified, all data files will be candidates for optimization.
OPTIMIZE Output
Each OPTIMIZE run returns a result set describing what was accomplished:
| Output Field | Description |
|---|---|
| Files rewritten | Number of data files compacted in this run |
| Bytes processed | Total data volume rewritten |
| Table optimization state | Current health/state of the table |
| Additional runs needed | Indicates whether more OPTIMIZE runs are required |
When Additional runs needed is true, re-run OPTIMIZE TABLE <table_name> until the output confirms the table is fully optimized. For more information on incremental optimization behavior, see Optimization in the Apache Iceberg reference.
Optimize Clustered Data Tables
For clustered tables, OPTIMIZE TABLE incrementally reorders data to achieve the optimal data layout and manages file sizes. When working with clustered data tables, you may need to run multiple OPTIMIZE TABLE operations to achieve optimal data layout.
- Start with
OPTIMIZE TABLEto initiate reclustering of all records. - Use
SELECT * FROM TABLE(clustering_information('table_name'))to check the resulting clustering depth. - Since Dremio optimization works incrementally to reduce the current clustering depth toward the target (default: 3), you may need to run
OPTIMIZE TABLEmultiple times, checking theclustering_information()function after each run to monitor progress. This mechanism may take longer to run on newly loaded or unsorted tables. - Continue this process until the clustering depth falls below your target threshold or until successive runs show minimal improvement.
Configuring Compaction
To optimize physical file management and compaction, OPTIMIZE TABLE combines small files with their closest cluster and rewrites large files into appropriate sizes while maintaining proper data organization according to the clustering keys.
Limitations
- You can run only one
OPTIMIZE TABLEquery at a time per table partition. - Sort ordering within partitions is not supported.
Examples
Optimize a tableOPTIMIZE TABLE demo.example_table
OPTIMIZE TABLE demo.example_table
FOR PARTITIONS state='NEBRASKA'