Automatic Optimization
Query performance may be impacted as you ingest and make changes to data. For example, small files produced by data ingestion jobs result in slower queries because the query engine needs to read more files.
Arctic can automate optimization tasks for tables within the catalog to optimize query performance and minimize storage costs. Arctic currently supports automation for the following optimization tasks:
- Table optimization, which compacts small files into larger files (see the SQL sketch after this list).
- Table cleanup, which uses the VACUUM CATALOG command to delete expired snapshots and orphaned metadata files for Iceberg tables in Arctic catalogs.
- Ensure you have sufficient Role-Based Access Control (RBAC) privileges to manage Arctic optimization resources, schedules, and jobs.
- For more details on how Dremio optimizes Iceberg tables, see Optimizing Tables.
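As context, both automated tasks correspond to SQL commands that you can also run manually in Dremio. A minimal sketch, assuming a hypothetical catalog named myArcticCatalog and a hypothetical table sales.orders:

```sql
-- Table optimization: compact small files into larger files.
-- Catalog and table names are hypothetical.
OPTIMIZE TABLE myArcticCatalog.sales.orders;

-- Table cleanup: delete expired snapshots and orphaned metadata files
-- for Iceberg tables in the catalog.
VACUUM CATALOG myArcticCatalog;
```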
Managing Optimization Resources
Arctic uses compute and storage resources in your cloud account for automated table maintenance, so data processing and storage stays in your account. Specifically, Arctic uses the following:
- VM instances to perform table maintenance tasks
- An Amazon S3 bucket path or Azure storage container to store log files associated with maintenance jobs
- Data access credentials that allow read and write access to the tables in the Arctic catalog and to the log file location
The default Arctic catalog for your Sonar project is automatically configured to use the Sonar project’s cloud resources for Arctic optimization jobs. You need to configure optimization resources for any other Arctic catalogs you create.
To configure optimization resources for an Arctic catalog, complete the following steps:
1. Go to Arctic Catalogs and select your catalog.

2. Click the settings icon and then select Configuration from the catalog settings sidebar.

3. For Cloud, select the cloud that you want to use to run optimization tasks. To add a new cloud, see Managing Clouds. For information about setting up AWS and Azure clouds, see Configuring Cloud Resources Manually.

To continue the configuration, follow the steps for your cloud provider.
AWS

- For Engine size, select an engine size to be used for optimization tasks. We recommend a small engine size, which minimizes compute costs.
- For Data location, enter the S3 path where you want Arctic to store the metadata files associated with maintenance operations (for example, s3://your-bucket-name/folder).
- For Data access credential, select Access Key or IAM Role. These credentials are used to read and write to tables in the Arctic catalog and to the log file location you specified above.
  - If you select Access Key, provide the AWS access key ID and AWS secret access key.
  - If you select IAM Role:
    - Use the Trust Account ID and External ID values that appear to create an AWS IAM role.
    - Copy the AWS Cross-Account Role ARN and the AWS Instance Profile ARN values from your AWS account and enter them in the respective configuration fields.
- Click Add.
Azure

- For Engine size, select an engine size to be used for optimization tasks. We recommend a small engine size, which minimizes compute costs.
- For Data location, enter the ADLS path where you want Arctic to store the metadata files associated with maintenance operations (for example, adls://your-container-name). Note: the ADLS path can only accommodate a blob container name, not a folder suffix.
- For Tenant ID, paste the ID of the Azure tenant.
- For Account Name, paste the name of the storage account that you want to use for the project store.
- For Client ID, paste the application (client) ID of the registered application you are using for storage access.
- For Client Secret, paste the value of the client secret that you created in the registered application.
- (Optional) If you want to use a tailored set of permissions, copy the policy template to create a custom role for Dremio Cloud to manage storage resources.
- Click Add.
You can also use the Arctic Engine API to configure, update, and retrieve information about compute engines that run optimization tasks.
Managing Table Optimization Schedules
To manage optimization schedules and parameters for a table, complete the following steps:
1. Go to Arctic Catalogs and select your catalog.

2. On the Catalog page, hover over your table and click the settings icon to see the dataset settings.
In the Dataset Settings dialog, you can trigger a one-off optimization job or set up scheduled optimization for your table.
Optimize Once
To trigger a one-off optimization job, complete the following steps:
1. Click Optimize Once.

2. In the Advanced Optimization Settings dialog, set the following fields (see the SQL sketch after these steps):

   - Under Target file size (MB), enter a target size (in MB) for the files that are generated.
   - Under Min. input files, enter the minimum number of qualified files needed to be considered for optimization.
   - Under Min. file size (MB), enter the minimum file size (in MB) to be considered for optimization.
   - Under Max. file size (MB), enter the maximum file size (in MB) to be considered for optimization.

3. Click Optimize.
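For reference, a one-off job with these settings is roughly equivalent to running OPTIMIZE TABLE yourself. A sketch, with a hypothetical table name and illustrative parameter values:

```sql
-- The parameter names mirror the Advanced Optimization Settings fields;
-- the table name and values here are illustrative only.
OPTIMIZE TABLE myArcticCatalog.sales.orders
  REWRITE DATA USING BIN_PACK
  (TARGET_FILE_SIZE_MB = 256,
   MIN_INPUT_FILES = 5,
   MIN_FILE_SIZE_MB = 64,
   MAX_FILE_SIZE_MB = 460);
```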
Optimize Regularly
To set up scheduled optimization for a table, complete the following steps:
1. Under Optimize regularly, click Set schedule.

2. For Every, select hours, day, week, or month as the frequency.

   - If selecting hours, enter a number. For example, you can run an optimization job every 6 hours.
   - If selecting day, enter a time. For example, you can run an optimization job at 18:00 every day.
   - If selecting week, choose the day(s) of the week and enter a time. For example, you can run an optimization job on Wednesday and Saturday at 18:00 every week.
   - If selecting month, select whether you want to run an optimization job on a certain date or day of the month, and then enter a time. For example, you can run an optimization job on the 15th day of every month, or on the 1st Sunday of every month.

3. Under Advanced Configuration, set the following options (see the OPTIMIZE TABLE parameters for information):

   - Under Target file size (MB), enter a target size (in MB) for the files that are generated.
   - Under Min. input files, enter the minimum number of qualified files needed to be considered for optimization.
   - Under Min. file size (MB), enter the minimum file size (in MB) to be considered for optimization.
   - Under Max. file size (MB), enter the maximum file size (in MB) to be considered for optimization.

4. Click Save.
- Optimization schedules and parameters for tables are branch-specific, so the same table in another branch is not affected. You can use this isolation to experiment with different optimization parameters.
- Before an optimization job runs, a check is performed to determine whether new snapshots were added to the Iceberg table since the last successful optimization job was run. If not, the optimization job is canceled to save compute costs.
- You can also use the Arctic Jobs API to run a one-off optimization job and the Schedules API to manage optimization schedules.
Enabling Table Cleanup and Setting the Cutoff Policy
You can enable table cleanup and set the cutoff policy for the Arctic catalog in the Dremio console. Dremio uses the cutoff policy to determine which snapshots and associated files to expire and delete. For example, if you set the cutoff policy to 5 days, Dremio expires and deletes snapshots that are older than 5 days. After you enable table cleanup, Dremio automatically runs the VACUUM CATALOG command for the catalog once every 24 hours, according to the cutoff policy you set.
To enable automatic table cleanup with VACUUM CATALOG and set the cutoff policy in the Dremio console:
1. Go to Arctic Catalogs and click the settings icon for your catalog.

2. Select Table Cleanup from the catalog settings sidebar.

3. Toggle Enable table cleanup to enable the setting.

4. Under Expire and delete all table snapshots older than, enter the number of days that Dremio should use as the cutoff policy for the catalog (minimum 1; maximum 90).

5. Click Save.
If you use the Dremio console to enable table cleanup and set the cutoff policy, the VACUUM CATALOG command runs once each day at midnight in UTC time (00:00 UTC). If you want Dremio to run the command at a different interval and time, use the Arctic Catalog Schedules API to create or modify the schedule.
If you disable table cleanup in the Dremio console or with the Arctic Vacuum API, Dremio does not retain the customized schedule made with the Arctic Catalog Schedules API. If you re-enable table cleanup, you must recreate the customized schedule.
If you change the cutoff policy in the Dremio console after setting a custom table cleanup schedule using the Arctic Catalog Schedules API, Dremio retains the custom schedule.
Cutoff Policy Overrides
You can use override properties to specify cutoff policies for individual tables, overriding catalog-level cutoff policies.
Use these override properties to specify a different cutoff policy for the table:

- history.expire.max-snapshot-age-ms
- history.expire.min-snapshots-to-keep

You can also set the undocumented Iceberg table property nessie.gc.enabled to false to skip removal of expired snapshots and orphaned metadata files for the table.

Table-level overrides with history.expire.max-snapshot-age-ms can lengthen but not shorten the cutoff policy for individual tables. When a table's properties include an override for the cutoff policy, Dremio compares the table-level and catalog-level cutoff policy values and applies the larger value. For example, if the catalog's cutoff policy value is 5 days and you set a table-level override of 10 days for Table1, Dremio retains Table1 snapshots and files for 10 days.
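For example, you could set a table-level override with SQL. A sketch, assuming a hypothetical table name; 10 days is 864,000,000 milliseconds:

```sql
-- Lengthen the cutoff for this table to 10 days (value in milliseconds).
ALTER TABLE myArcticCatalog.sales.orders
  SET TBLPROPERTIES ('history.expire.max-snapshot-age-ms' = '864000000');

-- Or skip snapshot and metadata cleanup for this table entirely.
ALTER TABLE myArcticCatalog.sales.orders
  SET TBLPROPERTIES ('nessie.gc.enabled' = 'false');
```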
With overrides, you can set different cutoff policies for tables on different branches. For example, suppose Table1 at branch main has 11 snapshots, and you set the history.expire.min-snapshots-to-keep table-level property to keep at least 5 snapshots. At the same time, Table1 at branch newBranch also has 11 snapshots, but you set the table-level property to keep 2 snapshots. After the VACUUM CATALOG command runs, Table1 at main will retain 5 snapshots, and Table1 at newBranch will retain 2 snapshots.
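A sketch of those branch-specific overrides, assuming hypothetical catalog and table names and Dremio's USE BRANCH syntax:

```sql
-- On main, keep at least 5 snapshots for Table1.
USE BRANCH main IN myArcticCatalog;
ALTER TABLE myArcticCatalog.Table1
  SET TBLPROPERTIES ('history.expire.min-snapshots-to-keep' = '5');

-- On newBranch, keep at least 2 snapshots for the same table.
USE BRANCH newBranch IN myArcticCatalog;
ALTER TABLE myArcticCatalog.Table1
  SET TBLPROPERTIES ('history.expire.min-snapshots-to-keep' = '2');
```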
Table Cleanup Results and Output
When the VACUUM CATALOG command runs, Dremio removes expired snapshots and orphaned metadata files for tables in the catalog if the following conditions are met:
- The snapshot is older than the number of days specified in the catalog settings or, if you created a custom schedule using the Vacuum API, the defaultCutoffPolicy parameter.
- The snapshot is not in the branch or tag head. Dremio always retains the branch or tag head, even if it was created before the specified cutoff policy.
For example, suppose you set a catalog's cutoff policy to 7 days: Dremio retains 7 days of table history for each table in the catalog. Queries that use commits less than 7 days old succeed unless the query refers to table history from before the cutoff. Users can roll back the catalog to any commit that is less than 7 days old, and should not create tags or branches from commits that are more than 7 days old.
In the Dremio console, if you run the VACUUM CATALOG command manually in the SQL Runner, the output is a table that includes two columns:
| Column Name | Data Type | Description |
| --- | --- | --- |
| deleted_files_count | Integer | The number of files the VACUUM CATALOG command removed. |
| deleted_files_total_size_mb | Integer | The total size, in megabytes, of the files the VACUUM CATALOG command removed. |
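A manual run in the SQL Runner looks like the following sketch; the catalog name is hypothetical:

```sql
-- Returns one row with deleted_files_count and deleted_files_total_size_mb.
VACUUM CATALOG myArcticCatalog;
```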
You can also use the Job API to retrieve the number of files removed by the VACUUM CATALOG command.
After Dremio runs the VACUUM CATALOG command, users cannot roll back to the previous table state for expired commits. Users can still view the commit tree and metadata for expired commits.
Common Table Cleanup Scenarios
This section describes the results of the VACUUM CATALOG command in several common scenarios. In the images that illustrate the examples, each square represents a commit, and the dotted line represents the cutoff policy. Gray commits are older than the cutoff policy, and green commits were created after the cutoff policy. Red snapshot numbers indicate that the snapshots are expired and deleted according to the cutoff policy.
Snapshots Older than the Cutoff Policy
The image below depicts how the VACUUM CATALOG command affects snapshots that are older than the cutoff policy. In this scenario, snapshots S1 and S2 are older than the cutoff policy. After Dremio runs the VACUUM CATALOG command, users cannot query these snapshots; only snapshots that are not expired according to the cutoff policy remain available. Queries on Table 1 at commit C1 or C2 will result in an error, but queries on Table 1 at commit C3 or C4 will run.

Branch Older than the Cutoff Policy
The image below depicts how the VACUUM CATALOG command affects branches and commits that are older than the cutoff policy. In this scenario, branch D1 is older than the cutoff policy. After Dremio runs the VACUUM CATALOG command, the latest snapshot on the table in branch D1 remains: snapshot S2. As a result, users can query snapshot S2 on all commits beyond commit C1.
Tables in commits created after the cutoff policy retain their snapshots until those snapshots expire under the cutoff policy: for commits C3 and C4, snapshots S3 and S4 are retained. Users cannot query snapshot S1 on commit C1, and queries that reference snapshot S1 fail.

Commit Created After the Cutoff Policy
The image below depicts how the VACUUM CATALOG command affects commits created after the cutoff policy. In this scenario, Table 1 was deleted from the main branch but exists on commit D2, which is older than the cutoff policy. After Dremio runs the VACUUM CATALOG command, users can query Table 1 on commit D2. Any orphaned metadata files for Table 1 on the main branch are removed.

Table Cleanup Limitations
The VACUUM CATALOG command has the following limitations:
- It is only supported for Iceberg tables in Arctic catalog sources.
- It does not remove snapshots and files for tables that were deleted before the cutoff policy.
- It does not clean up tables that exist only on a deleted branch or tag in an Arctic catalog.
There are several limitations that affect the outcome of VACUUM CATALOG jobs:
- Dremio does not prevent operations on expired commits, so users can create branches and tags from and roll back to commits that are older than the specified cutoff policy.
- Dremio does not group expiry commits across tables: VACUUM CATALOG creates a single expiry commit for each table.
- Historical commits are immutable. Iceberg metadata queries on historical commits will continue to show all snapshots after the VACUUM CATALOG command runs, including snapshots that VACUUM CATALOG cleaned up. However, SELECT queries will retrieve only live data (see the sketch after this list).
- Arctic commit logs do not distinguish between expired commits and live commits.
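To illustrate the metadata point above, the following sketch queries snapshot metadata with Dremio's table_snapshot function; the table name is hypothetical:

```sql
-- Metadata queries may still list snapshots whose underlying data files
-- were removed by VACUUM CATALOG; SELECTs against the table itself
-- return only live data.
SELECT snapshot_id, committed_at, operation
FROM TABLE(table_snapshot('myArcticCatalog.sales.orders'));
```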
Managing Table Optimization and Table Cleanup Jobs
Job records for both table optimization and table cleanup are stored on the Jobs page. Owners of optimization jobs can see their own job history, and owners of an Arctic catalog can see all optimization jobs for the catalog.
To view optimization job history, complete the following steps:
1. From the Arctic Catalogs page, select the catalog that you want to manage.

2. Click the jobs icon in the side navigation bar.