
Automatic Optimization

Query performance may be impacted as you ingest and make changes to data. For example, small files produced by data ingestion jobs result in slower queries because the query engine needs to read more files.

Arctic can automate optimization tasks for tables within the catalog to optimize query performance and minimize storage costs. Arctic currently supports automation for the following optimization tasks:

  • Table optimization, which compacts small files into larger files.
  • Table cleanup, which uses the VACUUM CATALOG command to delete expired snapshots and orphaned metadata files for Iceberg tables in Arctic catalogs.

Managing Optimization Resources

Arctic uses compute and storage resources in your cloud account for automated table maintenance, so data processing and storage stays in your account. Specifically, Arctic uses the following:

  • VM instances to perform table maintenance tasks
  • An Amazon S3 bucket path or Azure storage container to store log files associated with maintenance jobs
  • Appropriate data access credentials that will allow read/write access to the tables in the Arctic catalog and to the log file location
note

The default Arctic catalog for your Sonar project is automatically configured to use the Sonar project’s cloud resources for Arctic optimization jobs. You need to configure optimization resources for any other Arctic catalogs you create.

To configure optimization resources for an Arctic catalog, complete the following steps:

  1. Go to Arctic Catalogs and select your catalog.

  2. Click the Catalog Settings icon and then select Configuration from the catalog settings sidebar.

  3. For Cloud, select the cloud that you want to use to run optimization tasks. To add a new cloud, see Managing Clouds. For information about setting up AWS clouds, see Configuring Cloud Resources Manually.

To continue the configuration, follow the steps for your cloud provider.

  1. For Engine size, select an engine size to be used for optimization tasks. We recommend a small engine size, which minimizes compute costs.

  2. For Data location, enter the S3 path where you want Arctic to store the log files associated with maintenance operations (e.g., s3://your-bucket-name/folder).

  3. For Data access credential, select Access Key or IAM Role. These credentials will be used to read and write to tables in the Arctic catalog and to the log file location you specified above.

    1. If you select Access Key, provide the AWS access key ID and AWS secret access key.

    2. If you select IAM Role, you will need to:

      1. Use the Trust Account ID and External ID values that appear to create an AWS IAM role (a sketch of this step follows the numbered steps below).

      2. Copy the AWS Cross-Account Role ARN and the AWS Instance Profile ARN values from your AWS account and enter them in the respective configuration fields.

  4. Click Add.
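
If you prefer to script the IAM role step, the sketch below shows one way to create such a role with boto3. The role name is hypothetical, the account ID and external ID are placeholders for the values shown in the dialog, and you would still need to attach a policy granting read/write access to your tables and log location, plus an instance profile for the Instance Profile ARN field.

```python
import json
import boto3

# Placeholders -- substitute the Trust Account ID and External ID values
# shown in the Dremio console when you select IAM Role.
TRUST_ACCOUNT_ID = "123456789012"
EXTERNAL_ID = "my-external-id"

# Trust policy that lets the Dremio trust account assume this role,
# scoped by the external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{TRUST_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="arctic-optimization-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
print(role["Role"]["Arn"])  # enter this as the AWS Cross-Account Role ARN
```
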
note

You can also use the Arctic Engine API to configure, update, and retrieve information about compute engines that run optimization tasks.
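
For illustration, a call to that API with Python's requests library might look like the sketch below. The base URL, endpoint path, and payload field names are assumptions made for this example; consult the Arctic Engine API reference for the actual contract.

```python
import requests

BASE_URL = "https://api.dremio.cloud"      # assumed Dremio Cloud API host
CATALOG_ID = "<arctic-catalog-id>"         # placeholder
TOKEN = "<personal-access-token>"          # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

# Hypothetical payload mirroring the fields in the Configuration dialog.
engine_config = {
    "engineSize": "SMALL",
    "logStorage": "s3://your-bucket-name/folder",
    "credentials": {"type": "IAM_ROLE", "roleArn": "<cross-account-role-arn>"},
}

# The endpoint path is an assumption; see the Arctic Engine API docs for the real route.
resp = requests.put(
    f"{BASE_URL}/v0/arctic/catalogs/{CATALOG_ID}/engine",
    json=engine_config,
    headers=headers,
)
resp.raise_for_status()
print(resp.json())
```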

Managing Table Optimization Schedules

To manage optimization schedules and parameters for a table, complete the following steps:

  1. Go to Arctic Catalogs and select your catalog.

  2. On the Catalog page, hover over your table and click the Settings icon to see the dataset settings.

In the Dataset Settings dialog, you can trigger a one-off optimization job or set up scheduled optimization for your table.

Optimize Once

To trigger a one-off optimization job, complete the following steps:

  1. Click Optimize Once.

  2. In the Advanced Optimization Settings dialog, set the following fields (these correspond to the OPTIMIZE TABLE parameters; see the sketch after these steps):

    • Under Target file size (MB), enter the target size (in MB) for the files that are generated.

    • Under Min. input files, enter the minimum number of qualified files needed to be considered for optimization.

    • Under Min. file size (MB), enter the minimum file size (in MB) to be considered for optimization.

    • Under Max. file size (MB), enter the maximum file size (in MB) to be considered for optimization.

  3. Click Optimize.
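
These dialog fields correspond to the OPTIMIZE TABLE parameters. As a rough illustration only, the sketch below shows the same settings expressed as a SQL statement; the table path and values are placeholders, and the exact syntax is documented in the OPTIMIZE TABLE reference.

```python
# Advanced Optimization Settings fields and the OPTIMIZE TABLE parameters
# they map to (values here are placeholders, not recommendations).
settings = {
    "TARGET_FILE_SIZE_MB": 256,   # Target file size (MB)
    "MIN_INPUT_FILES": 5,         # Min. input files
    "MIN_FILE_SIZE_MB": 75,       # Min. file size (MB)
    "MAX_FILE_SIZE_MB": 460,      # Max. file size (MB)
}

params = ", ".join(f"{name} = {value}" for name, value in settings.items())
# "myfolder.my_table" is a placeholder path to a table in the Arctic catalog.
optimize_sql = f"OPTIMIZE TABLE myfolder.my_table REWRITE DATA USING BIN_PACK ({params})"
print(optimize_sql)
```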

Optimize Regularly

To set up scheduled optimization for a table, complete the following steps:

  1. Under Optimize regularly, click Set schedule.

  2. For Every, select hours, day, week, or month as the frequency.

    • If selecting hours, enter a number. For example, you can run an optimization job every 6 hours.

    • If selecting day, enter a time. For example, you can run an optimization job at 18:00 every day.

    • If selecting week, choose the day(s) of the week and enter a time. For example, you can run an optimization job on Wednesday and Saturday at 18:00 every week.

    • If selecting month, select whether you want to run an optimization job on a certain date or day of the month, and then enter a time. For example, you can run an optimization job on the 15th day of every month, or you can run the job on the 1st Sunday of every month.

  3. Under Advanced Configuration, set the following options (see the OPTIMIZE TABLE parameters for information):

    • Under Target file size (MB), enter the target size (in MB) for the files that are generated.

    • Under Min. input files, enter the minimum number of qualified files needed to be considered for optimization.

    • Under Min. file size (MB), enter the minimum file size (in MB) to be considered for optimization.

    • Under Max. file size (MB), enter the maximum file size (in MB) to be considered for optimization.

  4. Click Save.

note
  • Optimization schedules and parameters for tables are branch-specific, so the same table in another branch is not affected. You can use this isolation to experiment with different optimization parameters.
  • Before an optimization job runs, a check is performed to determine whether new snapshots were added to the Iceberg table since the last successful optimization job was run. If not, the optimization job is canceled to save compute costs.
  • You can also use the Arctic Jobs API and Schedules API to run a one-off optimization job and manage optimization schedules, respectively (a schedule sketch follows this note).
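
For example, a recurring schedule such as the "Wednesday and Saturday at 18:00" case above might be created with a request like the following sketch. The base URL, endpoint path, cron format, and payload field names are assumptions for illustration; consult the Schedules API reference for the actual contract.

```python
import requests

BASE_URL = "https://api.dremio.cloud"   # assumed Dremio Cloud API host
CATALOG_ID = "<arctic-catalog-id>"      # placeholder
TOKEN = "<personal-access-token>"       # placeholder

# Hypothetical schedule: optimize folder.my_table on the main branch
# every Wednesday and Saturday at 18:00 UTC.
schedule = {
    "type": "OPTIMIZE",
    "tableId": "folder.my_table",
    "reference": "main",
    "schedule": "0 18 * * WED,SAT",
}

resp = requests.post(
    f"{BASE_URL}/v0/arctic/catalogs/{CATALOG_ID}/schedules",  # assumed route
    json=schedule,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```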

Enabling Table Cleanup and Setting the Cutoff Policy

You can enable table cleanup and set the cutoff policy for the Arctic catalog in the Dremio console. Dremio uses the cutoff policy to determine which snapshots and associated files to expire and delete. For example, if you set the cutoff policy to 5 days, Dremio expires and deletes snapshots that are older than 5 days. After you enable table cleanup, Dremio automatically runs the VACUUM CATALOG command for the Arctic catalog once every 24 hours, according to the cutoff policy you set.

To enable automatic table cleanup with VACUUM CATALOG and set the cutoff policy in the Dremio console:

  1. Go to Arctic Catalogs and click the Settings icon for your catalog.

  2. Select Table Cleanup from the catalog settings sidebar.

  3. Toggle Enable table cleanup to enable the setting.

  4. Under Expire and delete all table snapshots older than, enter the number of days that Dremio should use as the cutoff policy for the catalog (minimum 1; maximum 90).

  5. Click Save.

If you use the Dremio console to enable table cleanup and set the cutoff policy, the VACUUM CATALOG command runs once each day at midnight in UTC time (00:00 UTC). If you want Dremio to run the command at a different interval and time, use the Arctic Catalog Schedules API to create or modify the schedule.
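
As a rough sketch of such a change, the request below moves the daily cleanup run to 03:00 UTC. The base URL, endpoint path, cron format, and field names are assumptions for illustration; consult the Arctic Catalog Schedules API reference for the actual contract.

```python
import requests

BASE_URL = "https://api.dremio.cloud"   # assumed Dremio Cloud API host
CATALOG_ID = "<arctic-catalog-id>"      # placeholder
TOKEN = "<personal-access-token>"       # placeholder

# Hypothetical payload: run the catalog-wide cleanup daily at 03:00 UTC
# instead of the default 00:00 UTC.
schedule = {"type": "VACUUM", "schedule": "0 3 * * *"}

resp = requests.put(
    f"{BASE_URL}/v0/arctic/catalogs/{CATALOG_ID}/schedules/vacuum",  # assumed route
    json=schedule,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```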

note

If you disable table cleanup in the Dremio console or with the Arctic Vacuum API, Dremio does not retain the customized schedule made with the Arctic Catalog Schedules API. If you re-enable table cleanup, you must recreate the customized schedule.

If you change the cutoff policy in the Dremio console after setting a custom table cleanup schedule using the Arctic Catalog Schedules API, Dremio retains the custom schedule.

Managing Table Optimization and Table Cleanup Jobs

Optimization job records for both table optimization and table cleanup jobs are stored on the Jobs page. The owners of optimization jobs can see their own job history. Owners of an Arctic catalog can see all optimization jobs for the catalog.

To view optimization job history, complete the following steps:

  1. From the Arctic Catalogs page, select the catalog that you want to manage.

  2. Click the Jobs icon in the side navigation bar.

Viewing Jobs and Job Details

All jobs run in Arctic are listed on a separate page, showing the target, status, user, and other attributes. To navigate to the Jobs page, click the Jobs icon in the side navigation bar.

Search Filter and Columns

By default, the Jobs page lists the optimization jobs run within the last 30 days. You can filter on values and manage columns directly on the Jobs page, as shown in this image:

Managing optimization jobs on the Jobs page.
  1. Filter is a free-form text search bar that enables you to search jobs by table name, reference name (branch, tag, or commit), username, and job ID.

  2. Status represents one or more job states. For descriptions, see Job Statuses.

  3. Type includes Optimize (table optimization) and Vacuum (table cleanup).

  4. User can be searched by typing the username or checking the box next to the username in the dropdown.

  5. List of optimization jobs:

    • Target lists the tables that are being optimized or have been optimized in the catalog.

      note

      You can see the reference by hovering over the table name.

    • Status identifies the status of the optimization job. For a description of the available statuses, see Job Statuses.

    • User shows the user's email address.

    • Job Type identifies the type of optimization job: Optimize (table optimization) or Vacuum (table cleanup).

    • Engine Size identifies the size of the engine that was used to run the optimization job.

    • Start Time filters the jobs by the specified timeframe.

    • Duration shows how long the optimization job ran for.

    • Job ID shows the unique ID of the optimization job.

You can also use the Arctic optimization APIs to retrieve job status. For more information, see the Jobs API.
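
For illustration, a status lookup for a single job might look like the sketch below. The base URL, endpoint path, and response field name are assumptions made for this example; check the Jobs API reference for the actual contract.

```python
import requests

BASE_URL = "https://api.dremio.cloud"   # assumed Dremio Cloud API host
CATALOG_ID = "<arctic-catalog-id>"      # placeholder
JOB_ID = "<job-id>"                     # from the Jobs page or an earlier API call
TOKEN = "<personal-access-token>"       # placeholder

resp = requests.get(
    f"{BASE_URL}/v0/arctic/catalogs/{CATALOG_ID}/jobs/{JOB_ID}",  # assumed route
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json().get("state"))  # e.g. RUNNING, COMPLETED, FAILED (assumed field name)
```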

Job Statuses

Each optimization job passes through a sequence of states until it is complete, though the sequence can be interrupted if a query is canceled or if an error occurs during a state. The statuses that the UI lets you filter on map to those states as follows:

  • Setup: The optimization job is in the process of being set up.
  • Queued: The optimization job is queued.
  • Starting: The optimization job is starting up.
  • Running: The optimization job is running.
  • Completed: A terminal state indicating that the optimization job completed successfully. Note: You can see the results of the completed job by hovering over the status.
  • Cancelled: A terminal state indicating that the optimization job was canceled, either by a user or by an intervention in the system.
  • Failed: A terminal state indicating that the optimization job failed due to an error. Note: You can see the error message for the failed job by hovering over the status.

Canceling a Job

You can cancel optimization jobs from either the Jobs page or the Dataset Settings dialog.

To cancel a job from the Jobs page, complete the following steps:

  1. From the Arctic Catalogs page, select the catalog that you want to manage.

  2. Click the Jobs icon in the side navigation bar.

  3. Locate the optimization job and, under the Status column, click the Cancel icon.

To cancel a job from the Dataset Settings dialog, complete the following steps:

  1. From the Arctic Catalogs page, select the catalog that you want to manage.

  2. From the Catalog page, hover over the table and click the Settings icon to see the dataset settings.

  3. In the Dataset Settings dialog, click Cancel Job.

Troubleshooting Table Optimization Jobs

This section can help you troubleshoot issues that you might encounter when Optimize-type jobs do not run as expected. The Jobs page shows all the optimization jobs that have been run in a catalog, including the jobs that failed to run.

Issue: The Iceberg table cannot access the metadata location.

Solution: Verify that the data access credentials are valid and have access to the table metadata location.

Issue: The compute resources and/or data access credentials that you configured are no longer valid.

Solution: Check the configuration settings and update the compute resources and/or data access credentials. For instructions, see Managing Optimization Resources.

Issue: The optimization job has failed due to an internal error.

Solution: You can try running the job again. If the problem persists, contact Dremio Support.

Troubleshooting Table Cleanup Jobs

To investigate and resolve issues related to table cleanup jobs, add the Arctic catalog as a source so that you can run the VACUUM CATALOG command manually. Then, refer to the troubleshooting instructions for manual vacuum jobs to interpret the results from the VACUUM CATALOG command and resolve the problem.
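
As a sketch of that manual run, the example below assumes the catalog was added as a source named my_arctic_catalog and that the statement is submitted through the Dremio Cloud SQL endpoint; the host, route, project ID, and source name are placeholders and assumptions.

```python
import requests

BASE_URL = "https://api.dremio.cloud"   # assumed Dremio Cloud API host
PROJECT_ID = "<sonar-project-id>"       # placeholder
TOKEN = "<personal-access-token>"       # placeholder

# "my_arctic_catalog" is a placeholder for the source name you gave the catalog.
resp = requests.post(
    f"{BASE_URL}/v0/projects/{PROJECT_ID}/sql",  # assumed SQL submission route
    json={"sql": "VACUUM CATALOG my_arctic_catalog"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # typically returns a job id you can inspect for results or errors
```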