
Pillar 5: Operational Excellence

Following a regular schedule of maintenance tasks is key to keeping your Dremio Cloud project operating at peak performance and efficiency. The operational excellence pillar describes the tasks required to maintain an operationally healthy Dremio Cloud project.

Principles

Regularly Evaluate Engine Resources

As workloads grow on your Dremio Cloud project, evaluate engine usage regularly to confirm that your engines are correctly sized and have the right number of replicas.

Regularly Evaluate Query Performance

Regular query performance reviews help you identify issues and mitigate them before they become problems. For example, if you find an unacceptably large number of queries waiting on engine or replica starts, you can adjust the minimum replica, maximum replica, and last replica auto-stop settings. If you see an unacceptable number of query execution failures, you can adjust the per-replica concurrency limits, or revisit the semantic layer and introduce reflections to improve performance.
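As a minimal sketch of such a review, the following Python computes the share of queries that waited on an engine or replica start and the share that failed, then suggests which settings to revisit. The `JobRecord` fields, the `review_jobs` function, and the threshold values are hypothetical and chosen only for illustration; they are not part of any Dremio API, and you would populate the records from whatever job history you export.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical summary of one finished query (illustrative fields only)."""
    waited_on_engine_start: bool
    failed: bool


def review_jobs(jobs, max_wait_ratio=0.05, max_failure_ratio=0.02):
    """Return suggested follow-ups for a review window.

    The ratios are assumed thresholds; tune them to what your
    organization considers acceptable.
    """
    if not jobs:
        return []
    total = len(jobs)
    wait_ratio = sum(j.waited_on_engine_start for j in jobs) / total
    failure_ratio = sum(j.failed for j in jobs) / total

    suggestions = []
    if wait_ratio > max_wait_ratio:
        suggestions.append(
            "review minimum/maximum replicas and last replica auto-stop"
        )
    if failure_ratio > max_failure_ratio:
        suggestions.append(
            "review per-replica concurrency limits or add reflections"
        )
    return suggestions
```

Running a check like this on a fixed schedule, for example weekly per engine, makes trends visible before users notice them.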

Clean Up Tables with Vacuum

When operating on Iceberg tables and using Arctic as a catalog, you can use vacuum to purge older snapshots of cataloged tables, which helps maintain optimal query performance on these tables.
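To make the idea of purging older snapshots concrete, here is a small Python sketch of a typical retention policy: a snapshot is eligible for removal only if it is older than a cutoff and is not among the most recent few. The function name and the `max_age_days` and `retain_last` parameters are assumptions for illustration; this models the policy only and does not show how Dremio implements vacuum.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshot_times, now, max_age_days=7, retain_last=3):
    """Return the snapshot timestamps an assumed retention policy would purge.

    A snapshot is purged only if it is older than the cutoff AND it is
    not one of the `retain_last` most recent snapshots.
    """
    cutoff = now - timedelta(days=max_age_days)
    ordered = sorted(snapshot_times)
    # Always protect the newest snapshots, regardless of age.
    protected = set(ordered[-retain_last:]) if retain_last > 0 else set()
    return [t for t in ordered if t < cutoff and t not in protected]
```

Protecting the newest snapshots regardless of age keeps a table that is rarely written from losing its entire history.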

Optimize Tables

When operating on Iceberg tables and using Arctic as a catalog, you can schedule optimization jobs to help you manage the accumulation of data files that occurs through data manipulation language (DML) operations. Regular maintenance ensures optimal query performance on these tables.
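One way to decide which tables to include in a scheduled optimization job is to count small data files per table. The sketch below is illustrative only: the input dictionary, the `small_file_mb` and `min_small_files` thresholds, and the function name are assumptions. It emits a bare `OPTIMIZE TABLE` statement per candidate; check the Dremio SQL reference for the options that apply to your tables before running anything it generates.

```python
def optimize_statements(file_sizes_mb_by_table, small_file_mb=32, min_small_files=5):
    """Build OPTIMIZE TABLE statements for tables with many small data files.

    `file_sizes_mb_by_table` maps a table name to a list of data file sizes
    in MB (a hypothetical input you would collect yourself).
    """
    statements = []
    for table, sizes in sorted(file_sizes_mb_by_table.items()):
        small_files = sum(1 for size in sizes if size < small_file_mb)
        if small_files >= min_small_files:
            statements.append(f"OPTIMIZE TABLE {table}")
    return statements
```

Tables that receive frequent small DML writes tend to cross the threshold first, so they are usually the ones worth scheduling most often.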

Regularly Monitor Live Metrics for Dremio Cloud

To ensure smooth operations in Dremio Cloud, collect metrics and take action when appropriate. Read Monitoring for more details.

Best Practices

Optimize Workload Management Rules

Because workloads and query volumes change over time, you should periodically reevaluate your engine routing rules and the engines themselves, and adjust engine size, concurrency limits, and replica limits as needed.

Configure Engines

When possible, use separate engines to segregate workloads. Configuring and using multiple engines offers the following benefits:

  • Platform stability: if one engine goes down, it won’t affect other engines.

  • Flexibility to start and stop engines on demand at certain times of day.

  • Engines can be sized differently based on workload patterns.

  • It's possible to separate queries from different tenants into their own engine to enable a chargeback model.

We recommend separate engines for the following types of workloads:

  • Reflection refreshes.

  • Metadata refreshes.

  • API queries.

  • Queries from BI tools.

  • Extract, transform, and load (ETL)-type workloads like CREATE TABLE AS (CTAS) and Iceberg DML.

  • Ad hoc data science queries with long execution times.

In multi-tenant environments, such as those with multiple departments or geographic locations, where chargeback models can be implemented for resource usage, we recommend a separate set of engines per tenant.
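The recommended separation by workload type and by tenant can be expressed as a simple naming scheme. The following Python sketch is hypothetical: the workload keys, the engine names, and the `engine_for` helper are invented for illustration and do not correspond to Dremio routing rule syntax.

```python
# Hypothetical mapping from workload type to a dedicated engine name.
WORKLOAD_ENGINES = {
    "reflection_refresh": "reflections",
    "metadata_refresh": "metadata",
    "api": "api",
    "bi": "bi",
    "etl": "etl",
    "ad_hoc": "adhoc",
}


def engine_for(workload, tenant=None):
    """Return the engine name for a workload, optionally scoped to a tenant.

    Unknown workload types fall back to an assumed "default" engine.
    """
    base = WORKLOAD_ENGINES.get(workload, "default")
    # Prefixing with the tenant gives each tenant its own set of engines,
    # which keeps usage attributable for chargeback.
    return f"{tenant}-{base}" if tenant else base
```

For example, `engine_for("etl", "emea")` yields a tenant-specific engine name, while `engine_for("bi")` yields a shared one.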

Optimize Query Performance

When developing the semantic layer, it is best to create the views in each of the three layers according to best practices without using reflections, then test queries of the application layer views to gauge baseline performance.

For queries that appear to be running sub-optimally, we recommend analyzing the query profile to determine whether any bottlenecks can be removed to improve performance. If performance issues persist, place reflections where they will have the most benefit. A well-architected semantic layer lets you place reflections at strategic locations so that a large volume of queries benefits from the fewest reflections. One such location is the business layer, where a view is constructed by joining several other views.

Rotate Personal Access Tokens

When Dremio personal access tokens (PATs) are used in custom applications, consider scripting an automated periodic refresh to avoid job failures when the PATs expire.
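A scripted refresh usually starts with a check of how close the current token is to expiring. The sketch below shows only that check; the `needs_rotation` helper and the seven-day lead time are assumptions, and the steps that create a new PAT and distribute it to the application are intentionally left out because they depend on your environment.

```python
from datetime import datetime, timedelta, timezone


def needs_rotation(expires_at, now, lead_time=timedelta(days=7)):
    """Return True once `now` is within `lead_time` of the token's expiry.

    Rotating ahead of the expiry leaves room to retry if the refresh
    job itself fails.
    """
    return now >= expires_at - lead_time
```

A scheduler could call this daily with `now=datetime.now(timezone.utc)` and trigger the rotation procedure when it returns `True`.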

Monitor Dremio Cloud Projects

It's important to set up a good monitoring solution to get the most from your investment in Dremio Cloud and to identify and resolve issues related to Dremio Cloud projects before they have a broader impact on workloads. Your monitoring solution should track the overall health and performance of each project.