
Pillar 4: Reliability

The reliability pillar focuses on ensuring your system is up and running and can be quickly and efficiently restored in case of unexpected downtime.

Principles

Set Engine Routing Rules and Engine Settings

Dremio Cloud’s engine routing rules and engine settings protect the system from being overloaded by queries that exceed currently available resources.

Monitor and Measure Platform Activity

To ensure the reliability of your Dremio Cloud project, you must regularly monitor and measure its activity.

Best Practices

Initialize Engine Routing and Engine Settings

It is important to set up engine routing rules and engines with sensible concurrency, replica, and time limits. It is better to spin up additional replicas at sensible concurrency limits than to risk a large number of rogue queries bringing down the engine.
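The trade-off between concurrency limits and replicas can be illustrated with a small capacity model. This is a simplified sketch, not Dremio Cloud's actual scaling logic: `EngineLimits` and `plan_load` are hypothetical names, and the model assumes each replica runs at most a fixed number of queries while the rest wait in a queue.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class EngineLimits:
    """Hypothetical stand-in for an engine's settings; not a Dremio API object."""
    concurrency_per_replica: int  # queries one replica runs at the same time
    max_replicas: int             # upper bound on replicas the engine may start


def plan_load(concurrent_queries: int, limits: EngineLimits) -> dict:
    """Estimate replicas needed and how many queries would wait in the queue."""
    if concurrent_queries < 0:
        raise ValueError("concurrent_queries must be non-negative")
    if limits.concurrency_per_replica < 1 or limits.max_replicas < 1:
        raise ValueError("limits must be at least 1")
    # Replicas needed to run everything at once, capped by the replica limit.
    wanted = math.ceil(concurrent_queries / limits.concurrency_per_replica)
    replicas = min(wanted, limits.max_replicas)
    capacity = replicas * limits.concurrency_per_replica
    running = min(concurrent_queries, capacity)
    return {
        "replicas": replicas,
        "running": running,
        "queued": concurrent_queries - running,
    }


# 25 simultaneous queries against 2 replicas of 10 slots each:
# 20 run and 5 queue instead of all 25 competing for one engine.
print(plan_load(25, EngineLimits(concurrency_per_replica=10, max_replicas=2)))
```

In this model, excess queries queue rather than overload a replica, which is the behavior the limits are meant to enforce.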

Use the Monitor Page in the Dremio Cloud Console

As an administrator using the Dremio Cloud console, you can effectively monitor catalog usage and jobs within your projects. The Monitor page provides detailed visualizations and metrics that allow you to track usage patterns, resource consumption, and user impact.

In the Catalog Usage tab, you can view the 10 most-queried datasets and source folders, along with relevant statistics such as linked jobs and acceleration usage. The Catalog Usage tab excludes system tables and INFORMATION_SCHEMA datasets and focuses solely on user queries.

In the Jobs tab, you can access comprehensive metrics on job performance, including daily job counts, failure rates, and user activity. Visualizations include graphs of completed and failed jobs, job states, and the 10 longest-running jobs, providing an overview of job execution and performance trends.
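If you also want to track these metrics outside the console, the same kind of summary can be computed from exported job records. The sketch below is illustrative only: it assumes the records are already available as dictionaries, and the field names `day`, `state`, and `duration_s` and the `"FAILED"` state value are hypothetical, not a Dremio schema.

```python
from collections import defaultdict
from datetime import date


def summarize_jobs(jobs, top_n=10):
    """Return per-day totals and failure rates, plus the top_n longest-running jobs."""
    per_day = defaultdict(lambda: {"total": 0, "failed": 0})
    for job in jobs:
        bucket = per_day[job["day"]]
        bucket["total"] += 1
        if job["state"] == "FAILED":
            bucket["failed"] += 1
    # Add the failure rate per day; every bucket holds at least one job.
    daily = {
        day: {**counts, "failure_rate": counts["failed"] / counts["total"]}
        for day, counts in sorted(per_day.items())
    }
    longest = sorted(jobs, key=lambda j: j["duration_s"], reverse=True)[:top_n]
    return daily, longest


# Hypothetical sample records, for illustration only.
jobs = [
    {"id": "a", "day": date(2024, 1, 1), "state": "COMPLETED", "duration_s": 12.0},
    {"id": "b", "day": date(2024, 1, 1), "state": "FAILED", "duration_s": 3.5},
    {"id": "c", "day": date(2024, 1, 2), "state": "COMPLETED", "duration_s": 95.0},
]
daily, longest = summarize_jobs(jobs, top_n=2)
```

Comparing the daily failure rate and the longest-running jobs from week to week makes regressions easier to spot than a single snapshot does.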

We recommend that administrators frequently review the Monitor page, including daily consumption patterns and the weekly and monthly aggregates. Insights such as the most-queried datasets over time can help administrators optimize performance and adapt their reflection strategy, while the jobs-per-engine distribution can guide workload management and resource allocation.

Perform Impact Analysis if Security Rules Change

Dremio Cloud’s control plane interacts with your own virtual private clouds for query execution. If you change your security rules after they are initially set and working correctly with Dremio Cloud, perform an impact analysis to make sure that your connectivity with Dremio Cloud remains unaffected.
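One small part of such an impact analysis is a before-and-after connectivity check run from inside the affected network. The following is a minimal sketch under stated assumptions: `check_endpoints` is a hypothetical helper, and the host and port in the usage example are placeholders for whatever endpoints your deployment actually needs to reach. A successful TCP connection shows only that the network path is open, not that the service behaves correctly.

```python
import socket


def check_endpoints(endpoints, timeout=3.0, connect=socket.create_connection):
    """Try a TCP connection to each (host, port) pair and report the outcome."""
    results = {}
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
        except OSError as exc:
            # Covers refused connections, timeouts, and DNS resolution failures.
            results[(host, port)] = f"unreachable: {exc}"
        else:
            conn.close()
            results[(host, port)] = "ok"
    return results


if __name__ == "__main__":
    # Placeholder endpoint; replace with the hosts and ports your setup requires.
    required = [("control-plane.example.com", 443)]
    for endpoint, status in check_endpoints(required).items():
        print(endpoint, status)
```

Running the check before and after a rule change and comparing the two results shows which endpoints, if any, the change has cut off.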