Pillar 5 - Operational Excellence
Following a regular schedule of maintenance tasks is key to keeping your Dremio cluster operating at peak performance and efficiency. This pillar provides details about the tasks that should be periodically completed to maintain an operationally healthy Dremio cluster.
Principles
Regularly Evaluate Cluster Resources
As workloads expand and grow in your Dremio cluster, it is important to evaluate the usage of your cluster resources.
Regularly Evaluate Query Performance
Regular query performance reviews keep the highest-cost queries from consuming a disproportionate share of your cluster resources.
Automate Promotion of Catalog Objects from Lower Environments
Promotion of objects from lower environments should be automated with Dremio’s REST APIs.
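As an illustration, promoting a view to a higher environment can be scripted against the Catalog REST API. This is a hedged sketch: the coordinator URL is hypothetical, the token is obtained out of band (e.g., from a login call or a personal access token), and the /api/v3/catalog endpoint and VIRTUAL_DATASET body shape follow the documented Catalog API but should be verified against your Dremio version:

```python
import json
from urllib import request  # stdlib; the 'requests' library would work equally well

BASE_URL = "https://dremio.example.com"  # hypothetical target-environment coordinator


def make_view_entity(path, sql):
    """Build the Catalog API body for creating a view (virtual dataset)."""
    return {
        "entityType": "dataset",
        "type": "VIRTUAL_DATASET",
        "path": path,  # e.g. ["prod_space", "business", "sales_summary"]
        "sql": sql,
    }


def promote_view(token, path, sql):
    """POST the view definition to the target environment's catalog."""
    body = json.dumps(make_view_entity(path, sql)).encode()
    req = request.Request(
        f"{BASE_URL}/api/v3/catalog",
        data=body,
        headers={
            "Authorization": f"_dremio{token}",  # token format for Dremio Software auth
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A CI/CD pipeline can read view definitions from source control and call `promote_view` for each object, making promotion repeatable and auditable.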
Regularly Monitor Dremio Live Metrics
To ensure smooth operation of Dremio, collect live metrics and take action when they indicate a problem.
Best Practices
Optimize Workload Management Rules
Because workloads and query volumes change over time, the query-cost thresholds in your Workload Management queue-routing rules should be re-evaluated and adjusted periodically to rebalance the proportion of queries that flows to each cost-based queue.
You can use statistical quantile analysis of query history data to determine the query-cost thresholds: in a two-queue setup, the threshold between the low-cost and high-cost user query queues; in a three-queue setup, the two thresholds between the low-cost and medium-cost queues and between the medium-cost and high-cost queues.
Ideally, in a two-queue setup, you want to see approximately an 80%/20% split of queries hitting the low/high cost user query queues. In a three-queue setup, you want to see approximately a 75%/15%/10% split of queries hitting the low/medium/high cost user query queues.
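The quantile analysis above can be sketched in Python. The cost values below are synthetic; in practice you would pull them from your query history (for example, from queries.json or the sys.jobs_recent table):

```python
# Synthetic query costs; replace with costs harvested from your query history.
query_costs = [
    1_200, 3_500, 9_800, 15_000, 42_000,
    87_000, 150_000, 600_000, 2_400_000, 9_100_000,
]


def cost_threshold(costs, fraction):
    """Return the cost value below which roughly `fraction` of queries fall."""
    ordered = sorted(costs)
    index = max(0, int(len(ordered) * fraction) - 1)
    return ordered[index]


# Two-queue setup: ~80% of queries should route to the low-cost queue.
low_high_split = cost_threshold(query_costs, 0.80)

# Three-queue setup: 75% low, then 15% medium (cumulative 90%), then 10% high.
low_medium_split = cost_threshold(query_costs, 0.75)
medium_high_split = cost_threshold(query_costs, 0.90)
```

The resulting values become the `query_cost()` thresholds in your queue-routing rules; re-run the analysis periodically as workloads drift.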
Configure Engines
Where possible, leverage engines to isolate workloads. The configuration and use of engines offers several benefits:
- Platform stability: if one engine goes down, it does not affect the other engines
- Flexibility: engines can be started and stopped on demand at certain times of day
- Right-sizing: engines can be sized differently based on workload patterns
- Tenant isolation: queries from different tenants can run on their own engines, enabling a chargeback model
We recommend separate engines for the following types of workload:
- Reflection refreshes
- Metadata refreshes
- Low-cost queries
- High-cost queries
- ETL workloads (CTAS and Iceberg DML)
In multi-tenant environments (e.g., multiple departments or geographies), where chargeback models can be implemented for resource usage, we recommend having a low-cost and a high-cost query engine per tenant.
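To direct these workloads to their engines, Workload Management routing rules can test attributes such as `query_type()`, `query_cost()`, and `is_member()`. The arrow notation below is illustrative pseudo-notation only (rules are configured per queue in the Dremio UI), and the group names and cost values are hypothetical:

```
query_type() IN ('Reflections')                      ->  Reflection Refresh queue/engine
query_type() IN ('Metadata Refresh')                 ->  Metadata Refresh queue/engine
is_member('finance') AND query_cost() <  30000000    ->  Finance Low Cost queue/engine
is_member('finance') AND query_cost() >= 30000000    ->  Finance High Cost queue/engine
```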
Optimize Query Performance
When developing the semantic layer, it is best to create the views in each of the three layers according to best practices without the use of reflections, then test querying the application layer views to establish baseline performance.
For queries that appear to be running sub-optimally, we recommend first analyzing the query profile to determine whether any bottlenecks can be removed to improve performance. If performance issues persist, apply reflections where they will have the most benefit. A well-architected semantic layer allows reflections to be placed at strategic locations (e.g., in the business layer, where a view is constructed by joining several other views) so that the fewest reflections can benefit the largest volume of queries.
Configure Persistent Logging in Kubernetes Environments
By default, Dremio logs in Kubernetes go to STDOUT, and those logs are available for only a limited time because of the ephemeral nature of pods. To retain logs for longer, we recommend persisting them to disk, which requires only minor edits to the Dremio Helm charts.
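One possible shape for those Helm edits is sketched below. The `extraVolumes`/`extraVolumeMounts` hooks follow the dremio-cloud-tools chart, but the PVC name, mount path, and environment-variable mechanism are assumptions to verify against your chart and Dremio version:

```yaml
coordinator:
  extraVolumes:
    - name: dremio-logs
      persistentVolumeClaim:
        claimName: dremio-coordinator-logs   # hypothetical pre-created PVC
  extraVolumeMounts:
    - name: dremio-logs
      mountPath: /opt/dremio/data/logs       # example log directory
```

With the volume in place, configure Dremio to write logs to that directory instead of the console (e.g., via `DREMIO_LOG_DIR` and `DREMIO_LOG_TO_CONSOLE` in dremio-env). Apply the same pattern to executor pods if you need their logs retained as well.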
Monitor Dremio via JMX metrics
To maximize your investment in Dremio and to proactively identify and resolve issues before they have a broader impact on workloads, it is important to set up a monitoring solution that tracks overall cluster health and performance.
Dremio exposes a large set of cluster and infrastructure metrics via its JMX interface that can be utilized in a monitoring solution. Additional Dremio metrics can be collected via API and JDBC/ODBC.
It is important to align the monitoring solution with existing monitoring infrastructure available in your organization. This might mean leveraging open-source tools such as Prometheus or Grafana, or some commercially available tools such as AppDynamics or Datadog.
See Monitoring Dremio to understand more about options for monitoring a Dremio installation.
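As a starting point, remote JMX access can be enabled with standard JVM flags passed through the `DREMIO_JAVA_SERVER_EXTRA_OPTS` hook in dremio-env. The port is an example, and the settings below disable authentication and SSL for brevity only; secure the endpoint in production:

```shell
# dremio-env (illustrative; port and security settings are examples only)
DREMIO_JAVA_SERVER_EXTRA_OPTS="\
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```

A Prometheus JMX exporter or a commercial agent can then scrape the exposed MBeans into your existing monitoring stack.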
Monitor Dremio via sys.jobs_recent and Monitor Page
Periodically view the Monitor page of Dremio. The Catalog Usage tab on the Monitor page provides detailed visualizations and metrics that allow you to track the performance of user queries in the cluster, usage patterns, resource consumption, and user impact. The Monitor page shows the volume of queries processed by hour or day and can help you identify bottlenecks.
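For ad hoc analysis beyond the Monitor page, the sys.jobs_recent table can be queried directly with SQL. A minimal sketch, assuming `status` and `query_type` columns (verify column names for your Dremio version, e.g., with `SELECT * FROM sys.jobs_recent LIMIT 1`):

```sql
-- Recent job volume by outcome and query type
SELECT status,
       query_type,
       COUNT(*) AS job_count
FROM   sys.jobs_recent
GROUP  BY status, query_type
ORDER  BY job_count DESC
```

Queries like this can be scheduled from an external tool over JDBC/ODBC to feed dashboards or alerting on failure rates and queue pressure.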