Monitoring Dremio
To get the most from Dremio and to catch Dremio-related issues before they affect broader workloads, deploy a monitoring solution that tracks overall cluster health and performance. Dremio exposes a large set of metrics that a monitoring solution can use alongside infrastructure metrics such as CPU and memory.
Dremio can be deployed in a variety of ways, so the collection and reporting of infrastructure related metrics depends on your deployment model (physical or virtual host, Kubernetes, or cloud). You can collect metrics that are important for monitoring via JMX, JDBC or ODBC, or the REST API. You can monitor Dremio directly, with open-source tools such as Prometheus/Grafana, or with commercially available tools such as AppDynamics, Datadog, etc. It is important to align the monitoring solution with existing monitoring infrastructure.
JMX Metrics
Java Management Extensions (JMX) is a specification for monitoring and managing Java applications. Because Dremio is a Java application, it exposes a number of important metrics through JMX.
For information about modifying dremio.conf to enable JMX metrics, see Enabling Node Metrics.
This configuration exposes JMX metrics at the URL http://<dremio host>:<port>/metrics. The following table lists important metrics, the Dremio node roles that support each one, and alert thresholds.
Metric Name | Description | Master Coordinator | Secondary Coordinator | Executor | Alert Threshold (if any) |
---|---|---|---|---|---|
jobs.active | Currently active jobs | Yes | Yes | No | |
jobs.active_15m | Number of jobs in 15-minute period | Yes | No | No | |
jobs.failed_15m | Number of failed jobs in 15-minute period | Yes | No | No | WARN: 5% of total jobs CRITICAL: 10% of total jobs |
jobs.queue.<queue_name>.waiting | Number of current waiting jobs | Yes | No | No | WARN: > 0 CRITICAL: > 50% of allowed concurrency for the queue |
dremio.memory.direct_current | Direct memory used by the execution engine | Yes | Yes | Yes | WARN: 90% of allocated value CRITICAL: 95% of allocated value |
fragments.active | Number of active query fragments (threads). Indicates how starved Dremio is for CPU; monitor on executors | Yes | Yes | Yes | WARN: 0.9 x number of CPU cores on executors for 5 min CRITICAL: 1 x number of CPU cores on executors for 5 min |
gc.G1-Young-Generation.time/gc.G1-Young-Generation.count gc.G1-Old-Generation.time/gc.G1-Old-Generation.count | Time per GC event (for young and old generations), assuming the default G1GC garbage collector is used | Yes | Yes | Yes | WARN: > 10s CRITICAL: > 1m |
memory.heap.usage | Ratio of memory.heap.used to memory.heap.max | Yes | Yes | Yes | WARN: > 75% CRITICAL: > 80% Monitor on the coordinator. The coordinator’s JVM monitor automatically kills queries when heap utilization is > 85%. |
reflections.failed | Currently failed data reflections | Yes | No | No | WARN: > 0 CRITICAL: > 10% of all reflections |
reflections.active | Currently active data reflections | Yes | No | No | |
reflections.refreshing | Data reflections currently refreshing or pending a refresh | Yes | No | No | |
rpc.failure_15m | RPC connection failures in 15-minute period | Yes | Yes | Yes | WARN: seen on ~10% of available executors CRITICAL: seen on >25% of available executors |
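As an illustration of how the thresholds in the table might be applied, the following Python sketch classifies two of the metrics after their values have been collected. The function names are hypothetical. It assumes that jobs.active_15m stands in for "total jobs" in the same window and that the percentage thresholds for failed jobs are inclusive; adjust both to match your alerting policy.

```python
def heap_usage_level(heap_usage: float) -> str:
    """Classify memory.heap.usage, a ratio of used heap to max heap (0.0 to 1.0)."""
    if heap_usage > 0.80:   # table: CRITICAL above 80%
        return "CRITICAL"
    if heap_usage > 0.75:   # table: WARN above 75%
        return "WARN"
    return "OK"


def failed_jobs_level(failed_15m: int, total_15m: int) -> str:
    """Classify jobs.failed_15m as a share of the jobs seen in the same window.

    Assumption: total_15m is taken from jobs.active_15m, and the 5% and 10%
    limits are treated as inclusive.
    """
    if total_15m <= 0:
        return "OK"  # no jobs in the window, so there is nothing to alert on
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    if failed_15m * 100 >= 10 * total_15m:
        return "CRITICAL"
    if failed_15m * 100 >= 5 * total_15m:
        return "WARN"
    return "OK"
```

Keeping the classification separate from metric collection means the same rules can be reused regardless of whether the values come from the metrics URL or from an existing monitoring agent.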
API Metrics
The Dremio coordinator exposes the REST API on the web UI port, which is 9047 by default. For more information about how to connect, authenticate, and submit requests, see API Reference.
API endpoint | Description | Expected Values | Alert Threshold (if any) |
---|---|---|---|
GET /apiv2/server_status | Coordinator Status | “OK” | Anything other than “OK”, or a slow response, indicates that the coordinator is either down or unhealthy |
GET /api/v3/source | Source Status | “good” | Any status other than “good” |
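A coordinator health probe along these lines could be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names, the 5-second "slow" limit, and the choice to report a slow response as WARN are assumptions, as is the handling of a quoted or unquoted "OK" body. If your deployment requires authentication for this endpoint, add the appropriate header as described in the API Reference.

```python
import time
import urllib.request


def classify_server_status(body: str, elapsed_s: float, slow_after_s: float = 5.0) -> str:
    """Map a server_status response to an alert level.

    Assumption: slow_after_s is a hypothetical limit for a "slow" response.
    """
    status = body.strip().strip('"')  # tolerate a JSON-quoted or bare value
    if status != "OK":
        return "CRITICAL"
    if elapsed_s > slow_after_s:
        return "WARN"
    return "OK"


def check_coordinator(base_url: str, timeout_s: float = 10.0) -> str:
    """Call GET /apiv2/server_status on the coordinator and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"{base_url}/apiv2/server_status", timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8")
    except OSError:
        # Covers connection errors, HTTP errors, and timeouts raised by urllib.
        return "CRITICAL"
    return classify_server_status(body, time.monotonic() - start)
```

For example, `check_coordinator("http://<dremio host>:9047")` would probe a coordinator on the default web UI port.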
SQL Metrics
SQL commands can be executed using ODBC, JDBC, or REST interfaces.
When you execute SQL through the REST API, the call does not return query results; it returns the ID of the submitted query.
|SQL|Description|Alert Threshold (if any)|
|---|---|---|
|SELECT COUNT(*) FROM sys.memory|Canary query. Can be executed against a small user dataset or a Dremio internal table. This query should return results in milliseconds. It is an indicator of overall cluster health.|WARN: > 30s CRITICAL: > 1m|
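The canary check can be sketched as a small timing harness in Python. The sketch does not assume a particular client: `run_query` is a hypothetical callable that you supply, which submits the canary SQL over ODBC, JDBC, or REST and blocks until the query has finished. Treating a failed query as CRITICAL is also an assumption.

```python
import time
from typing import Callable, Tuple


def time_canary(
    run_query: Callable[[], object],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, float]:
    """Run the canary query and classify its duration.

    Returns (level, elapsed_seconds). Thresholds follow the table:
    WARN above 30 seconds, CRITICAL above 1 minute.
    """
    start = clock()
    try:
        run_query()  # must block until the query has completed
    except Exception:
        # Assumption: a canary that fails outright is treated as CRITICAL.
        return "CRITICAL", clock() - start
    elapsed = clock() - start
    if elapsed > 60.0:
        return "CRITICAL", elapsed
    if elapsed > 30.0:
        return "WARN", elapsed
    return "OK", elapsed
```

Because the REST API returns only the ID of the submitted query, a REST-based `run_query` would need to wait for that job to complete before returning; otherwise the measured time reflects only the submission.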
POSIX Metrics
These metrics cover conditions outside of Dremio that affect its stability and performance.
|Command|Description|Alert Threshold (if any)|
|---|---|---|
|df -h <directory where RocksDB is mounted>|Catalog DB free space. Catalog DB must have free space to create reflections, update profiles, and run jobs.|WARN: > 80% used CRITICAL: > 90% used|
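Where running df from a script is inconvenient, the same check can be approximated in Python with the standard library. This is a sketch under stated assumptions: the function names are hypothetical, and the path you pass in should be the directory where the catalog database is mounted.

```python
import shutil


def used_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use.

    Note: this is used / total, so it can differ slightly from the Use% column
    of df, which excludes blocks reserved for the superuser.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no capacity
    return 100.0 * usage.used / usage.total


def disk_level(pct_used: float) -> str:
    """Classify disk usage with the table's thresholds."""
    if pct_used > 90.0:   # CRITICAL above 90% used
        return "CRITICAL"
    if pct_used > 80.0:   # WARN above 80% used
        return "WARN"
    return "OK"
```

A scheduled job could then call `disk_level(used_percent(path))` and forward any non-OK result to the existing alerting system.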