Version: current [26.x]

Monitoring Dremio

To get the most from Dremio and to catch Dremio-related issues before they affect workloads more broadly, deploy a monitoring solution that tracks overall cluster health and performance. Dremio provides a large set of metrics that such a solution can use alongside infrastructure-related metrics such as CPU and memory.

Dremio can be deployed in a variety of ways, so how you collect and report infrastructure-related metrics depends on your deployment model (physical or virtual host, Kubernetes, or cloud). You can collect the metrics that matter for monitoring via JMX, JDBC or ODBC, or the REST API. You can monitor Dremio directly, with open-source tools such as Prometheus and Grafana, or with commercial tools such as AppDynamics and Datadog. Align the monitoring solution with your existing monitoring infrastructure.

JMX Metrics

Java Management Extensions (JMX) is a specification for monitoring and managing Java applications. Because Dremio is a Java application, it uses JMX to expose a number of important metrics.

note

For information about modifying dremio.conf to enable JMX metrics, see Enabling Node Metrics.

This configuration exposes JMX metrics at the http://<dremio host>:<port>/metrics URL. The following table lists important metrics, the Dremio node roles that support each one, and alert thresholds where applicable.

| Metric Name | Description | Main Coordinator | Secondary Coordinator | Executor | Alert Threshold (if any) |
|---|---|---|---|---|---|
| jobs.active | Currently active jobs | Yes | Yes | No | |
| jobs.active_15m | Number of jobs in a 15-minute period | Yes | No | No | |
| jobs.failed_15m | Number of failed jobs in a 15-minute period | Yes | No | No | WARN: 5% of total jobs; CRITICAL: 10% of total jobs |
| jobs.queue.<queue_name>.waiting | Number of jobs currently waiting | Yes | No | No | WARN: > 0; CRITICAL: > 50% of allowed concurrency for the queue |
| dremio.memory.direct_current | Direct memory used by the execution engine | Yes | Yes | Yes | WARN: 90% of allocated value; CRITICAL: 95% of allocated value |
| fragments.active | Number of active query fragments (threads). Indicator of how starved Dremio is for CPU (monitor on executors) | Yes | Yes | Yes | WARN: 0.9 x number of CPU cores on executors for 5 min; CRITICAL: 1 x number of CPU cores on executors for 5 min |
| gc.G1-Young-Generation.time / gc.G1-Young-Generation.count; gc.G1-Old-Generation.time / gc.G1-Old-Generation.count | Time per GC event (for young and old generations), assuming the default G1GC is used for garbage collection | Yes | Yes | Yes | WARN: > 10s; CRITICAL: > 1m |
| memory.heap.usage | Ratio of memory.heap.used to memory.heap.max | Yes | Yes | Yes | WARN: > 75%; CRITICAL: > 80%. Monitor on the coordinator. The coordinator’s JVM monitor automatically kills queries when heap utilization is > 85%. |
| reflections.failed | Currently failed data Reflections | Yes | No | No | WARN: > 0; CRITICAL: > 10% of all Reflections |
| reflections.active | Currently active data Reflections | Yes | No | No | |
| reflections.refreshing | Data Reflections currently refreshing or pending a refresh | Yes | No | No | |
| rpc.failure_15m | RPC connection failures in a 15-minute period | Yes | Yes | Yes | WARN: seen on ~10% of available executors; CRITICAL: seen on > 25% of available executors |
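
As an illustration of how a few of these thresholds might be evaluated once the metric values have been collected, here is a minimal Python sketch. The function names are hypothetical and are not part of Dremio. It assumes that `memory.heap.usage` is reported as a ratio between 0 and 1, that the GC time metrics are in milliseconds, and that the direct-memory and fragment thresholds are meant as "at or above". It does not model the "for 5 min" duration attached to the fragment thresholds.

```python
OK, WARN, CRITICAL = "OK", "WARN", "CRITICAL"


def classify_above(value, warn_limit, critical_limit):
    """Alert when the value is strictly greater than a limit."""
    if value > critical_limit:
        return CRITICAL
    if value > warn_limit:
        return WARN
    return OK


def classify_at_or_above(value, warn_limit, critical_limit):
    """Alert when the value reaches or exceeds a limit."""
    if value >= critical_limit:
        return CRITICAL
    if value >= warn_limit:
        return WARN
    return OK


def heap_usage_alert(heap_usage_ratio):
    # Table thresholds: WARN > 75%, CRITICAL > 80%.
    # Assumption: the metric is a ratio in the range 0.0 to 1.0.
    return classify_above(heap_usage_ratio, 0.75, 0.80)


def direct_memory_alert(direct_current, direct_allocated):
    # Table thresholds: WARN at 90% and CRITICAL at 95% of the allocated value.
    # Assumption: both arguments use the same unit (for example, bytes).
    return classify_at_or_above(direct_current / direct_allocated, 0.90, 0.95)


def fragments_alert(active_fragments, executor_cpu_cores):
    # Table thresholds: WARN at 0.9 x cores, CRITICAL at 1 x cores.
    # The "for 5 min" condition is not modeled here.
    return classify_at_or_above(
        active_fragments, 0.9 * executor_cpu_cores, 1.0 * executor_cpu_cores
    )


def gc_time_per_event_alert(gc_time_ms, gc_count):
    # Time per GC event = time / count. Table thresholds: WARN > 10s, CRITICAL > 1m.
    # Assumption: the time metric is reported in milliseconds.
    if gc_count == 0:
        return OK  # no collections recorded yet, so there is nothing to divide
    return classify_above(gc_time_ms / gc_count, 10_000, 60_000)
```

A monitoring job could call these helpers with values read from the metrics endpoint and forward any result other than `OK` to the existing alerting system.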

API Metrics

The Dremio coordinator exposes the REST API on the web UI port, which is 9047 by default. For more information about how to connect, authenticate, and submit requests, see API Reference.

| API endpoint | Description | Expected Values | Alert Threshold (if any) |
|---|---|---|---|
| `GET /apiv2/server_status` | Coordinator Status | “OK” | Anything other than “OK”, or a slow response, indicates that the coordinator is either down or unhealthy |
| `GET /api/v3/source` | Source Status | “good” | Any status other than “good” |
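
The two checks in this table could be scripted along the following lines. This is a sketch under stated assumptions, not a definitive client: the helper names are hypothetical, the 5-second cutoff for a "slow" response is an arbitrary choice, and the JSON shape read by `unhealthy_sources` (a top-level `data` list whose items carry `name` and `state.status`) is an assumption that should be checked against the API Reference. Authentication headers are passed in by the caller because their format is described in the API Reference rather than here.

```python
import json
import time
import urllib.request


def timed_get(url, headers=None, timeout=10.0):
    """GET a URL and return (body_text, elapsed_seconds)."""
    request = urllib.request.Request(url, headers=headers or {})
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return body, time.monotonic() - started


def coordinator_is_healthy(body, elapsed_seconds, slow_after_seconds=5.0):
    """True only when the body is OK and the response was not slow."""
    # Accept the status with or without surrounding JSON quotes.
    status = body.strip().strip('"')
    return status == "OK" and elapsed_seconds <= slow_after_seconds


def unhealthy_sources(body):
    """Return {source name: status} for each source whose status is not "good"."""
    payload = json.loads(body)  # assumed shape: {"data": [{"name": ..., "state": {"status": ...}}]}
    problems = {}
    for source in payload.get("data", []):
        status = (source.get("state") or {}).get("status")
        if status != "good":
            problems[source.get("name", "<unnamed>")] = status
    return problems
```

For example, passing the result of `timed_get("http://<dremio host>:9047/apiv2/server_status")` to `coordinator_is_healthy` yields a single boolean that a scheduler or alerting tool can act on.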

SQL Metrics

SQL commands can be executed using ODBC, JDBC, or REST interfaces.

note

When executing SQL with the REST API, the API call does not return query results. The API returns the ID of the submitted query.

| SQL | Description | Alert Threshold (if any) |
|---|---|---|
| `SELECT COUNT(*) FROM sys.memory` | Canary query. Can be executed against a small user dataset or a Dremio internal table. This query should return results in milliseconds. It is an indicator of overall cluster health. | WARN: > 30s; CRITICAL: > 1m |
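
One way to turn the canary query into an alert is to time it and compare the elapsed time with the thresholds above. The sketch below is deliberately independent of any driver: `run_query` is a hypothetical callable supplied by the caller (for example, a thin wrapper around an ODBC or JDBC connection) that executes the SQL and blocks until results are available. Treating a failed query as `CRITICAL` is a design choice of this sketch, not something the table specifies.

```python
import time

CANARY_SQL = "SELECT COUNT(*) FROM sys.memory"


def canary_alert(elapsed_seconds):
    """Map query latency to the thresholds WARN > 30s and CRITICAL > 1m."""
    if elapsed_seconds > 60:
        return "CRITICAL"
    if elapsed_seconds > 30:
        return "WARN"
    return "OK"


def check_canary(run_query, sql=CANARY_SQL):
    """Run the canary query and return (status, elapsed_seconds)."""
    started = time.monotonic()
    try:
        run_query(sql)  # hypothetical callable: executes SQL and waits for results
    except Exception:
        # Assumption: a query that fails outright is at least as bad as a slow one.
        return "CRITICAL", time.monotonic() - started
    elapsed = time.monotonic() - started
    return canary_alert(elapsed), elapsed
```

If the query is submitted through the REST API, `run_query` would also need to wait for the submitted job to finish, because the submission call only returns the ID of the query.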

POSIX Metrics

These metrics cover conditions outside Dremio that affect its stability and performance.

| Command | Description | Alert Threshold (if any) |
|---|---|---|
| `df -h <directory where RocksDB is mounted>` | Catalog DB free space. Catalog DB must have free space to create Reflections, update profiles, and run jobs. | WARN: > 80% used; CRITICAL: > 90% used |
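
The same disk check can be automated instead of reading `df` output by hand. The following sketch uses Python's standard `shutil.disk_usage`; the function names are hypothetical, and the path passed to `check_catalog_disk` is whatever directory the catalog database is mounted on in your deployment. The percentage is computed as used bytes over total bytes, which can differ slightly from the `Use%` column that `df` prints on file systems with reserved blocks.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes == 0:
        return 0.0  # avoid dividing by zero on pseudo file systems
    return 100.0 * used_bytes / total_bytes


def disk_alert(percent):
    """Map disk usage to the thresholds WARN > 80% and CRITICAL > 90%."""
    if percent > 90:
        return "CRITICAL"
    if percent > 80:
        return "WARN"
    return "OK"


def check_catalog_disk(path):
    """Return (status, percent_used) for the file system containing path."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    percent = percent_used(usage.total, usage.used)
    return disk_alert(percent), percent
```

Calling `check_catalog_disk` on a schedule and alerting on any status other than `OK` gives early warning before the catalog database runs out of space.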
