Version: current [26.x]

Monitoring Dremio Nodes

There are various approaches for operational monitoring of Dremio nodes. This topic discusses both:

Prometheus metrics, which can be leveraged with tools like Grafana to ensure the stability and performance of Dremio deployments.
queries.json, a log file generated by Dremio, which can be used to calculate various service-level agreements (SLAs) related to query performance.

While these two datasets can be used in similar ways, Prometheus metrics are less granular than queries.json—the latter allows you to drill down into which specific kinds of queries or users are experiencing SLA breaches.

Enabling Node Metrics

Dremio enables node monitoring by default. Starting in Dremio 26.0, each node in the cluster exposes Prometheus metrics via the /metrics endpoint on port 9010.

Available Prometheus Metrics

The following table describes the Prometheus metrics provided by Dremio and specifies which Dremio node roles support them:

Metric Name	Description	Main Coordinator	Executor
`jobs_active`	Gauge showing the number of currently active jobs	Yes	No
`jobs_total`	Counter of total jobs submitted, categorized by the type of query	Yes	No
`jobs.failed`	Counter of failed jobs categorized by query types	Yes	No
`jobs.waiting`	Gauge of currently waiting jobs categorized by queue	Yes	No
`dremio.memory.jvm_direct_current`	Total direct memory (in bytes) given to the JVM	Yes	Yes
`memory.heap.committed`	Committed heap memory as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.heap.init`	Initialized heap memory as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.heap.max`	Maximum amount of heap memory that can be allocated as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.heap.usage`	Ratio of used heap memory to max heap memory, as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.heap.used`	Amount of used heap memory, as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.non-heap.committed`	Amount of non-heap memory committed, as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.non-heap.init`	Initialized non-heap memory as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.non-heap.max`	Maximum amount of non-heap memory that can be allocated, as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.non-heap.usage`	Ratio of used non-heap memory to max non-heap memory, as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.non-heap.used`	Amount of used non-heap memory, as described in Class MemoryUsage in the Oracle documentation	Yes	Yes
`memory.total.committed`	Sum of heap and non-heap committed memory (in bytes)	Yes	Yes
`memory.total.init`	Sum of heap and non-heap initialized memory (in bytes)	Yes	Yes
`memory.total.max`	Sum of the heap and non-heap max memory (in bytes)	Yes	Yes
`memory.total.used`	Sum of the heap and non-heap used memory (in bytes)	Yes	Yes
`reflections.failed`	Scheduled Reflections that have failed and won't be retried	Yes	No
`reflections.unknown`	Reflections for which an error occurred in the Reflection manager and that could not be retried	Yes	No
`reflections.active`	Currently active Reflections	Yes	No
`reflections.refreshing`	Reflections that are currently refreshing or pending a refresh	Yes	No
`reflections.manager_sync`	Time taken to run Reflection management	Yes	No
`threads.blocked.count`	Gauge of currently blocked threads	Yes	Yes
`threads.count`	Gauge of active and idle threads	Yes	Yes
`threads.daemon.count`	Number of currently available active daemon threads	Yes	Yes
`threads.deadlock.count`	Number of currently deadlocked threads	Yes	Yes
`threads.new.count`	Current number of threads in new state (not yet started)	Yes	Yes
`threads.runnable.count`	Current number of threads in runnable state (executing)	Yes	Yes
`threads.terminated.count`	Current number of threads in the terminated state (completed execution)	Yes	Yes
`threads.timed_waiting.count`	Current number of threads in the timed_waiting state	Yes	Yes
`threads.waiting.count`	Current number of threads in the waiting state	Yes	Yes
`jvm.gc.overhead.percent`	An approximate percentage of CPU time used by garbage collection activities	Yes	No

Parameters to Monitor for Scaling Capacity

The following parameters, derived from the queries.json, can help identify when additional engines or vertical scaling are needed to maintain performance.

Query Execution Errors

By reviewing the outcomeReason field in queries.json, you can identify resource-related issues and take proactive steps, such as scaling engines or redistributing workloads, to maintain performance and stability.

Error Type (`outcomeReason`)	Recommended Threshold	Action
`OUT_OF_MEMORY`	1% of queries running out of direct memory	Add an engine and move workload
`RESOURCE ERROR`	1% of queries running out of heap memory	Add an engine and move workload
`ExecutionSetupException`	1% of queries exhibiting node disconnects	Add an engine and move workload
`ChannelClosedException (fabric server)`	1% of queries exhibiting node disconnects	Add an engine and move workload
`CONNECTION ERROR: Exceeded timeout`	1% of queries exhibiting node disconnects	Add an engine and move workload

Job State Durations

Use the job state durations (provided in milliseconds) in the queries.json to address SLA breaches.

Job State (`queries.json`)	Recommended Threshold	Action
Total Duration (finish - start)	p90 SLA aligns with your needs	Add all the states below
Pending (pendingTime)	p90 should not exceed 2000 milliseconds	Vertically scale the main coordinator node
Metadata Retrieval (metadataRetrievalTime)	p90 should not exceed 5000 milliseconds	Switch to a table format if the raw data is Parquet
Planning (planningTime)	p90 should not exceed 2000 milliseconds	Vertically scale the main coordinator node
Queued (queuedTime)	p90 should not exceed 2000 milliseconds	Add an engine and move workload
Execution Planning (executionPlanningTime)	p90 should not exceed 2000 milliseconds	Vertically scale the main coordinator node
Starting (startingTime)	p90 should not exceed 2000 milliseconds	Add an engine and move workload
Running (runningTime)	p90 SLA aligns with your needs	Add an engine and move workload

All italicized values can be found in queries.json as represented in the parentheses above.

Enabling Node Metrics​

Available Prometheus Metrics​

Parameters to Monitor for Scaling Capacity​

Query Execution Errors​

Job State Durations​

Enabling Node Metrics

Available Prometheus Metrics

Parameters to Monitor for Scaling Capacity

Query Execution Errors

Job State Durations