Version: current [26.x]

Monitoring Dremio Nodes

There are various approaches for operational monitoring of Dremio nodes. This topic covers two of them:

  • Prometheus metrics, which can be leveraged with tools like Grafana to ensure the stability and performance of Dremio deployments.

  • queries.json, a log file generated by Dremio, which can be used to calculate various service-level agreements (SLAs) related to query performance.

While these two datasets can be used in similar ways, Prometheus metrics are less granular than queries.json, which lets you drill down into which specific kinds of queries or users are experiencing SLA breaches.

Enabling Node Metrics

Dremio enables node monitoring by default. Starting in Dremio 26.0, each node in the cluster exposes Prometheus metrics via the /metrics endpoint on port 9010.
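As a quick check that a node is exposing metrics, the endpoint can be read with a few lines of Python. The following is a minimal sketch, not an official client: the function names `parse_metrics` and `fetch_metrics` are hypothetical, the host name is a placeholder, and the parser assumes the standard Prometheus text exposition format without trailing timestamps.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value follows the last space.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser cannot read
    return samples


def fetch_metrics(host, port=9010, timeout=5):
    """Fetch and parse /metrics from one node (port 9010 per the text above)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("dremio-node.example.com")` against a reachable node would return a dictionary keyed by series name, including any labels. For continuous monitoring, a Prometheus server scraping the same endpoint is the more usual setup.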

Available Prometheus Metrics

The following table describes the Prometheus metrics provided by Dremio and specifies which Dremio node roles support them:

| Metric Name | Description | Main Coordinator | Scale-out Coordinator | Executor |
| --- | --- | --- | --- | --- |
| jobs_active | Gauge showing the number of currently active jobs | Yes | Yes | No |
| jobs_total | Counter of total jobs submitted, categorized by the type of query | Yes | Yes | No |
| jobs.failed | Counter of failed jobs categorized by query types | Yes | No | No |
| jobs.waiting | Gauge of currently waiting jobs categorized by queue | Yes | No | No |
| dremio.memory.jvm_direct_current | Total direct memory (in bytes) given to the JVM | Yes | Yes | Yes |
| memory.heap.committed | Committed heap memory as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.heap.init | Initialized heap memory as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.heap.max | Maximum amount of heap memory that can be allocated as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.heap.usage | Ratio of used heap memory to max heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.heap.used | Amount of used heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.non-heap.committed | Amount of non-heap memory committed, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.non-heap.init | Initialized non-heap memory as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.non-heap.max | Maximum amount of non-heap memory that can be allocated, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.non-heap.usage | Ratio of used non-heap memory to max non-heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.non-heap.used | Amount of used non-heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
| memory.total.committed | Sum of heap and non-heap committed memory (in bytes) | Yes | Yes | Yes |
| memory.total.init | Sum of heap and non-heap initialized memory (in bytes) | Yes | Yes | Yes |
| memory.total.max | Sum of the heap and non-heap max memory (in bytes) | Yes | Yes | Yes |
| memory.total.used | Sum of the heap and non-heap used memory (in bytes) | Yes | Yes | Yes |
| reflections.failed | Scheduled Reflections that have failed and won't be retried | Yes | No | No |
| reflections.unknown | Reflections for which an error occurred in the Reflection manager and that could not be retried | Yes | No | No |
| reflections.active | Currently active Reflections | Yes | No | No |
| reflections.refreshing | Reflections that are currently refreshing or pending a refresh | Yes | No | No |
| reflections.manager_sync | Time taken to run Reflection management | Yes | Yes | No |
| threads.blocked.count | Gauge of currently blocked threads | Yes | Yes | Yes |
| threads.count | Gauge of active and idle threads | Yes | Yes | Yes |
| threads.daemon.count | Number of currently available active daemon threads | Yes | Yes | Yes |
| threads.deadlock.count | Number of currently deadlocked threads | Yes | Yes | Yes |
| threads.new.count | Current number of threads in new state (not yet started) | Yes | Yes | Yes |
| threads.runnable.count | Current number of threads in runnable state (executing) | Yes | Yes | Yes |
| threads.terminated.count | Current number of threads in the terminated state (completed execution) | Yes | Yes | Yes |
| threads.timed_waiting.count | Current number of threads in the timed_waiting state | Yes | Yes | Yes |
| threads.waiting.count | Current number of threads in the waiting state | Yes | Yes | Yes |
| jvm.gc.overhead.percent | An approximate percentage of CPU time used by garbage collection activities | Yes | Yes | No |

Parameters to Monitor for Scaling Capacity

The following parameters, derived from queries.json, can help identify when additional engines or vertical scaling are needed to maintain performance.

Query Execution Errors

By reviewing the outcomeReason field in queries.json, you can identify resource-related issues and take proactive steps, such as scaling engines or redistributing workloads, to maintain performance and stability.

| Error Type (outcomeReason) | Recommended Threshold | Action |
| --- | --- | --- |
| OUT_OF_MEMORY | 1% of queries running out of direct memory | Add an engine and move workload |
| RESOURCE ERROR | 1% of queries running out of heap memory | Add an engine and move workload |
| ExecutionSetupException | 1% of queries exhibiting node disconnects | Add an engine and move workload |
| ChannelClosedException (fabric server) | 1% of queries exhibiting node disconnects | Add an engine and move workload |
| CONNECTION ERROR: Exceeded timeout | 1% of queries exhibiting node disconnects | Add an engine and move workload |
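The 1% thresholds above can be checked with a short script. This is a sketch under two assumptions that should be verified against your own logs: that queries.json holds one JSON object per line, and that each error type can be detected as a substring of the outcomeReason field. The names `error_rates` and `breaches` are hypothetical helpers, not part of Dremio.

```python
import json
from collections import Counter

# Substrings taken from the error types listed above (matching rule is assumed).
ERROR_MARKERS = [
    "OUT_OF_MEMORY",
    "RESOURCE ERROR",
    "ExecutionSetupException",
    "ChannelClosedException",
    "CONNECTION ERROR: Exceeded timeout",
]


def error_rates(lines, markers=ERROR_MARKERS):
    """Return {marker: fraction of queries whose outcomeReason contains it}."""
    total = 0
    hits = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # assumes one JSON object per line
        total += 1
        reason = record.get("outcomeReason") or ""
        for marker in markers:
            if marker in reason:
                hits[marker] += 1
    if total == 0:
        return {marker: 0.0 for marker in markers}
    return {marker: hits[marker] / total for marker in markers}


def breaches(rates, threshold=0.01):
    """Return the markers whose rate reaches the threshold (1% by default)."""
    return sorted(m for m, r in rates.items() if r >= threshold)
```

A typical use would be `breaches(error_rates(open("queries.json")))`, run over a window of recent log data so that old incidents do not mask or inflate the current rate.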

Job State Durations

Use the job state durations in queries.json (reported in milliseconds) to address SLA breaches.

| Job State (queries.json) | Recommended Threshold | Action |
| --- | --- | --- |
| Total Duration (finish - start) | p90 SLA aligns with your needs | Add all the states below |
| Pending (pendingTime) | p90 should not exceed 2000 milliseconds | Vertically scale the main coordinator node |
| Metadata Retrieval (metadataRetrievalTime) | p90 should not exceed 5000 milliseconds | Switch to a table format if the raw data is Parquet |
| Planning (planningTime) | p90 should not exceed 2000 milliseconds | Vertically scale the main coordinator node |
| Queued (queuedTime) | p90 should not exceed 2000 milliseconds | Add an engine and move workload |
| Execution Planning (executionPlanningTime) | p90 should not exceed 2000 milliseconds | Vertically scale the main coordinator node |
| Starting (startingTime) | p90 should not exceed 2000 milliseconds | Add an engine and move workload |
| Running (runningTime) | p90 SLA aligns with your needs | Add an engine and move workload |

The values shown in parentheses above are the field names as they appear in queries.json.
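The p90 thresholds for the job states can be evaluated along the following lines. This is a sketch that assumes queries.json holds one JSON object per line and that the duration fields are numeric milliseconds. It uses the nearest-rank percentile definition, which is one of several common conventions; the helper names are hypothetical.

```python
import json
import math

# p90 limits in milliseconds, copied from the job state thresholds above.
P90_LIMITS_MS = {
    "pendingTime": 2000,
    "metadataRetrievalTime": 5000,
    "planningTime": 2000,
    "queuedTime": 2000,
    "executionPlanningTime": 2000,
    "startingTime": 2000,
}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p90_by_state(lines, fields=P90_LIMITS_MS):
    """Return {field: p90} for every field that appears in the records."""
    columns = {field: [] for field in fields}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # assumes one JSON object per line
        for field in fields:
            value = record.get(field)
            if isinstance(value, (int, float)):
                columns[field].append(value)
    return {f: percentile(v, 90) for f, v in columns.items() if v}


def over_limit(p90s, limits=P90_LIMITS_MS):
    """Return the states whose p90 exceeds its recommended limit."""
    return {f: v for f, v in p90s.items() if v > limits[f]}
```

Total duration and runningTime are left out of `P90_LIMITS_MS` because their limits depend on your own SLA; they can be added to a copy of the dictionary and passed in as `fields` and `limits`.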