Monitoring Dremio Nodes
There are two main approaches to operational monitoring of Dremio nodes, and this topic discusses both:

- Prometheus metrics, which can be used with tools like Grafana to ensure the stability and performance of Dremio deployments.
- queries.json, a log file generated by Dremio, which can be used to calculate various service-level agreements (SLAs) related to query performance.

While these two datasets can be used in similar ways, Prometheus metrics are less granular than queries.json: the latter lets you drill down into which specific kinds of queries or users are experiencing SLA breaches.
Enabling Node Metrics
Dremio enables node monitoring by default. Starting in Dremio 26.0, each node in the cluster exposes Prometheus metrics via the /metrics endpoint on port 9010.
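As a quick check that a node is exposing metrics, you can fetch and parse the endpoint yourself. The sketch below is illustrative: the hostname is a placeholder, and the parser handles only the common case of the Prometheus text exposition format (it ignores labels and timestamps).

```python
# Minimal sketch: parse the Prometheus text exposition format exposed on
# port 9010 into a {metric_name: value} dict. Labels are dropped; lines
# with timestamps or exotic label values are not fully handled.
from urllib.request import urlopen

def parse_metrics(text):
    """Parse Prometheus text format into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{")[0]  # drop the label set if present
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip unparseable samples
    return metrics

# Hypothetical usage against a node (hostname is a placeholder):
# metrics = parse_metrics(urlopen("http://dremio-node:9010/metrics").read().decode())
sample = 'jobs_active 3.0\nmemory_heap_used{area="heap"} 1.2e9\n'
print(parse_metrics(sample))
```

In practice you would point a Prometheus server's scrape configuration at port 9010 on each node rather than polling by hand; the snippet is only for spot checks.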
Available Prometheus Metrics
The following table describes the Prometheus metrics provided by Dremio and specifies which Dremio node roles support them:
Metric Name | Description | Main Coordinator | Scale-out Coordinator | Executor |
---|---|---|---|---|
jobs_active | Gauge showing the number of currently active jobs | Yes | Yes | No |
jobs_total | Counter of total jobs submitted, categorized by the type of query | Yes | Yes | No |
jobs.failed | Counter of failed jobs categorized by query types | Yes | No | No |
jobs.waiting | Gauge of currently waiting jobs categorized by queue | Yes | No | No |
dremio.memory.jvm_direct_current | Total direct memory (in bytes) given to the JVM | Yes | Yes | Yes |
memory.heap.committed | Committed heap memory as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.heap.init | Initialized heap memory as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.heap.max | Maximum amount of heap memory that can be allocated as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.heap.usage | Ratio of used heap memory to max heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.heap.used | Amount of used heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.non-heap.committed | Amount of non-heap memory committed, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.non-heap.init | Initialized non-heap memory as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.non-heap.max | Maximum amount of non-heap memory that can be allocated, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.non-heap.usage | Ratio of used non-heap memory to max non-heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.non-heap.used | Amount of used non-heap memory, as described in Class MemoryUsage in the Oracle documentation | Yes | Yes | Yes |
memory.total.committed | Sum of heap and non-heap committed memory (in bytes) | Yes | Yes | Yes |
memory.total.init | Sum of heap and non-heap initialized memory (in bytes) | Yes | Yes | Yes |
memory.total.max | Sum of the heap and non-heap max memory (in bytes) | Yes | Yes | Yes |
memory.total.used | Sum of the heap and non-heap used memory (in bytes) | Yes | Yes | Yes |
reflections.failed | Scheduled Reflections that have failed and won't be retried | Yes | No | No |
reflections.unknown | Reflections for which an error occurred in the Reflection manager and that could not be retried | Yes | No | No |
reflections.active | Currently active Reflections | Yes | No | No |
reflections.refreshing | Reflections that are currently refreshing or pending a refresh | Yes | No | No |
reflections.manager_sync | Time taken to run Reflection management | Yes | Yes | No |
threads.blocked.count | Gauge of currently blocked threads | Yes | Yes | Yes |
threads.count | Gauge of active and idle threads | Yes | Yes | Yes |
threads.daemon.count | Number of currently available active daemon threads | Yes | Yes | Yes |
threads.deadlock.count | Number of currently deadlocked threads | Yes | Yes | Yes |
threads.new.count | Current number of threads in new state (not yet started) | Yes | Yes | Yes |
threads.runnable.count | Current number of threads in runnable state (executing) | Yes | Yes | Yes |
threads.terminated.count | Current number of threads in the terminated state (completed execution) | Yes | Yes | Yes |
threads.timed_waiting.count | Current number of threads in the timed_waiting state | Yes | Yes | Yes |
threads.waiting.count | Current number of threads in the waiting state | Yes | Yes | Yes |
jvm.gc.overhead.percent | An approximate percentage of CPU time used by garbage collection activities | Yes | Yes | No |
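As a hypothetical example of turning these gauges into a simple health check, the sketch below flags a node when heap usage or GC overhead crosses a threshold. The thresholds (85% heap, 10% GC overhead) are illustrative assumptions, not Dremio recommendations.

```python
# Hedged sketch: flag a node based on two gauges from the table above.
# Thresholds are illustrative assumptions; tune them for your deployment.
def node_warnings(metrics, heap_ratio_max=0.85, gc_overhead_max=10.0):
    warnings = []
    # memory.heap.usage is already a used/max ratio per the table above
    if metrics.get("memory.heap.usage", 0.0) > heap_ratio_max:
        warnings.append("heap usage above %.0f%%" % (heap_ratio_max * 100))
    # jvm.gc.overhead.percent is only exposed on coordinators
    if metrics.get("jvm.gc.overhead.percent", 0.0) > gc_overhead_max:
        warnings.append("GC overhead above %.0f%%" % gc_overhead_max)
    return warnings

print(node_warnings({"memory.heap.usage": 0.91, "jvm.gc.overhead.percent": 4.2}))
```

In a real deployment, the same logic would typically live in Prometheus alerting rules rather than application code.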
Parameters to Monitor for Scaling Capacity
The following parameters, derived from queries.json, can help identify when additional engines or vertical scaling are needed to maintain performance.
Query Execution Errors
By reviewing the outcomeReason field in queries.json, you can identify resource-related issues and take proactive steps, such as scaling engines or redistributing workloads, to maintain performance and stability.
Error Type (outcomeReason) | Recommended Threshold | Action |
---|---|---|
OUT_OF_MEMORY | 1% of queries running out of direct memory | Add an engine and move workload |
RESOURCE ERROR | 1% of queries running out of heap memory | Add an engine and move workload |
ExecutionSetupException | 1% of queries exhibiting node disconnects | Add an engine and move workload |
ChannelClosedException (fabric server) | 1% of queries exhibiting node disconnects | Add an engine and move workload |
CONNECTION ERROR: Exceeded timeout | 1% of queries exhibiting node disconnects | Add an engine and move workload |
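A sketch of how you might measure these error rates against the 1% thresholds above, assuming queries.json is newline-delimited JSON with an outcomeReason field per record (the exact record layout may vary between Dremio versions):

```python
# Sketch: share of queries whose outcomeReason matches a resource-related
# pattern from the table above. Assumes newline-delimited JSON records
# with an "outcomeReason" field; adjust for your Dremio version.
import json

RESOURCE_PATTERNS = (
    "OUT_OF_MEMORY", "RESOURCE ERROR", "ExecutionSetupException",
    "ChannelClosedException", "CONNECTION ERROR",
)

def resource_error_rate(lines):
    total = failed = 0
    for line in lines:
        rec = json.loads(line)
        total += 1
        reason = rec.get("outcomeReason") or ""
        if any(p in reason for p in RESOURCE_PATTERNS):
            failed += 1
    return failed / total if total else 0.0

sample = [
    '{"queryId": "a", "outcomeReason": ""}',
    '{"queryId": "b", "outcomeReason": "OUT_OF_MEMORY: query was cancelled"}',
]
print(resource_error_rate(sample))  # 0.5, well above the 1% threshold
```

In production you would read the file itself (e.g. `open("queries.json")` in place of the sample list) and track each error type separately, since each maps to a different remediation.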
Job State Durations
Use the job state durations (provided in milliseconds) in queries.json to address SLA breaches.
Job State (queries.json) | Recommended Threshold | Action |
---|---|---|
Total Duration (finish - start) | p90 SLA aligns with your needs | Add all the states below |
Pending (pendingTime) | p90 should not exceed 2000 milliseconds | Vertically scale the main coordinator node |
Metadata Retrieval (metadataRetrievalTime) | p90 should not exceed 5000 milliseconds | Switch to a table format if the raw data is Parquet |
Planning (planningTime) | p90 should not exceed 2000 milliseconds | Vertically scale the main coordinator node |
Queued (queuedTime) | p90 should not exceed 2000 milliseconds | Add an engine and move workload |
Execution Planning (executionPlanningTime) | p90 should not exceed 2000 milliseconds | Vertically scale the main coordinator node |
Starting (startingTime) | p90 should not exceed 2000 milliseconds | Add an engine and move workload |
Running (runningTime) | p90 SLA aligns with your needs | Add an engine and move workload |
The field names shown in parentheses above are the corresponding keys in queries.json.
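The p90 checks above can be sketched as a small script. This is a minimal illustration, assuming newline-delimited JSON records with the field names from the parentheses in the table; records missing a field are treated as 0 milliseconds.

```python
# Sketch: nearest-rank p90 of each job-state duration (milliseconds)
# from queries.json records. Field names follow the table above.
import json

STATE_FIELDS = [
    "pendingTime", "metadataRetrievalTime", "planningTime",
    "queuedTime", "executionPlanningTime", "startingTime", "runningTime",
]

def p90(values):
    """Nearest-rank 90th percentile."""
    ordered = sorted(values)
    idx = max(0, int(round(0.9 * len(ordered))) - 1)
    return ordered[idx]

def state_p90s(lines):
    records = [json.loads(line) for line in lines]
    # Missing fields are treated as 0 ms; drop them instead if that skews results.
    return {f: p90([r.get(f, 0) for r in records]) for f in STATE_FIELDS}

sample = ['{"pendingTime": 100, "queuedTime": 50}',
          '{"pendingTime": 2600, "queuedTime": 40}']
print(state_p90s(sample)["pendingTime"])  # 2600, above the 2000 ms threshold
```

Comparing each resulting p90 against the thresholds in the table points at the corresponding action (vertically scale the coordinator, add an engine, and so on).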