Version: current [25.x]

Monitoring Dremio Nodes

There are various approaches for operational monitoring of Dremio nodes. This topic discusses collecting JMX metrics, but Dremio administrators can use other types of metrics, such as system telemetry.

Excluding Nodes

As a Dremio administrator, you can exclude a cluster node from being used for query execution. You may wish to exclude a node because queries are running slowly and you want to run diagnostics, or because you want to upgrade a component on the node, such as firmware.

note

Excluding a node has no effect on accelerated queries whose reflections are stored on PDFS.

To exclude a node:

  1. From the Dremio UI, navigate to Admin > Cluster > Node Activity.
  2. Click on the Actions icon for the node.
  3. Select Avoid using this executor node.
  4. Select Yes, I am sure. in the Node status change popup.

Alternatively, to use a node that has been excluded:

  1. From the Dremio UI, navigate to Admin > Cluster > Node Activity.
  2. From the Actions icon for the excluded node, select Use this node for execution.
  3. Select Yes, I am sure. in the Node status change popup.

Monitoring JMX Metrics

Dremio recommends monitoring the following metrics:

  • Heap memory usage and GC frequency
  • Direct memory usage
  • Used lightweight threads

Heap Memory Usage and GC Frequency

Dremio uses heap memory for planning, coordination, UI serving, query management, connection management, and some types of record reading. Heap memory usage is expected to be higher on coordinator nodes in high-concurrency deployments. Sustained high garbage collection (GC) frequency together with high heap memory usage indicates an undersized cluster or node.

Heap usage can be tracked via memory.heap.usage.

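Garbage collection can be tracked by monitoring how the cumulative counts and times change over time: gc.PS-MarkSweep.count, gc.PS-MarkSweep.time, gc.PS-Scavenge.count, gc.PS-Scavenge.time. The collector names in these metrics depend on the garbage collector that the JVM is running; the Available JMX Metrics table, for example, lists gc.G1-Old-Generation.count and gc.G1-Old-Generation.time. Note that garbage collection logging is enabled by default on all Dremio nodes.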
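Because the GC metrics are cumulative, a single reading says little; what matters is the difference between two readings. The following Python sketch shows one way to turn two samples into a GC frequency and a share of time spent in GC, and to combine them with memory.heap.usage. The GcSample container, the function names, and all thresholds are hypothetical, and the sketch assumes that the time metric is reported in milliseconds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GcSample:
    """One reading of a cumulative GC counter pair (hypothetical container)."""
    timestamp_s: float  # when the reading was taken, in seconds
    count: int          # cumulative collections, e.g. gc.PS-Scavenge.count
    time_ms: int        # cumulative GC time, e.g. gc.PS-Scavenge.time (assumed ms)


def gc_rates(prev: GcSample, curr: GcSample) -> tuple[float, float]:
    """Return (collections per second, fraction of wall time spent in GC)."""
    elapsed_s = curr.timestamp_s - prev.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr.count < prev.count or curr.time_ms < prev.time_ms:
        # Cumulative counters only decrease when the JVM has restarted.
        raise ValueError("counter reset detected; discard this interval")
    per_second = (curr.count - prev.count) / elapsed_s
    gc_fraction = (curr.time_ms - prev.time_ms) / (elapsed_s * 1000.0)
    return per_second, gc_fraction


def looks_undersized(
    heap_usage: float,
    per_second: float,
    gc_fraction: float,
    heap_limit: float = 0.85,        # illustrative threshold, not a Dremio default
    gc_per_second_limit: float = 1.0,  # illustrative threshold
    gc_fraction_limit: float = 0.10,   # illustrative threshold
) -> bool:
    """Flag a node only when heap usage AND GC activity are both high."""
    gc_is_busy = per_second >= gc_per_second_limit or gc_fraction >= gc_fraction_limit
    return heap_usage >= heap_limit and gc_is_busy
```

A monitoring job would call gc_rates on consecutive samples and alert only if looks_undersized stays true across several intervals, since a single busy interval is normal.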

Direct Memory Usage

Dremio uses direct memory for query execution tasks, so it directly affects performance and concurrency. Dremio also uses direct memory for RPC communication between executor and coordinator nodes, as well as for communicating with end users. Direct memory is expected to be used heavily on executor nodes during query execution. Sustained high direct memory usage indicates that the cluster or node is approaching its capacity.

Direct memory allocated/used by the execution engine can be tracked via dremio.memory.direct_current. Total direct memory given to the JVM can be tracked via dremio.memory.jvm_direct_current.
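Short spikes in direct memory are normal during query execution, so an alert is more useful when it reacts only to continued high usage. The Python sketch below checks whether the most recent readings of dremio.memory.direct_current all sit close to a limit. The function name, the limit parameter (the direct memory available to the node, supplied by the operator in the same unit as the samples), and the default thresholds are assumptions for illustration.

```python
from typing import Sequence


def sustained_high_usage(
    samples: Sequence[float],
    limit: float,
    threshold: float = 0.9,    # illustrative: "high" means at least 90% of the limit
    min_consecutive: int = 5,  # illustrative: how many recent readings must be high
) -> bool:
    """Return True when the most recent readings all sit near the limit.

    samples: readings of dremio.memory.direct_current, oldest first.
    limit:   direct memory available to the node, same unit as samples
             (hypothetical input supplied by the operator).
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough history to call the usage "continued"
    recent = samples[-min_consecutive:]
    return all(value / limit >= threshold for value in recent)
```

Requiring several consecutive high readings trades alert latency for fewer false alarms; the window length should be tuned to the sampling interval.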

Used Lightweight Threads

Depending on how much work Dremio is doing, the system can be aggressively parallel. Dremio is designed to let even a single query use as many cores as are available to the process. The number of threads running on each executor node describes the total amount of parallelization Dremio is using. This number may be substantially higher than the number of cores because Dremio schedules efficiently between threads.

If this metric exceeds 10 to 20 times the number of logical cores, contention is probably slowing down individual queries. To understand the impact of contention on a specific query, open the query profile and check the "Wait Time" value for each phase under the "Thread Overview" section. Wait time is how long a lightweight thread had work available but was not scheduled because of CPU contention. Note that this thread count does not correlate directly with kernel threads.

Lightweight threads can be tracked via dremio.exec.work.running_fragments.
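The 10 to 20 times guideline can be expressed as a simple ratio check. The Python sketch below divides a reading of dremio.exec.work.running_fragments by the number of logical cores and classifies the result. The function names and the three labels are hypothetical; the two cut-offs come from the guideline above, and os.cpu_count() reflects the machine running the script, so pass the core count explicitly when checking a remote executor.

```python
import os
from typing import Optional


def contention_ratio(running_fragments: int, logical_cores: Optional[int] = None) -> float:
    """Lightweight threads per logical core on one executor node."""
    if logical_cores is None:
        # Falls back to the local machine; pass the executor's core count
        # explicitly when this script runs somewhere else.
        logical_cores = os.cpu_count() or 1
    if logical_cores <= 0:
        raise ValueError("logical_cores must be positive")
    if running_fragments < 0:
        raise ValueError("running_fragments cannot be negative")
    return running_fragments / logical_cores


def contention_level(ratio: float, warn_at: float = 10.0, critical_at: float = 20.0) -> str:
    """Map the ratio onto the 10x to 20x guideline (labels are illustrative)."""
    if ratio > critical_at:
        return "high"
    if ratio > warn_at:
        return "elevated"
    return "ok"
```

For example, an executor with 16 logical cores and 400 running fragments has a ratio of 25 and would be classified as "high", which is a cue to inspect wait times in the affected query profiles.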

Enabling Node Metrics

Dremio enables node monitoring by default. To manually enable node monitoring:

  1. Add the following properties to dremio-env on each Dremio node in your deployment:

    Properties to add to dremio-env file
    DREMIO_JAVA_SERVER_EXTRA_OPTS='
    -Dcom.sun.management.jmxremote.port=<monitoring port>
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false'
  2. Create a telemetry configuration file named dremio-telemetry.yaml in the $DREMIO_HOME/conf folder with the following contents:

    Properties to add to dremio-telemetry.yaml file
    # Control whether the dremio-telemetry.yaml file should automatically reload
    # and if so, the interval at which it should reload.
    auto-reload:
      enabled: True
      period: 90
      unit: SECONDS

    metrics:
      - name: jmx_reporter
        comment: >
          Publish metrics on jmx
        reporter:
          type: jmx
          rate: SECONDS
          duration: MILLISECONDS

Accessing Node Metrics

You can access your Dremio node metrics using jconsole or another Java Agent that collects JMX metrics.

To access node metrics with jconsole:

  1. Run the following command on a Dremio host machine:

    jconsole command
    jconsole
  2. In the JConsole: New Connection modal, double-click com.dremio.dac.daemon.DremioDaemon.

  3. In the Java Monitoring & Management Console, click the MBeans tab.

  4. Click metrics, then click gauges.

  5. Click the metric that you want to view.

  6. Click Attributes, then click Value.

    jconsole displays the value for the selected metric.

caution

In production environments, Dremio strongly recommends using both SSL client certificates to authenticate the client host and password authentication for user management. See Monitoring and Management Using JMX Technology, Out-of-the-Box Monitoring and Management Properties section, for more configuration information.

Available JMX Metrics

The following table describes the JMX metrics provided by Dremio and specifies which Dremio node roles support them:

| Metric Name | Description | Master Coordinator | Secondary Coordinator | Executor |
| --- | --- | --- | --- | --- |
| jobs.active | Currently active jobs | Yes | Yes | No |
| jobs.command_pool.active_threads | Currently active commands | Yes | Yes | Yes |
| jobs.command_pool.queue_size | Current size of queued commands | Yes | Yes | Yes |
| jobs.long_running | Top 25 longest running queries (minimum 10s in length) with periodic decay | Yes | No | No |
| jobs.active_15m | Number of jobs in 15 minute period | Yes | No | No |
| jobs.failed | Number of failed jobs in 15 minute period | Yes | No | No |
| jobs.active_1d | Number of active jobs in one day period | Yes | No | No |
| jobs.failed_1d | Number of failed jobs in one day period | Yes | No | No |
| jobs.queue.<queue_name>.waiting | Number of current waiting jobs | Yes | No | No |
| fragments.active | Currently active fragments | No | No | Yes |
| buffer-pool.direct.capacity | Total capacity of the buffers in the direct pool | Yes | Yes | Yes |
| buffer-pool.direct.count | Number of buffers in the direct pool | Yes | Yes | Yes |
| buffer-pool.direct.used | Memory used for the direct buffer pool | Yes | Yes | Yes |
| buffer-pool.mapped.capacity | Total capacity of the buffers in the mapped pool | Yes | Yes | Yes |
| buffer-pool.mapped.count | Number of buffers in the mapped pool | Yes | Yes | Yes |
| buffer-pool.mapped.used | Memory used for the mapped buffer pool | Yes | Yes | Yes |
| dremio.memory.direct_current | Direct memory allocated/used by the execution engine | Yes | Yes | Yes |
| dremio.memory.jvm_direct_current | Total direct memory given to the JVM | Yes | Yes | Yes |
| dremio.memory.remaining_heap_allocations | Remaining heap allocation space for Dremio | Yes | Yes | Yes |
| dremio.G1-Young-Generation.count | Young generation of garbage collection objects | Yes | Yes | |
| dremio.G1-Young-Generation.time | Time target for young generation of garbage collection objects | Yes | Yes | Yes |
| gc.G1-Old-Generation.count | Old Generation of Garbage collected objects | Yes | Yes | Yes |
| gc.G1-Old-Generation.time | Time target for old generation of GC objects | Yes | Yes | Yes |
| kvstore.* | Metrics related to the KVStore | Yes | No | No |
| maestro.active | Maestro active | Yes | Yes | No |
| memory.heap.committed | Amount of heap memory that is committed | Yes | Yes | Yes |
| memory.heap.init | Amount of heap memory requested at initialization | Yes | Yes | Yes |
| memory.heap.max | Maximum amount of heap memory that can be used | Yes | Yes | Yes |
| memory.heap.usage | Ratio of memory.heap.used to memory.heap.max | Yes | Yes | Yes |
| memory.heap.used | Amount of used heap memory | Yes | Yes | Yes |
| memory.non-heap.committed | Amount of non-heap memory that is committed | Yes | Yes | Yes |
| memory.non-heap.init | Amount of non-heap memory requested at initialization | Yes | Yes | Yes |
| memory.non-heap.max | Maximum amount of non-heap memory that can be used | Yes | Yes | Yes |
| memory.non-heap.usage | Ratio of memory.non-heap.used to memory.non-heap.max | Yes | Yes | Yes |
| memory.non-heap.used | Amount of used non-heap memory | Yes | Yes | Yes |
| memory.pools.Code-Cache.init | Memory committed at initialization from the memory pool used for compilation and storage of native code | Yes | Yes | Yes |
| memory.pools.Code-Cache.usage | Ratio of memory.pools.Code-Cache.used to memory.pools.Code-Cache.max | Yes | Yes | Yes |
| memory.pools.Compressed-Class-Space.committed | Memory committed from the memory pool used for class metadata | Yes | Yes | Yes |
| memory.pools.Compressed-Class-Space.init | Memory requested at initialization from the memory pool used for class metadata | Yes | Yes | Yes |
| memory.pools.Compressed-Class-Space.max | Maximum size of the memory pool used for class metadata | Yes | Yes | Yes |
| memory.pools.Compressed-Class-Space.usage | Collection usage from the memory pool used for class metadata | Yes | Yes | Yes |
| memory.pools.Compressed-Class-Space.used | Memory used by the memory pool used for class metadata | Yes | Yes | Yes |
| memory.pools.PS-Eden-Space.usage | Ratio of PS-Eden-Space.used to PS-Eden-Space.max | Yes | Yes | No |
| memory.pools.PS-Old-Gen.usage | Ratio of PS-Old-Gen.used to PS-Old-Gen.max | Yes | Yes | No |
| memory.pools.PS-Survivor-Space.usage | Ratio of PS-Survivor-Space.used to PS-Survivor-Space.max | Yes | Yes | No |
| memory.total.committed | Amount of memory that is committed to use | Yes | Yes | Yes |
| memory.total.init | Amount of memory requested at initialization in bytes | Yes | Yes | Yes |
| memory.total.max | Maximum amount of memory that can be used | Yes | Yes | Yes |
| memory.total.used | Amount of used memory | Yes | Yes | Yes |
| planner.plan_cache_entries | Number of plan cache entries | Yes | Yes | No |
| planner.plan_cache_sync | Time taken to invalidate plan cache entries due to reflections being created, deleted, or refreshed | Yes | Yes | No |
| planner.view_schema_learning | Counter for view schema learning | Yes | Yes | No |
| reflections.materialization_cache_entries | Number of materialization cache entries | Yes | Yes | No |
| reflections.materialization_cache_sync | Time taken to update the materialization cache | Yes | Yes | No |
| reflections.failed | Scheduled reflections that have failed and won't be retried | Yes | No | No |
| reflections.unknown | Reflections for which an error occurred in the reflection manager and that could not be retried | Yes | No | No |
| reflections.active | Currently active reflections | Yes | No | No |
| reflections.refreshing | Reflections that are currently refreshing or pending a refresh | Yes | No | No |
| reflections.manager_sync | Time taken to run reflection management | Yes | Yes | No |
| reflections.materialization_cache_errors | Number of reflections that could not be successfully loaded into the materialization cache and thus not available for query acceleration | Yes | Yes | No |
| rpc.bit.data.current | Maximum amount of memory used by all RPC connections | No | No | Yes |
| rpc.bit.data.peak | Peak amount of memory used by all RPC connections | No | No | Yes |
| rpc.data.current | Maximum amount of memory used by all RPC connections | No | No | Yes |
| rpc.failure_15m | RPC connection failures in 15 minute period | Yes | Yes | Yes |
| rpc.failure_1d | RPC connection failures in 1 day period | Yes | Yes | Yes |
| rpc.peers | Number of active peer connections | Yes | Yes | Yes |
| threads.blocked.count | Number of currently blocked threads | Yes | Yes | Yes |
| threads.count | Current number of active and idle threads | Yes | Yes | Yes |
| threads.daemon.count | Number of currently available active daemon threads | Yes | Yes | Yes |
| threads.deadlock.count | Number of currently deadlocked threads | Yes | Yes | Yes |
| threads.deadlocks | Collection of information about the currently deadlocked threads | Yes | Yes | Yes |
| threads.new.count | Current number of threads in new state (not yet started) | Yes | Yes | Yes |
| threads.runnable.count | Current number of threads in runnable state (executing) | Yes | Yes | Yes |
| threads.terminated.count | Current number of threads in terminated state (completed execution) | Yes | Yes | Yes |
| threads.timed_waiting.count | Current number of threads in the timed_waiting state | Yes | Yes | Yes |
| threads.waiting.count | Current number of threads in the waiting state | Yes | Yes | Yes |
| planner.plan_cache_queries | Counter for SELECT query plan cache hits and misses, including an outcome tag for hit-and-miss reasons | Yes | Yes | No |
| planner.plan_cache_puts | Counter for plan cache puts, including an outcome tag for adding or not adding an entry | Yes | Yes | No |