Monitoring Nodes

There are various approaches for operational monitoring of Dremio nodes.

Health Checks

The Server Status REST endpoint can be used to check the health of a node. See Server Status for more information.


On a responsive node, the following response should be returned:


A request to this endpoint that fails to complete, or that is slow to return a response, might signal problems with the node.

Note that when internal user management is in use and no users have been defined, this endpoint returns the following error:

'No User Available'
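
The health check described above can be scripted. The following is a minimal sketch, not an official client: the host, port, and path in STATUS_URL are assumptions to adjust for your deployment, the function name probe_node, its result labels, and the one-second slow_after threshold are illustrative, and whether the 'No User Available' error arrives in a normal body or in an HTTP error response is not specified here, so the sketch checks both.

```python
import time
import urllib.error
import urllib.request

# Assumption: default coordinator address and status path; adjust as needed.
STATUS_URL = "http://localhost:9047/apiv2/server_status"


def probe_node(url=STATUS_URL, timeout=5.0, slow_after=1.0,
               opener=urllib.request.urlopen):
    """Classify a node as 'healthy', 'slow', 'no_users', or 'failed'."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        # The server answered with an error status; inspect the error body.
        body = exc.read().decode("utf-8", "replace")
        return "no_users" if "No User Available" in body else "failed"
    except OSError:
        # Covers URLError, refused connections, and socket timeouts.
        return "failed"
    elapsed = time.monotonic() - start
    if "No User Available" in body:
        return "no_users"
    # A completed but delayed response is reported separately from a failure.
    return "slow" if elapsed > slow_after else "healthy"


if __name__ == "__main__":
    print(probe_node())
```

The opener parameter exists only so that the function can be exercised without a live node; in normal use the default is sufficient.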

Blacklisting Nodes

As of Dremio 3.3, you can blacklist a node so that it is excluded from query execution in your Dremio cluster.

You may want to blacklist a node because queries are running slowly and you want to run diagnostics, or because you want to upgrade a component on the node (such as firmware).

[info] Limitations

Blacklisting does not work for accelerated queries when the reflections are stored on PDFS.

To blacklist a node:

  1. From the Dremio UI, navigate to Admin > Cluster > Node Activity.
  2. Click on the Actions icon for the node.
  3. Select Avoid using this executor node.
  4. In the Node status change popup, select Yes, I am sure.

Alternatively, to whitelist a node that has been blacklisted:

  1. From the Dremio UI, navigate to Admin > Cluster > Node Activity.
  2. From the Actions icon for the blacklisted node, select Use this node for execution.
  3. In the Node status change popup, select Yes, I am sure.

Metrics Monitoring using JMX

Monitoring the following metrics is recommended:

  • Heap memory usage and GC frequency
  • Direct memory usage
  • Used lightweight threads

Heap Memory Usage and GC Frequency

Dremio uses heap memory for tasks such as planning, coordination, UI serving, query management, connection management, and some types of record reading. Heap memory usage is expected to be higher on coordinator nodes in high-concurrency deployments. Sustained high garbage collection (GC) frequency together with high heap memory usage indicates an undersized cluster or node.

Heap usage can be tracked via memory.heap.usage.

Garbage collection can be tracked by monitoring how the following cumulative counts and times change over time: gc.PS-MarkSweep.count, gc.PS-MarkSweep.time, gc.PS-Scavenge.count, gc.PS-Scavenge.time. Note that garbage collection logging is enabled by default on all Dremio nodes.
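
Because the GC metrics are cumulative, they only become meaningful as deltas between two readings. The following sketch shows one way to turn two readings into a GC rate and combine it with heap usage. It assumes that memory.heap.usage is a ratio between 0 and 1 and that the GC times are in milliseconds; the Sample type, the function name, and both limits are illustrative values, not Dremio recommendations.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of the JMX values at a point in time."""
    timestamp_s: float  # when the reading was taken, in seconds
    heap_usage: float   # memory.heap.usage, assumed to be a 0..1 ratio
    gc_count: int       # gc.PS-MarkSweep.count + gc.PS-Scavenge.count
    gc_time_ms: int     # gc.PS-MarkSweep.time + gc.PS-Scavenge.time


def gc_pressure(prev, curr, heap_limit=0.85, gc_time_limit=0.10):
    """Return (collections_per_min, gc_time_fraction, flagged)."""
    interval_s = curr.timestamp_s - prev.timestamp_s
    if interval_s <= 0:
        raise ValueError("samples must be in increasing time order")
    # Cumulative counters: subtract to get activity within the interval.
    per_min = (curr.gc_count - prev.gc_count) * 60.0 / interval_s
    time_fraction = (curr.gc_time_ms - prev.gc_time_ms) / (interval_s * 1000.0)
    # Flag only when high heap usage and heavy GC are observed together.
    flagged = curr.heap_usage >= heap_limit and time_fraction >= gc_time_limit
    return per_min, time_fraction, flagged
```

A single flagged interval is not conclusive; the condition should persist across many intervals before the node is treated as undersized.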

Direct Memory Usage

Dremio uses direct memory for query execution tasks, so it directly affects performance and concurrency. Dremio also uses direct memory for RPC communication between executor and coordinator nodes, as well as for communicating with end users. Direct memory is expected to be used heavily on executor nodes during query execution. Sustained high direct memory usage indicates that the cluster or node is approaching its capacity.

Direct memory allocated and used by the execution engine can be tracked via dremio.memory.direct_current. Total direct memory given to the JVM can be tracked via dremio.memory.jvm_direct_current.
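
Since short spikes in direct memory are normal during query execution, an alert is more useful when it fires only on sustained usage. The sketch below checks whether the most recent readings of dremio.memory.direct_current all stay above a fraction of capacity. The capacity_bytes input (the direct memory configured for the node), the 90% threshold, and the window of five readings are hypothetical choices for illustration, and the metric is assumed to be reported in bytes.

```python
def sustained_high(samples, capacity_bytes, threshold=0.9, window=5):
    """True if the last `window` samples all exceed threshold * capacity."""
    if capacity_bytes <= 0 or window <= 0:
        raise ValueError("capacity_bytes and window must be positive")
    if len(samples) < window:
        # Not enough history yet to call the usage sustained.
        return False
    limit = threshold * capacity_bytes
    return all(value > limit for value in samples[-window:])
```

For example, with a capacity of 100 units, five consecutive readings of 95 are reported as sustained, while a single dip to 50 inside the window resets the condition.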

Used Lightweight Threads

Depending on how much work Dremio is doing, the system can be aggressively parallel: Dremio is designed to let even a single query use as many cores as are available to the process. The number of threads running on each executor node describes the total amount of parallelization Dremio is using. This number may be substantially higher than the number of cores, because Dremio is very effective at scheduling between different threads.

If this metric exceeds 10 to 20 times the number of logical cores, individual queries are probably being slowed down by contention. To understand the impact of contention on a per-query basis, open the query profile and view the “Wait Time” value under the “Thread Overview” section for each phase. This value describes how long each lightweight thread had work available but was not scheduled because of CPU contention. Note that this thread count is not directly correlated to kernel threads.
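
The 10 to 20 times guideline can be expressed as a simple ratio check. This is a sketch only: the function name and the "ok", "warning", and "critical" labels are illustrative, and logical_cores should be the core count of the executor node being monitored. The fallback to os.cpu_count() is correct only when the script runs on that node.

```python
import os


def contention_level(running_threads, logical_cores=None,
                     warn_ratio=10.0, critical_ratio=20.0):
    """Return (threads_per_core, level) for one executor node."""
    # Fall back to the local core count when none is supplied.
    cores = logical_cores or os.cpu_count() or 1
    ratio = running_threads / cores
    if ratio > critical_ratio:
        level = "critical"
    elif ratio > warn_ratio:
        level = "warning"
    else:
        level = "ok"
    return ratio, level
```

On an 8-core executor, 120 lightweight threads gives a ratio of 15 and falls in the warning band, which is a prompt to check “Wait Time” in the affected query profiles.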

Lightweight threads can be tracked via

Available JMX Metrics

  • buffer-pool.mapped.capacity
  • buffer-pool.mapped.count
  • buffer-pool.mapped.used
  • gc.PS-MarkSweep.count
  • gc.PS-MarkSweep.time
  • gc.PS-Scavenge.count
  • gc.PS-Scavenge.time
  • memory.heap.committed
  • memory.heap.init
  • memory.heap.max
  • memory.heap.usage
  • memory.heap.used
  • memory.non-heap.committed
  • memory.non-heap.init
  • memory.non-heap.max
  • memory.non-heap.usage
  • memory.non-heap.used
  • memory.pools.Code-Cache.usage
  • memory.pools.PS-Eden-Space.usage
  • memory.pools.PS-Old-Gen.usage
  • memory.pools.PS-Perm-Gen.usage
  • memory.pools.PS-Survivor-Space.usage
  • rpc.user.current
  • rpc.user.peak
  • rpcbit.control.current
  • rpcbit.control.peak
  • threads.blocked.count
  • threads.count
  • threads.daemon.count
  • threads.deadlocks
  • threads.runnable.count
  • threads.terminated.count
  • threads.timed_waiting.count
  • threads.waiting.count

Remote JMX Monitoring

To set up your environment for remote monitoring, add the following properties to each Dremio node's dremio-env file under the DREMIO_JAVA_SERVER_EXTRA_OPTS property (without newlines):

[info] Note: Local monitoring is enabled by default.


[warning] Warning

In production environments, Dremio strongly recommends using both SSL client certificates to authenticate the client host and password authentication for user management. See Monitoring and Management Using JMX Technology, Out-of-the-Box Monitoring and Management Properties section, for more configuration information.
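
As an illustration of the "without newlines" requirement, the sketch below assembles a single-line option string from standard JVM remote JMX system properties (com.sun.management.jmxremote.*). These are general JVM properties and are not necessarily the exact set this page intends; the port 9010, the helper names, and the quoting of the dremio-env line are assumptions. The defaults keep SSL, client certificate authentication, and password authentication turned on, in line with the warning above; the keystore and password file settings that those options need are not covered here.

```python
def jmx_remote_opts(port, ssl=True, authenticate=True, need_client_auth=True):
    """Build a single-line string of JVM remote JMX options."""
    props = {
        "com.sun.management.jmxremote.port": port,
        "com.sun.management.jmxremote.ssl": str(ssl).lower(),
        "com.sun.management.jmxremote.authenticate": str(authenticate).lower(),
    }
    if ssl:
        # Require the connecting client to present its own certificate.
        props["com.sun.management.jmxremote.ssl.need.client.auth"] = (
            str(need_client_auth).lower()
        )
    # Join with spaces so the value contains no newlines.
    return " ".join(f"-D{key}={value}" for key, value in props.items())


def dremio_env_line(opts):
    """Format the options as an assignment for the dremio-env file."""
    return f'DREMIO_JAVA_SERVER_EXTRA_OPTS="{opts}"'


if __name__ == "__main__":
    # 9010 is a hypothetical port chosen for the example.
    print(dremio_env_line(jmx_remote_opts(9010)))
```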
