Monitoring Dremio Nodes

There are various approaches for operational monitoring of Dremio nodes. This topic discusses collecting JMX metrics, but Dremio administrators can also collect other types of metrics, such as system telemetry.

Checking Node Health

The Server Status REST endpoint can be used to check the health of a node. See Server Status for more information.

/apiv2/server_status

On a responsive node, the following response should be returned:

'OK'

A failed request to this endpoint, or a delayed response, might signal problems with the node.

Note that when internal user management is in use and no users have been defined, this endpoint returns the following error:

'No User Available'
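
As a rough illustration, the check described above could be scripted along the following lines. This is a minimal sketch, not an official Dremio client: the base URL (http://localhost:9047), the 5-second timeout, the function names, and the returned state labels are all assumptions made for the example, and it assumes the endpoint is reachable without extra authentication headers.

```python
import urllib.error
import urllib.request

# Hypothetical values for illustration; adjust to your deployment.
BASE_URL = "http://localhost:9047"
TIMEOUT_SECONDS = 5.0


def classify_status(body: str) -> str:
    """Map the body returned by /apiv2/server_status to a coarse health state."""
    text = body.strip().strip("'\"")
    if text == "OK":
        return "healthy"
    if "No User Available" in text:
        return "no-users-defined"
    return "unexpected-response"


def check_node(base_url: str = BASE_URL, timeout: float = TIMEOUT_SECONDS) -> str:
    """Call the Server Status endpoint; treat failures and timeouts as unreachable."""
    url = base_url.rstrip("/") + "/apiv2/server_status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered with an error status; inspect its body.
        body = err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return "unreachable"
    return classify_status(body)
```

Calling check_node() against a node returns one of the labels above. Because delays also matter, a monitoring job might additionally record how long each call takes.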

Blacklisting Nodes

Dremio allows administrators to blacklist a node in a Dremio cluster so that it is not used for query execution. You may wish to blacklist a node because queries are running slowly and you want to run diagnostics, or because you want to upgrade a component on the node, such as firmware.

[info] Limitations

Blacklisting does not work for accelerated queries when the reflections are stored on PDFS.

To blacklist a node:

  1. From the Dremio UI, navigate to Admin > Cluster > Node Activity.
  2. Click on the Actions icon for the node.
  3. Select Avoid using this executor node.
  4. In the Node status change popup, select Yes, I am sure.

To whitelist a node that has been blacklisted:

  1. From the Dremio UI, navigate to Admin > Cluster > Node Activity.
  2. From the Actions icon for the blacklisted node, select Use this node for execution.
  3. In the Node status change popup, select Yes, I am sure.

Monitoring JMX Metrics

Dremio recommends monitoring the following metrics:

  • Heap memory usage and GC frequency
  • Direct memory usage
  • Used lightweight threads

Heap Memory Usage and GC Frequency

Dremio uses heap memory for planning, coordination, UI serving, query management, connection management, and some types of record reading. Heap usage is expected to be higher on coordinator nodes in high-concurrency deployments. Sustained high garbage collection (GC) frequency combined with sustained high heap memory usage indicates an undersized cluster or node.

Heap usage can be tracked via memory.heap.usage.

Garbage collection can be tracked by monitoring how the cumulative counts and times change over time: gc.PS-MarkSweep.count, gc.PS-MarkSweep.time, gc.PS-Scavenge.count, gc.PS-Scavenge.time. Note that garbage collection logging is enabled by default on all Dremio nodes.
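
Because these counters are cumulative, a single reading says little; what matters is the change between two readings. The following is a minimal sketch of that calculation. The GcSample type and gc_rates function are hypothetical names, and the sketch assumes the time counter is reported in milliseconds, as with the standard JVM garbage collector MXBeans.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GcSample:
    """One reading of a collector's cumulative counters."""
    timestamp_s: float  # when the sample was taken, in seconds
    count: int          # e.g. the value of gc.PS-MarkSweep.count
    time_ms: int        # e.g. the value of gc.PS-MarkSweep.time (assumed milliseconds)


def gc_rates(previous: GcSample, current: GcSample) -> Tuple[float, float]:
    """Return (collections per minute, fraction of wall-clock time spent in GC)."""
    elapsed_s = current.timestamp_s - previous.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    delta_count = current.count - previous.count
    delta_time_ms = current.time_ms - previous.time_ms
    if delta_count < 0 or delta_time_ms < 0:
        # Counters went backwards, which suggests the JVM restarted between samples.
        raise ValueError("counter reset detected; discard this interval")
    per_minute = delta_count * 60.0 / elapsed_s
    gc_fraction = (delta_time_ms / 1000.0) / elapsed_s
    return per_minute, gc_fraction
```

For example, two samples taken 60 seconds apart with the count rising from 10 to 16 and the time rising from 500 to 3500 give 6 collections per minute and 5% of wall-clock time spent in GC. What counts as too high depends on the workload, so any alert threshold is a local decision.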

Direct Memory Usage

Dremio uses direct memory for query execution tasks, so it directly affects performance and concurrency. Dremio also uses direct memory for RPC communication between executor and coordinator nodes, as well as for communicating with end users. Direct memory is expected to be used heavily on the executor nodes during query execution. Sustained high direct memory usage indicates that the cluster or node is approaching its capacity.

Direct memory allocated/used by the execution engine can be tracked via dremio.memory.direct_current. Total direct memory given to the JVM can be tracked via dremio.memory.jvm_direct_current.
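
The word "continued" matters here: a short spike during a heavy query is expected, while usage that stays high across many samples is the signal to watch. One way to express that distinction is sketched below. The function name, the 90% threshold, and the five-sample window are assumptions chosen for illustration, and capacity stands for whatever direct memory limit is configured for the node.

```python
from typing import Sequence


def sustained_high(samples: Sequence[float], capacity: float,
                   threshold: float = 0.9, min_consecutive: int = 5) -> bool:
    """Return True if the latest `min_consecutive` samples are all at or above
    `threshold` times `capacity`.

    `samples` are readings of a metric such as dremio.memory.direct_current,
    oldest first, in the same unit as `capacity`.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough history to call the usage sustained
    recent = samples[-min_consecutive:]
    return all(value / capacity >= threshold for value in recent)
```

The same check could be applied to memory.heap.usage by passing a capacity of 1.0, since that metric is already a ratio.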

Used Lightweight Threads

Depending on how much work Dremio is doing, the system can be aggressively parallel: Dremio is designed to allow even a single query to use as many cores as are available to the process. The number of threads running on each executor node describes the total amount of parallelization Dremio is using. This number may be substantially higher than the number of cores, because Dremio schedules efficiently across threads.

If this metric exceeds 10 to 20 times the number of logical cores, individual queries are probably slowing down due to contention. To understand the impact of contention on a specific query, open the query profile and view the “Wait Time” value in the “Thread Overview” section for each phase. This value describes how long each lightweight thread had work available but was not scheduled because of CPU contention. Note that this thread count is not directly correlated with kernel threads.

Lightweight threads can be tracked via dremio.exec.work.running_fragments.
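
The 10-20x guideline above can be turned into a simple ratio check. The sketch below is illustrative only: the function names and the "ok", "watch", and "likely-contended" labels are assumptions. Note that os.cpu_count() reports the logical cores of the machine the script runs on, so pass the executor's core count explicitly when checking from another host.

```python
import os
from typing import Optional


def fragments_per_core(running_fragments: int,
                       logical_cores: Optional[int] = None) -> float:
    """Ratio of dremio.exec.work.running_fragments to logical cores."""
    cores = logical_cores if logical_cores is not None else os.cpu_count()
    if not cores or cores <= 0:
        raise ValueError("number of logical cores is unknown or invalid")
    return running_fragments / cores


def contention_level(ratio: float, warn_at: float = 10.0,
                     critical_at: float = 20.0) -> str:
    """Classify the ratio against the 10x and 20x guideline values."""
    if ratio > critical_at:
        return "likely-contended"
    if ratio > warn_at:
        return "watch"
    return "ok"
```

For instance, 400 running fragments on a 16-core executor is a ratio of 25, which this sketch labels as likely contended; the query profile's “Wait Time” remains the way to confirm the effect on a given query.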

Enabling Node Metrics

Dremio enables node monitoring by default. To manually enable node monitoring, add the following properties to the dremio-env file on each Dremio node in your deployment.

DREMIO_JAVA_SERVER_EXTRA_OPTS='
-Dcom.sun.management.jmxremote.port=<monitoring port>
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false'

[warning] Warning

In production environments, Dremio strongly recommends using both SSL client certificates to authenticate the client host and password authentication for user management. See Monitoring and Management Using JMX Technology, Out-of-the-Box Monitoring and Management Properties section, for more configuration information.

Available JMX Metrics

The following table describes the JMX metrics provided by Dremio and specifies which Dremio node roles support them:

Metric Name | Description | Master Coordinator | Secondary Coordinator | Executor
jobs.active | Currently active jobs | Yes | Yes | No
jobs.command_pool.active_threads | Currently active commands | Yes | Yes | Yes
jobs.command_pool.queue_size | Current size of queued commands | Yes | Yes | Yes
jobs.long_running | Top 25 longest running queries (minimum 10s in length) with periodic decay | Yes | No | No
jobs.active_15m | Number of jobs in 15 minute period | Yes | No | No
jobs.failed | Number of failed jobs in 15 minute period | Yes | No | No
jobs.active_1d | Number of active jobs in one day period | Yes | No | No
jobs.failed_1d | Number of failed jobs in one day period | Yes | No | No
jobs.queue.<queue_name>.waiting | Number of current waiting jobs | Yes | No | No
fragments.active | Currently active fragments | No | No | Yes
buffer-pool.direct.capacity | Total capacity of the buffers in the direct pool | Yes | Yes | Yes
buffer-pool.direct.count | Number of buffers in the direct pool | Yes | Yes | Yes
buffer-pool.direct.used | Memory used for the direct buffer pool | Yes | Yes | Yes
buffer-pool.mapped.capacity | Total capacity of the buffers in the mapped pool | Yes | Yes | Yes
buffer-pool.mapped.count | Number of buffers in the mapped pool | Yes | Yes | Yes
buffer-pool.mapped.used | Memory used for the mapped buffer pool | Yes | Yes | Yes
dremio.memory.direct_current | Direct memory allocated/used by the execution engine | Yes | Yes | Yes
dremio.memory.jvm_direct_current | Total direct memory given to the JVM | Yes | Yes | Yes
dremio.memory.remaining_heap_allocations | Remaining heap allocation space for Dremio | Yes | Yes | Yes
dremio.G1-Young-Generation.count | Young generation of garbage collection objects | Yes | Yes | Yes
dremio.G1-Young-Generation.time | Time target for young generation of garbage collection objects | Yes | Yes | Yes
gc.G1-Old-Generation.count | Old generation of garbage collection objects | Yes | Yes | Yes
gc.G1-Old-Generation.time | Time target for old generation of garbage collection objects | Yes | Yes | Yes
kvstore.* | Metrics related to the KVStore | Yes | No | No
maestro.active | Maestro active | Yes | Yes | No
memory.heap.committed | Amount of heap memory that is committed | Yes | Yes | Yes
memory.heap.init | Amount of heap memory requested at initialization | Yes | Yes | Yes
memory.heap.max | Maximum amount of heap memory that can be used | Yes | Yes | Yes
memory.heap.usage | Ratio of memory.heap.used to memory.heap.max | Yes | Yes | Yes
memory.heap.used | Amount of used heap memory | Yes | Yes | Yes
memory.non-heap.committed | Amount of non-heap memory that is committed | Yes | Yes | Yes
memory.non-heap.init | Amount of non-heap memory requested at initialization | Yes | Yes | Yes
memory.non-heap.max | Maximum amount of non-heap memory that can be used | Yes | Yes | Yes
memory.non-heap.usage | Ratio of memory.non-heap.used to memory.non-heap.max | Yes | Yes | Yes
memory.non-heap.used | Amount of used non-heap memory | Yes | Yes | Yes
memory.pools.Code-Cache.init | Memory committed at initialization from the memory pool used for compilation and storage of native code | Yes | Yes | Yes
memory.pools.Code-Cache.usage | Ratio of memory.pools.Code-Cache.used to memory.pools.Code-Cache.max | Yes | Yes | Yes
memory.pools.Compressed-Class-Space.committed | Memory committed from the memory pool used for class metadata | Yes | Yes | Yes
memory.pools.Compressed-Class-Space.init | Memory requested at initialization from the memory pool used for class metadata | Yes | Yes | Yes
memory.pools.Compressed-Class-Space.max | Maximum size of the memory pool used for class metadata | Yes | Yes | Yes
memory.pools.Compressed-Class-Space.usage | Collection usage from the memory pool used for class metadata | Yes | Yes | Yes
memory.pools.Compressed-Class-Space.used | Memory used by the memory pool used for class metadata | Yes | Yes | Yes
memory.pools.PS-Eden-Space.usage | Ratio of PS-Eden-Space.used to PS-Eden-Space.max | Yes | Yes | No
memory.pools.PS-Old-Gen.usage | Ratio of PS-Old-Gen.used to PS-Old-Gen.max | Yes | Yes | No
memory.pools.PS-Survivor-Space.usage | Ratio of PS-Survivor-Space.used to PS-Survivor-Space.max | Yes | Yes | No
memory.total.committed | Amount of memory that is committed to use | Yes | Yes | Yes
memory.total.init | Amount of memory requested at initialization, in bytes | Yes | Yes | Yes
memory.total.max | Maximum amount of memory that can be used | Yes | Yes | Yes
memory.total.used | Amount of used memory | Yes | Yes | Yes
reflections.failed | Currently failed data reflections | Yes | No | No
reflections.unknown | Data reflections with currently unknown status | Yes | No | No
reflections.active | Currently active data reflections | Yes | No | No
reflections.refreshing | Data reflections currently refreshing or pending a refresh | Yes | No | No
rpc.bit.data.current | Maximum amount of memory used by all RPC connections | No | No | Yes
rpc.bit.data.peak | Peak amount of memory used by all RPC connections | No | No | Yes
rpc.data.current | Maximum amount of memory used by all RPC connections | No | No | Yes
rpc.failure_15m | RPC connection failures in 15 minute period | Yes | Yes | Yes
rpc.failure_1d | RPC connection failures in 1 day period | Yes | Yes | Yes
rpc.peers | Number of active peer connections | Yes | Yes | Yes
threads.blocked.count | Number of currently blocked threads | Yes | Yes | Yes
threads.count | Current number of active and idle threads | Yes | Yes | Yes
threads.daemon.count | Number of currently available active daemon threads | Yes | Yes | Yes
threads.deadlock.count | Number of currently deadlocked threads | Yes | Yes | Yes
threads.deadlocks | Collection of information about the currently deadlocked threads | Yes | Yes | Yes
threads.new.count | Current number of threads in new state (not yet started) | Yes | Yes | Yes
threads.runnable.count | Current number of threads in runnable state (executing) | Yes | Yes | Yes
threads.terminated.count | Current number of threads in terminated state (completed execution) | Yes | Yes | Yes
threads.timed_waiting.count | Current number of threads in the timed_waiting state | Yes | Yes | Yes
threads.waiting.count | Current number of threads in the waiting state | Yes | Yes | Yes

Accessing Node Metrics

You can access your Dremio node metrics using jconsole or another Java Agent that collects JMX metrics.

To access node metrics with jconsole:

  1. Run the following command on a Dremio host machine:
    jconsole
    
  2. In the JConsole: New Connection modal, double-click com.dremio.dac.daemon.DremioDaemon.

  3. In the Java Monitoring & Management Console, click the MBeans tab.

  4. Click metrics, then click gauges.
    [Image: JMX Metrics]

  5. Click the metric that you want to view.

  6. Click Attributes, then click Value.

    jconsole displays the value for the selected metric. The following image indicates there is one active data reflection.
    [Image: reflections.active]

