Version: 24.3.x

Monitoring Dremio

To get the most from Dremio and to identify and resolve issues before they affect a broader set of workloads, deploy a monitoring solution that tracks overall cluster health and performance. Dremio provides a large set of metrics that a monitoring solution can use alongside infrastructure-related metrics such as CPU and memory.

Dremio can be deployed in a variety of ways, so the collection and reporting of infrastructure-related metrics depends on your deployment model (physical or virtual host, Kubernetes, or cloud). You can collect the metrics that matter for monitoring via JMX, JDBC or ODBC, or the REST API. You can monitor Dremio directly, with open-source tools such as Prometheus and Grafana, or with commercial tools such as AppDynamics and Datadog. It is important to align the monitoring solution with your existing monitoring infrastructure.

JMX Metrics

Java Management Extensions (JMX) is a specification for monitoring and managing Java applications. Because Dremio is a Java application, it exposes a number of important metrics through JMX.

Note: For information about modifying dremio.conf to enable JMX metrics, see Enabling Node Metrics.

This configuration exposes JMX metrics at the URL http://<dremio host>:<port>/metrics. The following table lists important metrics, the Dremio node roles that support each one, and alert threshold details.

|Metric Name|Description|Master Coordinator|Secondary Coordinator|Executor|Alert Threshold (if any)|
|---|---|---|---|---|---|
|`jobs.active`|Currently active jobs|Yes|Yes|No| |
|`jobs.active_15m`|Number of jobs in 15-minute period|Yes|No|No| |
|`jobs.failed_15m`|Number of failed jobs in 15-minute period|Yes|No|No|WARN: 5% of total jobs<br>CRITICAL: 10% of total jobs|
|`jobs.queue.<queue_name>.waiting`|Number of current waiting jobs|Yes|No|No|WARN: > 0<br>CRITICAL: > 50% of allowed concurrency for the queue|
|`dremio.memory.direct_current`|Direct memory used by the execution engine|Yes|Yes|Yes|WARN: 90% of allocated value<br>CRITICAL: 95% of allocated value|
|`fragments.active`|Number of active query fragments (threads): Indicator on how starved Dremio is for CPU (monitor on executors)|Yes|Yes|Yes|WARN: 0.9 x number of CPU cores on executors for 5 min<br>CRITICAL: 1 x number of CPU cores on executors for 5 min|
|`gc.G1-Young-Generation.time` / `gc.G1-Young-Generation.count`<br>`gc.G1-Old-Generation.time` / `gc.G1-Old-Generation.count`|Time per GC event (for young and old generations), assuming default G1GC is used for garbage collection|Yes|Yes|Yes|WARN: > 10s<br>CRITICAL: > 1m|
|`memory.heap.usage`|Ratio of `memory.heap.used` to `memory.heap.max`|Yes|Yes|Yes|WARN: > 75%<br>CRITICAL: > 80%<br>Monitor on Coordinator. Coordinator’s JVM monitor automatically kills queries when heap utilization is > 85%.|
|`reflections.failed`|Currently failed data reflections|Yes|No|No|WARN: > 0<br>CRITICAL: > 10% of all reflections|
|`reflections.active`|Currently active data reflections|Yes|No|No| |
|`reflections.refreshing`|Data reflections currently refreshing or pending a refresh|Yes|No|No| |
|`rpc.failure_15m`|RPC connection failures in 15-minute period|Yes|Yes|Yes|WARN: seen on ~10% of available executors<br>CRITICAL: seen on >25% of available executors|
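
As an illustration of how the thresholds above might be applied, the following Python sketch evaluates three of them against a single metrics sample. It does not collect anything from Dremio: the `sample` dictionary is made-up data, and `classify` and `evaluate` are hypothetical helper names. It assumes that `memory.heap.usage` is a ratio between 0 and 1, that `jobs.active_15m` can stand in for "total jobs", that the GC `.time` metrics are cumulative milliseconds, and that all comparisons are strict; verify each assumption against your own cluster before alerting on it.

```python
from typing import Dict


def classify(value: float, warn: float, critical: float) -> str:
    """Map a higher-is-worse value onto OK / WARN / CRITICAL."""
    if value > critical:
        return "CRITICAL"
    if value > warn:
        return "WARN"
    return "OK"


def evaluate(sample: Dict[str, float]) -> Dict[str, str]:
    """Apply three of the table's thresholds to one metrics sample."""
    results = {}

    # Heap: WARN > 75%, CRITICAL > 80% (assumed to be a 0-1 ratio).
    results["heap"] = classify(sample["memory.heap.usage"], 0.75, 0.80)

    # Failed jobs as a share of jobs in the same 15-minute window.
    # Assumption: jobs.active_15m stands in for "total jobs".
    total = sample["jobs.active_15m"]
    failed_share = sample["jobs.failed_15m"] / total if total else 0.0
    results["failed_jobs"] = classify(failed_share, 0.05, 0.10)

    # Time per young-generation GC event: WARN > 10 s, CRITICAL > 1 min.
    # Assumption: the .time metric is cumulative milliseconds.
    count = sample["gc.G1-Young-Generation.count"]
    ms_per_gc = sample["gc.G1-Young-Generation.time"] / count if count else 0.0
    results["young_gc"] = classify(ms_per_gc, 10_000, 60_000)

    return results


# Made-up sample values, for illustration only.
sample = {
    "memory.heap.usage": 0.78,
    "jobs.active_15m": 200,
    "jobs.failed_15m": 30,
    "gc.G1-Young-Generation.time": 4_000,
    "gc.G1-Young-Generation.count": 80,
}
print(evaluate(sample))
# {'heap': 'WARN', 'failed_jobs': 'CRITICAL', 'young_gc': 'OK'}
```

In practice, a monitoring tool such as Prometheus would usually hold these rules; the sketch only shows the arithmetic behind them.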

API Metrics

The Dremio coordinator exposes the REST API on the web UI port, which is 9047 by default. For more information about how to connect, authenticate, and submit requests, see API Reference.

|API endpoint|Description|Expected Values|Alert Threshold (if any)|
|---|---|---|---|
|`GET /apiv2/server_status`|Coordinator status|“OK”|Anything other than “OK”, or a slow response, indicates that the coordinator is either down or unhealthy|
|`GET /api/v3/source`|Source status|“good”|Any status other than “good”|
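
The coordinator status check can be sketched in Python with only the standard library. This is a minimal example under stated assumptions, not a reference client: `interpret_status` and `check_coordinator` are hypothetical names, the 5-second "slow response" cutoff is an arbitrary placeholder because no figure is given above, and the response body is assumed to be the string OK, either bare or JSON-quoted. Pass whatever authentication headers your deployment requires, as described in the API Reference.

```python
import time
import urllib.request
from typing import Dict, Optional


def interpret_status(body: str, elapsed_s: float, slow_after_s: float = 5.0) -> str:
    """Return 'OK' only for a prompt "OK" reply; anything else is 'UNHEALTHY'."""
    # The body may arrive as a bare or JSON-quoted string, so strip both.
    text = body.strip().strip('"')
    if text != "OK" or elapsed_s > slow_after_s:
        return "UNHEALTHY"
    return "OK"


def check_coordinator(base_url: str,
                      headers: Optional[Dict[str, str]] = None,
                      timeout_s: float = 10.0) -> str:
    """Call GET /apiv2/server_status on the coordinator and classify the outcome."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/apiv2/server_status",
        headers=headers or {},
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            body = response.read().decode("utf-8")
    except OSError:
        # Refused connections, timeouts, and HTTP errors all count as unhealthy.
        return "UNHEALTHY"
    return interpret_status(body, time.monotonic() - start)
```

For example, `check_coordinator("http://<dremio host>:9047")` would probe a coordinator on the default web UI port.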

SQL Metrics

SQL commands can be executed using ODBC, JDBC, or REST interfaces.

Note: When you execute SQL with the REST API, the API call does not return query results. It returns only the ID of the submitted query.

|SQL|Description|Alert Threshold (if any)|
|---|---|---|
|`SELECT COUNT(*) FROM sys.memory`|Canary query. Can be executed against a small user dataset or a Dremio internal table. This query should return results within milliseconds. It is an indicator of overall cluster health.|WARN: > 30s<br>CRITICAL: > 1m|
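
One way to turn the canary query into an alert is to time it and compare the elapsed time with the thresholds above. The Python sketch below does this for any callable that runs the query; `time_canary` and `classify_canary` are hypothetical names, and how the query is actually submitted (ODBC, JDBC bridge, or REST) is left to the caller.

```python
import time
from typing import Callable


def time_canary(run_query: Callable[[], object],
                clock: Callable[[], float] = time.monotonic) -> float:
    """Run the canary query once and return the elapsed wall time in seconds."""
    start = clock()
    run_query()
    return clock() - start


def classify_canary(elapsed_s: float) -> str:
    """Apply the canary thresholds: WARN above 30 s, CRITICAL above 1 min."""
    if elapsed_s > 60:
        return "CRITICAL"
    if elapsed_s > 30:
        return "WARN"
    return "OK"
```

The `clock` parameter exists only so the timing can be exercised with a fake clock; in normal use, call `classify_canary(time_canary(run_query))`, where `run_query` executes the canary SQL and fetches its result.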

POSIX Metrics

These metrics track conditions outside of Dremio that affect its stability and performance.

|Command|Description|Alert Threshold (if any)|
|---|---|---|
|`df -h <directory where RocksDB is mounted>`|Catalog DB free space. Catalog DB must have free space to create reflections, update profiles, and run jobs.|WARN: > 80% used<br>CRITICAL: > 90% used|
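
If the free-space check is scripted rather than read from `df` output, the same thresholds can be applied with Python's standard library. This is a minimal sketch: `used_fraction` and `classify_disk` are hypothetical names, and the fraction it computes (used divided by total) can differ slightly from the Use% column that `df` prints, which excludes blocks reserved for the superuser.

```python
import shutil


def used_fraction(path: str) -> float:
    """Return the used share (0-1) of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def classify_disk(fraction_used: float) -> str:
    """Apply the catalog DB thresholds: WARN above 80% used, CRITICAL above 90%."""
    if fraction_used > 0.90:
        return "CRITICAL"
    if fraction_used > 0.80:
        return "WARN"
    return "OK"
```

Calling `classify_disk(used_fraction(path))` with the directory where the catalog DB is mounted gives the alert level for that volume.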