Version: current [24.3.x]

Initial Workload Management Settings

When Dremio is first installed, no guardrails are in place to restrict how much of the total available memory any one queue, or any one query in a queue, can consume. This leaves Dremio open to out-of-memory failures if a user issues a large query that requires more memory than is available on the Dremio executors. In addition, the default queue concurrency limits are relatively high, so many smaller queries (or up to 10 large ones) running together could collectively consume more memory than the executors have available.

This document provides recommendations for setting up workload management queue and query memory limits and queue concurrency limits immediately after Dremio is installed, so that Dremio remains operational under memory pressure. It covers installations configured with and without engines.

tip

This document assumes Dremio has been installed and is running.

Configure Workload Management for a New Installation

The following sections recommend initial queue and query memory limits based on whether Dremio is configured to use multiple engines.

No Engines or Single Engine Configuration

For Dremio installations where no engines, or only a single engine, are configured, and all queries are therefore routed to the same set of executors, it is important to set queue and query memory limits and sensible concurrency limits to prevent rogue queries from bringing down executors unnecessarily. It is far better for Dremio to identify and cancel a single query because it exceeds the configured memory limits than to let that query run, cause an out-of-memory condition on an executor, and thereby fail every query that executor is handling.

The default queue settings for an out-of-the-box Dremio install are shown below; notice that no limits are set:

One important value to note is the Average node memory, which in this example is 16384 (16GB). This value is the maximum amount of direct memory available on any one executor. The value in this example is low: Dremio recommends 64GB or 128GB node sizes for executors, which, after allocating memory to the OS kernel and to heap, typically leaves 52GB or 112-116GB of direct memory per executor.

Regarding the Queue Memory Limit per Node settings, the most important step after the initial installation is to ensure every queue has a limit set. Even setting the value on every queue to 90-95% of the Average node memory significantly reduces the chance of nodes losing communication with ZooKeeper, and prevents executors from crashing on memory-intensive queries, because a small reserve of memory always remains available to keep the ZooKeeper connection alive.

As a rule of thumb, the queue memory limit per node summed across the Low and High Cost User Queries queues should not exceed 120% of the Average node memory value. We allow this sum to exceed 100% because it is fairly unlikely that both queues will hit their maximum memory usage at exactly the same time, so some degree of overlap is acceptable.
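The rule of thumb can be expressed as a quick sanity check. This is an illustrative sketch: the figures are the 16GB example used in this document, not output from a live cluster.

```python
# Illustrative check of the 120% rule of thumb for the user-query queues.
avg_node_memory_gb = 16  # the Average node memory from the example above

user_query_queue_limits_gb = {
    "High Cost User Queries": 12,  # 0.75 x average node memory
    "Low Cost User Queries": 7,    # 0.4375 x average node memory
}

total_gb = sum(user_query_queue_limits_gb.values())
# The sum may exceed 100% of node memory, but should stay under 120%.
assert total_gb <= 1.2 * avg_node_memory_gb  # 19GB is within the 19.2GB budget
```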

The low- and high-cost reflections queue memory limits should be set to at most the same values as the queue memory limits for the low- and high-cost user queries. Reflections typically run far less frequently than other query types and are often scheduled outside normal working hours, so again the sum of these values may exceed the Average node memory value.

However, if after making changes to conform to the rule of thumb above you find too many queries failing because a particular queue runs out of memory, it is safe to increase the amount of memory allocated to that queue; just never go beyond 95% of the Average node memory on any one queue.

In terms of job memory limits, for high-cost user queries we want to give Dremio the opportunity to execute the biggest queries, so we let the biggest job consume up to approximately 50-70% of the total memory available, depending on the Average node memory setting. Low-cost user queries typically consume far less memory; for these, set at most a job memory limit of 50% of the queue memory limit or 5GB, whichever is lower.

For UI Previews, Dremio recommends setting both the queue and job memory limits to the maximum memory allocated to a job in the High Cost User Queries queue. It is highly unlikely that these limits will ever be reached, but they provide guardrails in case they are.

Regarding concurrency limits, Dremio recommends the following initial settings, regardless of the memory settings:

| Queue Name | Max Concurrency Limit |
| --- | --- |
| High Cost Reflections | 1 |
| High Cost User Queries | 5 |
| Low Cost Reflections | 25 |
| Low Cost User Queries | 25 |
| UI Previews | 50 |

The following table summarizes the rules discussed above:

Table 1: Rules for Zero Engines or Single Engine

| Queue Name | Max Concurrency Limit | Queue Memory Limit per Node | Job Memory Limit per Node |
| --- | --- | --- | --- |
| High Cost Reflections | 2 | 0.75 x Average Node Memory | 0.5 x Average Node Memory |
| High Cost User Queries | 3 | 0.75 x Average Node Memory | 0.5 x Average Node Memory |
| Low Cost Reflections | 5 | 0.4375 x Average Node Memory | 0.1875 x Average Node Memory* |
| Low Cost User Queries | 20 | 0.4375 x Average Node Memory | 0.1875 x Average Node Memory* |
| UI Previews | 100 | 0.5 x Average Node Memory | 0.5 x Average Node Memory |

* 0.1875 x Average Node Memory or 5GB, whichever is lower
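If you compute these limits in a script, the Table 1 rules can be sketched as below. The helper name and return shape are hypothetical, not part of any Dremio API.

```python
# Hypothetical helper implementing the Table 1 rules for a zero- or
# single-engine install; all values are in GB.
def single_engine_limits(avg_node_memory_gb: float) -> dict:
    # Low-cost job limit: 0.1875 x average node memory or 5GB, whichever is lower.
    low_cost_job_gb = min(0.1875 * avg_node_memory_gb, 5)
    return {
        "High Cost Reflections":  {"concurrency": 2,   "queue_gb": 0.75 * avg_node_memory_gb,   "job_gb": 0.5 * avg_node_memory_gb},
        "High Cost User Queries": {"concurrency": 3,   "queue_gb": 0.75 * avg_node_memory_gb,   "job_gb": 0.5 * avg_node_memory_gb},
        "Low Cost Reflections":   {"concurrency": 5,   "queue_gb": 0.4375 * avg_node_memory_gb, "job_gb": low_cost_job_gb},
        "Low Cost User Queries":  {"concurrency": 20,  "queue_gb": 0.4375 * avg_node_memory_gb, "job_gb": low_cost_job_gb},
        "UI Previews":            {"concurrency": 100, "queue_gb": 0.5 * avg_node_memory_gb,    "job_gb": 0.5 * avg_node_memory_gb},
    }
```

With 16GB of average node memory this reproduces the first example table below: the high-cost queues get 12GB/8GB, the low-cost queues 7GB/3GB.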

The following sections provide examples of sensible memory settings based on various Average node memory values.

Average Node Memory = 16384 (16GB)

| Queue Name | Max Concurrency Limit | Queue Memory Limit per Node | Job Memory Limit per Node |
| --- | --- | --- | --- |
| High Cost Reflections | 2 | 12GB | 8GB |
| High Cost User Queries | 3 | 12GB | 8GB |
| Low Cost Reflections | 5 | 7GB | 3GB |
| Low Cost User Queries | 20 | 7GB | 3GB |
| UI Previews | 100 | 8GB | 8GB |

Average Node Memory = 32768 (32GB)

| Queue Name | Max Concurrency Limit | Queue Memory Limit per Node | Job Memory Limit per Node |
| --- | --- | --- | --- |
| High Cost Reflections | 2 | 24GB | 16GB |
| High Cost User Queries | 3 | 24GB | 16GB |
| Low Cost Reflections | 5 | 14GB | 5GB |
| Low Cost User Queries | 20 | 14GB | 5GB |
| UI Previews | 100 | 16GB | 16GB |

Average Node Memory = 53248 (52GB)

| Queue Name | Max Concurrency Limit | Queue Memory Limit per Node | Job Memory Limit per Node |
| --- | --- | --- | --- |
| High Cost Reflections | 2 | 40GB | 25GB |
| High Cost User Queries | 3 | 40GB | 25GB |
| Low Cost Reflections | 5 | 20GB | 5GB |
| Low Cost User Queries | 20 | 20GB | 5GB |
| UI Previews | 100 | 25GB | 25GB |

Average Node Memory = 114688 (112GB)

| Queue Name | Max Concurrency Limit | Queue Memory Limit per Node | Job Memory Limit per Node |
| --- | --- | --- | --- |
| High Cost Reflections | 2 | 84GB | 56GB |
| High Cost User Queries | 3 | 84GB | 56GB |
| Low Cost Reflections | 5 | 50GB | 5GB |
| Low Cost User Queries | 20 | 50GB | 5GB |
| UI Previews | 100 | 56GB | 56GB |

Multi-Engine Configuration (AWSE or Kubernetes)

For Dremio installations on AWSE or Kubernetes where multiple engines are configured, we need to know 1) how many nodes are in each engine and 2) whether reflections will be serviced by a dedicated engine or by all nodes across all engines. In the example below we assume reflections are NOT serviced by a dedicated engine.

We also assume that there is a 1-to-1 mapping between a query queue and an engine.

The reason for subtracting 5GB in the calculations below is to ensure that when reflections run, a portion of each node in the engine is never consumed by the queries running on that engine, so some free memory is always available on each node to service reflections.

Table 2: Rules for Multiple Engines

| Queue Name | Max Concurrency Limit | Queue Memory Limit per Node | Job Memory Limit per Node |
| --- | --- | --- | --- |
| High Cost Reflections | 2 | Average Node Memory / 2 | Average Node Memory / 4 |
| High Cost User Queries | 3 | (Average Node Memory - 5GB) / 2 | (Average Node Memory - 5GB) / 4 |
| Low Cost Reflections | 5 | Average Node Memory / 3 | 0.1875 x Average Node Memory* |
| Low Cost User Queries | 20 | (Average Node Memory - 5GB) / 3 | 0.1875 x Average Node Memory* |
| UI Previews | 100 | (Average Node Memory - 5GB) / 2 | (Average Node Memory - 5GB) / 2 |

* 0.1875 x Average Node Memory or 5GB, whichever is lower

** Average Direct memory needs to be computed per engine

For cases where a dedicated engine has been provisioned for reflections, Table 1 should be followed.
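The Table 2 formulas can likewise be sketched as a script helper. As before, the function name and return shape are illustrative only, and the average direct memory must be computed per engine.

```python
# Hypothetical helper implementing the Table 2 rules (multiple engines,
# reflections not on a dedicated engine); all values are in GB.
def multi_engine_limits(avg_node_memory_gb: float) -> dict:
    m = avg_node_memory_gb
    # Low-cost job limit: 0.1875 x average node memory or 5GB, whichever is lower.
    low_cost_job_gb = min(0.1875 * m, 5)
    # The 5GB subtraction reserves headroom on each node for reflections.
    return {
        "High Cost Reflections":  {"concurrency": 2,   "queue_gb": m / 2,       "job_gb": m / 4},
        "High Cost User Queries": {"concurrency": 3,   "queue_gb": (m - 5) / 2, "job_gb": (m - 5) / 4},
        "Low Cost Reflections":   {"concurrency": 5,   "queue_gb": m / 3,       "job_gb": low_cost_job_gb},
        "Low Cost User Queries":  {"concurrency": 20,  "queue_gb": (m - 5) / 3, "job_gb": low_cost_job_gb},
        "UI Previews":            {"concurrency": 100, "queue_gb": (m - 5) / 2, "job_gb": (m - 5) / 2},
    }
```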

Configuring Queues Programmatically

A newly installed Dremio cluster comes with five queues with default configurations. Dremio provides APIs to update the memory configuration of queues. An approach to programmatically configuring queues based on cluster size is outlined below.

  • Dremio requires an auth token for any API call. The token can be obtained with a POST to /apiv2/login. More details can be found here: Dremio Login
  • Next, calculate the average direct memory per engine by querying the Dremio system tables sys.nodes and sys.memory. You can submit the query via API or JDBC.
  • From the average direct memory, compute the following parameters for each queue (use Tables 1 and 2 above, depending on your configuration):
    • Max Concurrency Limit
    • Queue Memory Limit per Node
    • Job Memory Limit per Node
  • Dremio generates a unique Queue ID for each queue. Get the current queue configurations using GET /api/v3/wlm/queue. Additional details can be found in Dremio All Queues.
  • For each queue, update Dremio with the modified configuration using PUT /api/v3/wlm/queue/{id}. Additional details can be found in Update Queue.
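The steps above can be sketched as follows. The endpoint paths (/apiv2/login, /api/v3/wlm/queue) come from this document; the coordinator address, credentials, and the queue payload field names (maxMemoryPerNodeBytes, maxQueryMemoryPerNodeBytes) are assumptions that should be verified against the API reference for your Dremio version.

```python
# Sketch of programmatic queue configuration; payload field names are
# assumptions -- verify them for your Dremio version before use.
import json
import urllib.request

BASE_URL = "http://dremio-coordinator:9047"  # assumed coordinator address

def gb_to_bytes(gb: float) -> int:
    """Convert a GB figure from Tables 1 & 2 into bytes for an API payload."""
    return int(gb * 1024 ** 3)

def _call(url, method="GET", payload=None, headers=None):
    """Minimal JSON-over-HTTP helper around urllib."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method,
                                 headers={"Content-Type": "application/json",
                                          **(headers or {})})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def login(base, user, password):
    """Obtain an auth token via POST /apiv2/login; Dremio expects the token
    prefixed with "_dremio" in the Authorization header."""
    token = _call(f"{base}/apiv2/login", "POST",
                  {"userName": user, "password": password})["token"]
    return {"Authorization": "_dremio" + token}

def update_queue_limits(base, headers, limits_gb):
    """GET every queue, then PUT back a modified configuration.

    limits_gb maps queue name -> (queue_limit_gb, job_limit_gb), e.g. the
    output of a Table 1 / Table 2 helper.
    """
    for q in _call(f"{base}/api/v3/wlm/queue", headers=headers)["data"]:
        if q.get("name") not in limits_gb:
            continue
        queue_gb, job_gb = limits_gb[q["name"]]
        q["maxMemoryPerNodeBytes"] = gb_to_bytes(queue_gb)     # assumed field name
        q["maxQueryMemoryPerNodeBytes"] = gb_to_bytes(job_gb)  # assumed field name
        _call(f"{base}/api/v3/wlm/queue/{q['id']}", "PUT", q, headers)
```

In practice you would call login(), compute limits_gb from the per-engine average direct memory, and then call update_queue_limits() once per engine's queue set.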