Managing Workloads (Enterprise Edition)

[info] PREVIEW FEATURE

The Workload Management (WLM) feature provides the capability to manage cluster resources and workloads.

This is accomplished by defining queues with specific characteristics (such as memory limits, CPU priority, and queueing and runtime timeouts) and then defining rules that specify which queries are assigned to which queues.

This capability is particularly important in Dremio clusters that are deployed in multi-tenant environments with a variety of workloads ranging from exploratory queries to scheduled reporting queries. In particular, WLM provides the following:

  • Workload isolation and predictability for users and groups.
  • Ease of configuration for SLA and workload management.
  • Predictable and efficient utilization of cluster resources.

Enabling WLM Preview

To request access to this feature and have it enabled, send an email to preview@dremio.com.

[info] In this Preview, when you enable WLM for the first time, your legacy queue settings are imported to create WLM queues and rules. This import happens only once, the first time WLM is enabled.

For example, your legacy queue settings will not be re-imported if you do the following:

  1. Disable WLM.
  2. Make changes to your legacy queue settings.
  3. Re-enable WLM.

To access WLM, navigate to the following UI location:

Admin > Workload Management > Queues | Rules

Homogeneous Environment

Workload management works best in an environment that is homogeneous in terms of memory, that is, when each node in the Dremio cluster has the same amount of memory.

If you have a heterogeneous environment (nodes in a Dremio cluster have different amounts of memory), you should plan for the lowest common denominator (the smallest amount of memory on any node).
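
As a rough illustration of planning for the lowest common denominator, the sketch below takes a made-up inventory of per-node memory and derives the figure to plan against. The node names, sizes, and helper functions are hypothetical and are not values or APIs provided by Dremio.

```python
# Hypothetical per-node memory in GB; names and sizes are illustrative only.
node_memory_gb = {"executor-1": 128, "executor-2": 128, "executor-3": 64}


def planning_memory_gb(memory_by_node):
    """Return the smallest per-node memory, the value to plan limits against."""
    return min(memory_by_node.values())


def is_homogeneous(memory_by_node):
    """True when every node has the same amount of memory."""
    return len(set(memory_by_node.values())) == 1


print(planning_memory_gb(node_memory_gb))  # 64
print(is_homogeneous(node_memory_gb))      # False
```

In this example the cluster is heterogeneous, so memory limits would be planned against 64 GB rather than 128 GB.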

How Workload Management Works

You set up workload management by:

  1. Creating queues with different criteria depending on how you want to manage your jobs.
  2. Creating rules that do the following:
    • Establish conditions that target specific queries.
    • Assign queries that fit the conditions to specific queues.

The following diagram shows a basic WLM flow when a job query is submitted where Rule1 assigns job queries to Queue1, Rule2 to Queue2, Rule3 to Queue3, and so on.

The rule conditions that a job query matches determine which queue the job query is sent to.

Queues

Dremio allows you to define job queues to which queries can be assigned based on defined rules.

Configuration Options

Category | Option | Description
Overview | Name | Name of the job queue.
Overview | CPU Priority | Relative CPU time a job receives compared to other CPU priorities. Priority levels: Critical, High, Medium (default), Low, Background.
Overview | Concurrency Limit | Limits the number of concurrent queries running in the queue. Enabled by default; set to 10 queries.
Memory Management | Job Memory Reservation (Experimental) | Reserves memory per node on a per-job basis. Disabled by default.
Memory Management | Queue Memory Limit | Limits memory per node allocated on a per-queue basis. Disabled by default.
Memory Management | Job Memory Limit | Limits memory per node allocated on a per-job basis. Disabled by default.
Time Limits | Enqueued Time Limit | Limits the enqueued time, after which the query is cancelled. Enabled by default; set to 5 minutes.
Time Limits | Query Runtime Limit | Limits the query runtime, after which the query is cancelled. Disabled by default.

Default Queues

Dremio provides the following generic queues as a starting point for customization:

Queue Name | CPU Priority | Concurrency Limit
UI Previews | High | 50
Low Cost User Queries | Medium | 25
High Cost User Queries | Medium | 5
Low Cost Reflections | Low | 25
High Cost Reflections | Low | 1

Queue and Job Limits

If you set concurrency limits, make sure that you set job memory limits accordingly. For example, if a queue allows multiple concurrent jobs and the combined job memory limits exceed the queue memory limit, a job may fail because the queue's memory has been consumed by other concurrent jobs.
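
The relationship between the concurrency limit, the job memory limit, and the queue memory limit can be checked with simple arithmetic. The following is a minimal sketch with hypothetical function names and numbers; it assumes all limits are expressed per node in the same unit, and it is not how Dremio itself enforces limits.

```python
def worst_case_job_memory_gb(concurrency_limit, job_memory_limit_gb):
    """Per-node memory needed if every concurrent job reaches its job limit."""
    return concurrency_limit * job_memory_limit_gb


def queue_can_be_oversubscribed(concurrency_limit, job_memory_limit_gb,
                                queue_memory_limit_gb):
    """True when concurrent jobs at their limits would exceed the queue limit."""
    worst_case = worst_case_job_memory_gb(concurrency_limit, job_memory_limit_gb)
    return worst_case > queue_memory_limit_gb


# Hypothetical numbers: 5 concurrent jobs, 8 GB per job, 32 GB per queue.
print(worst_case_job_memory_gb(5, 8))         # 40
print(queue_can_be_oversubscribed(5, 8, 32))  # True
print(queue_can_be_oversubscribed(4, 8, 32))  # False
```

When the check returns True, a job can fail even though it stays under its own limit, because the other concurrent jobs may already hold most of the queue's memory.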

Rules

Rules specify conditions that the scheduling module uses to either assign a query to a queue or reject it. Rules are applied in order, and the first rule that a query matches determines the action taken.
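
The first-match behavior can be pictured with a small model. The sketch below is not Dremio's scheduler; the `Query`, `Rule`, and `route` names, the queue names, and the cost threshold are all hypothetical and exist only to show that rule order matters when a query satisfies more than one condition.

```python
from typing import Callable, NamedTuple, Optional


class Query(NamedTuple):
    user: str
    query_type: str
    cost: float


class Rule(NamedTuple):
    name: str
    condition: Callable[[Query], bool]
    queue: Optional[str]  # None models "reject the query"


def route(query, rules):
    """Return (rule name, queue) for the first rule whose condition matches."""
    for rule in rules:
        if rule.condition(query):
            return rule.name, rule.queue
    return None, None  # no rule matched


rules = [
    Rule("previews", lambda q: q.query_type == "UI Preview", "Queue1"),
    Rule("cheap", lambda q: q.cost < 1_000_000, "Queue2"),
    Rule("everything else", lambda q: True, None),
]

# A cheap UI Preview satisfies both of the first two rules; the earlier one wins.
print(route(Query("JRyan", "UI Preview", 10), rules))   # ('previews', 'Queue1')
print(route(Query("JRyan", "ODBC", 10), rules))         # ('cheap', 'Queue2')
print(route(Query("JRyan", "ODBC", 5_000_000), rules))  # ('everything else', None)
```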

You can create custom rules, use the provided rule templates, or modify the provided rule templates.

Conditions Used in Rules

The following conditions, as well as Dremio SQL functions, can be used in combination when creating rules to target specific jobs.

  • users
  • group membership (members)
  • query type
  • query cost - to find the cost of a query, navigate to:
    Job Profile > Resource Allocation > Query Cost
  • other functions that are supported in SQL, such as date/time functions

Query Types

Query Type Description
JDBC Jobs submitted from JDBC.
ODBC Jobs submitted from ODBC.
REST Jobs submitted from REST.
Reflections Reflection creation and maintenance jobs.
UI Run Full query runs in the UI.
UI Preview Preview queries in the UI.
UI Download Download queries in the UI.
Internal Preview Dataset formatting previews. Reflection recommender analysis queries.
Internal Run Node activity query. Data curation histogram and transformation suggestion queries.

Rule Examples

User Example

USER IN ('JRyan', 'PDirk', 'CPhillips')

Group Membership Example

is_member('MarketingOps') OR is_member('Engineering')

Query Type Example

query_type() IN ('JDBC', 'ODBC', 'UI Run')

Query Plan Cost Example

query_cost() > 1000000

Combined Conditions Example

The following example targets a job by a combination of user, group membership, query type, query cost, and time of day.

(
  USER IN ('JRyan', 'PDirk', 'CPhillips')
  OR is_member('superadmins')
)
AND query_type() IN ('ODBC')
AND query_cost() > 3000000
AND EXTRACT(HOUR FROM CURRENT_TIME)
  BETWEEN 9 AND 18
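
One detail of the time-of-day condition is easy to miss: in SQL, BETWEEN is inclusive on both ends, so an hour value of 18 still matches. The sketch below mirrors that comparison for a few sample times; `in_rule_hours` is a hypothetical helper, and the sketch says nothing about which time zone CURRENT_TIME uses.

```python
from datetime import time


def in_rule_hours(t, start_hour=9, end_hour=18):
    """Mirror EXTRACT(HOUR FROM t) BETWEEN start_hour AND end_hour (inclusive)."""
    return start_hour <= t.hour <= end_hour


print(in_rule_hours(time(8, 59)))   # False
print(in_rule_hours(time(9, 0)))    # True
print(in_rule_hours(time(18, 59)))  # True: hour 18 is still inside the range
print(in_rule_hours(time(19, 0)))   # False
```

Under this reading, the condition matches from 09:00 through 18:59. To stop matching at 18:00, the upper bound would be 17.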

Default Rules

Dremio provides the following generic rules as a starting point for customization.

Order | Rule Name | Available Query Types | Rule | Queue Name
1 | UI Previews | UI Preview | query_type() = 'UI Preview' | UI Previews
2 | Low Cost User Queries | JDBC, ODBC, REST, UI Run, UI Download, Internal Preview, Internal Run | query_cost() < 30000000 | Low Cost User Queries
3 | High Cost User Queries | JDBC, ODBC, REST, UI Run, UI Download, Internal Preview, Internal Run | query_cost() >= 30000000 | High Cost User Queries
4 | Low Cost Reflections | Reflections | query_type() = 'Reflections' AND query_cost() < 30000000 | Low Cost Reflections
5 | High Cost Reflections | Reflections | query_type() = 'Reflections' AND query_cost() >= 30000000 | High Cost Reflections
6 | All Other Queries | | | Reject

[important]

The default setup covers all query types. When setting your own custom rules, ensure that all query types are taken into account. Otherwise, you may experience unexpected behavior when using Dremio.
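
One way to sanity-check a custom rule set is to list the query types each rule targets and look for types that no rule covers. The sketch below does this for the default rules described in this section. The data structures and the `uncovered_types` helper are hypothetical, and the check looks only at query types, not at gaps in cost ranges.

```python
ALL_QUERY_TYPES = {
    "JDBC", "ODBC", "REST", "Reflections", "UI Run",
    "UI Preview", "UI Download", "Internal Preview", "Internal Run",
}

USER_TYPES = {
    "JDBC", "ODBC", "REST", "UI Run", "UI Download",
    "Internal Preview", "Internal Run",
}

# Query types targeted by each default rule that assigns to a queue.
DEFAULT_RULE_TYPES = {
    "UI Previews": {"UI Preview"},
    "Low Cost User Queries": USER_TYPES,
    "High Cost User Queries": USER_TYPES,
    "Low Cost Reflections": {"Reflections"},
    "High Cost Reflections": {"Reflections"},
}


def uncovered_types(rule_types, all_types=ALL_QUERY_TYPES):
    """Return query types that no queue-assigning rule targets."""
    covered = set().union(*rule_types.values())
    return all_types - covered


print(uncovered_types(DEFAULT_RULE_TYPES))  # set()

# A hypothetical custom setup that only handles ODBC leaves eight types uncovered.
custom_rule_types = {"Only ODBC": {"ODBC"}}
print(len(uncovered_types(custom_rule_types)))  # 8
```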

Limitations

Queue Modification

Once a queue has been defined, it cannot be modified. If you want to change the definition of a queue, you need to do the following:

  1. Change the rules so that jobs cannot be assigned to the old queue.
  2. Ensure that no jobs are running using the old queue.
  3. Delete the old queue. Note that if jobs are running using the old queue, the queue can't be deleted.
  4. Create a new queue to assign new jobs to.

Job Memory Reservations

Job memory reservation is an experimental feature and may be deprecated in Workload Management GA. Be sure to plan appropriately for your queue and job limits.

  • If you set job memory reservations for your jobs, then the additional memory resources are allocated for each concurrent job. This reservation amount must be taken into account when planning your queue and job memory limits.

    For example, if you have a queue with 10 concurrent jobs and 10 GB memory reserved per job, this queue will always reserve 100 GB of cluster capacity.

  • If you use job memory reservations, be aware that if a job is cancelled, that job's fragments may not be cancelled immediately. Until that job has completely stopped, its memory resource allocation may not be available for a subsequent job. Thus, subsequent jobs may time out and fail because the job memory reservation could not be reallocated to the new job.
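
The reservation arithmetic in the example above (concurrency limit multiplied by the per-job reservation) extends naturally to several queues. The sketch below uses hypothetical queue names and numbers, and the helper functions are not part of Dremio; it only adds up what a set of queues would hold back.

```python
def reserved_memory_gb(concurrency_limit, reservation_per_job_gb):
    """Memory held back when every concurrent slot carries a reservation."""
    return concurrency_limit * reservation_per_job_gb


def total_reserved_gb(queues):
    """Sum the reserved memory over a list of queue definitions."""
    return sum(
        reserved_memory_gb(q["concurrency_limit"], q["reservation_gb"])
        for q in queues
    )


# Hypothetical queues; the first one matches the 10 jobs x 10 GB example.
queues = [
    {"name": "reporting", "concurrency_limit": 10, "reservation_gb": 10},
    {"name": "adhoc", "concurrency_limit": 5, "reservation_gb": 4},
]

print(reserved_memory_gb(10, 10))  # 100
print(total_reserved_gb(queues))   # 120
```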

