Hash Aggregation Spilling

Dremio implements hash aggregation spilling. When memory limits are reached, Dremio spills the data to disk as and when necessary. The query tries to complete within the memory envelope under which Dremio operates.

This feature is useful for the following:

  • Memory-intensive hash aggregation queries (GROUP BY queries) that process large datasets.
  • Limited memory environments

A bare minimum amount of memory is needed by the hash aggregation operator to even start processing the first batch of data. As long as the memory given to hash aggregation is equal to or more than the minimum, Dremio can complete the query.

If the minimum memory is unavailable when the query is submitted, the job fails immediately (prior to starting the job) with an "Error: Failed to preallocate memory for single batch in partitions" message. This error message is shown in the exception stack details.

[info] Based on the amount of system memory that you have provisioned, Dremio automatically determines the memory envelope for memory intensive operators that can spill.

Enabling/Disabling Hash Aggregation Spilling

As of Dremio 3.1.3, Hash aggregation spilling is enabled by default. In previous Dremio releases, hash aggregation spilling is disabled by default.

[info] Performance

To fine-tune hash aggregation spilling performance, Contact Dremio Support at support@dremio.com.

To disable, navigate to Admin > Cluster > Support > Support Keys

  1. Add the exec.operator.aggregate.vectorize.use_spilling_operator key, click Show.
  2. When the support key displays, toggle the switch and click Save.

To re-enable, click Reset for the exec.operator.aggregate.vectorize.use_spilling_operator key.

Limitations

Hash aggregation queries complete successfully regardless of the amount of data being sent to the operator, as long as:

  • There is a minimal amount of memory available to the hash aggregation operator. Note that it is highly likely that a bare minimum of memory will be available.
  • There is sufficient disk space available for the spilled data.
  • The worst-case row of the incoming data can be stored within this initial memory allocation.

[info] For hash aggregation spilling information, go to Jobs > Details or Jobs > Profile > HASH_AGGREGATE_OPERATOR > Operator Metrics.


results matching ""

    No results matching ""