Hash Aggregation Spilling
Dremio supports spilling for hash aggregation. When memory limits are reached, Dremio spills data to disk as necessary so that the query can complete within the memory envelope under which Dremio operates.
This feature is useful for the following:
- Memory-intensive hash aggregation queries (GROUP BY queries) that process large datasets.
- Environments with limited memory.
The hash aggregation operator needs a bare minimum amount of memory to even start processing the first batch of data. As long as the memory given to hash aggregation is equal to or greater than this minimum, Dremio can complete the query.
If the minimum memory is unavailable when the query is submitted, the job fails immediately (before the job starts) with the error "Failed to preallocate memory for single batch in partitions". This message is shown in the exception stack details.
[info] Based on the amount of system memory that you have provisioned, Dremio automatically determines the memory envelope for memory-intensive operators that can spill.
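To illustrate the general technique (this is a sketch, not Dremio's actual implementation), a spilling hash aggregation can be modeled as partial aggregates held in hash tables split across partitions: when the memory budget is exceeded, the largest partition is written to disk and its memory reclaimed, and spilled partial aggregates are merged back at the end. The class name, the row-count budget, and the pickle-based spill format below are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

class SpillingHashAggregator:
    """Illustrative sketch (not Dremio internals) of a hash aggregator
    that spills partitions to disk when a memory budget is exceeded.
    Computes SUM(value) per key."""

    def __init__(self, memory_limit_rows, num_partitions=4):
        self.memory_limit_rows = memory_limit_rows  # crude stand-in for the memory envelope
        self.partitions = [dict() for _ in range(num_partitions)]
        self.spill_files = [[] for _ in range(num_partitions)]
        self.in_memory_rows = 0

    def _partition_of(self, key):
        return hash(key) % len(self.partitions)

    def add(self, key, value):
        p = self._partition_of(key)
        # If inserting a new key would exceed the budget, spill the
        # largest partition first to reclaim memory.
        if key not in self.partitions[p] and self.in_memory_rows >= self.memory_limit_rows:
            largest = max(range(len(self.partitions)),
                          key=lambda i: len(self.partitions[i]))
            self._spill(largest)
        part = self.partitions[p]
        if key in part:
            part[key] += value
        else:
            part[key] = value
            self.in_memory_rows += 1

    def _spill(self, p):
        # Write the partition's partial aggregates to disk and reset it.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.partitions[p], f)
        self.spill_files[p].append(path)
        self.in_memory_rows -= len(self.partitions[p])
        self.partitions[p] = {}

    def results(self):
        # Merge spilled partial aggregates back, one partition at a time.
        out = {}
        for p, part in enumerate(self.partitions):
            merged = dict(part)
            for path in self.spill_files[p]:
                with open(path, "rb") as f:
                    for k, v in pickle.load(f).items():
                        merged[k] = merged.get(k, 0) + v
                os.remove(path)
            out.update(merged)
        return out
```

A real operator spills at batch granularity and may re-aggregate spilled data in multiple passes; this sketch shows only the spill-and-merge cycle.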
Enabling/Disabling Hash Aggregation Spilling
As of Dremio 3.1.3, hash aggregation spilling is enabled by default. In previous Dremio releases, it was disabled by default.
To fine-tune hash aggregation spilling performance, contact Dremio Support at email@example.com.
To disable hash aggregation spilling, navigate to Admin > Cluster > Support > Support Keys, then:
- Add the `exec.operator.aggregate.vectorize.use_spilling_operator` key and click Show.
- When the support key displays, toggle the switch and click Save.
To re-enable, click Reset for the `exec.operator.aggregate.vectorize.use_spilling_operator` support key.
Hash aggregation queries complete successfully regardless of the amount of data sent to the operator, as long as:
- The bare minimum amount of memory that the hash aggregation operator needs to start processing the first batch of data is available (which is highly likely).
- There is sufficient disk space available for the spilled data.
- The worst-case row of the incoming data can be stored within the initial memory allocation.
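The minimum-memory condition above amounts to a fail-fast preallocation check: before processing begins, the operator must be able to preallocate one batch per partition, sized for the worst-case row. The function and parameter names below are assumptions for illustration, not Dremio internals.

```python
def check_minimum_memory(available_bytes, num_partitions,
                         batch_rows, worst_case_row_bytes):
    """Sketch of a fail-fast preallocation check: require enough memory
    for one batch per partition, sized for the worst-case row.
    Returns the required byte count, or raises before any work starts."""
    required = num_partitions * batch_rows * worst_case_row_bytes
    if available_bytes < required:
        # Mirrors the fail-fast behavior described above: the job
        # fails before it starts rather than mid-query.
        raise MemoryError(
            "Failed to preallocate memory for single batch in partitions: "
            f"need {required} bytes, have {available_bytes}")
    return required
```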
[info] For hash aggregation spilling information, go to Jobs > Details or Jobs > Profile > HASH_AGGREGATE_OPERATOR > Operator Metrics.