Optimizing Data Reflections
Based on query patterns and properties of the underlying dataset, Data Reflections can be further optimized by specifying partitioning, sorting, and distribution. This can be configured when defining accelerations in the UI or by using SQL commands.
These optimizations can be used with both Raw and Aggregation Reflections.
Partitioning
Data Reflections can be partitioned on one or more columns. When partitioning is specified, Dremio creates multiple files based on the partitioning configuration.
Low-cardinality fields are ideal for partitioning (e.g., Day-Month-Year). Ideally, the overall cardinality should be less than 10,000 values; fewer partitions are preferred.
Dremio optimizes performance by pruning partitions when a query has a filter on a partitioned column.
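The idea behind partition pruning can be illustrated with a small, self-contained Python sketch. This is a conceptual model only, not Dremio's implementation: the functions `partition_rows` and `query_with_pruning`, the `day` and `amount` columns, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows into one bucket (standing in for a file) per distinct value of `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def query_with_pruning(partitions, wanted_value):
    """Return the matching rows and the number of partitions actually read."""
    # Only partitions whose key matches the filter are read; all others are pruned.
    scanned = [key for key in partitions if key == wanted_value]
    rows = [row for key in scanned for row in partitions[key]]
    return rows, len(scanned)


# Hypothetical sample data, partitioned on a low-cardinality "day" column.
rows = [
    {"day": "2020-01-01", "amount": 10},
    {"day": "2020-01-01", "amount": 25},
    {"day": "2020-01-02", "amount": 7},
    {"day": "2020-01-03", "amount": 42},
]
partitions = partition_rows(rows, "day")
matches, partitions_read = query_with_pruning(partitions, "2020-01-02")
```

In this toy example, a filter on `day` touches one of the three partitions. The sketch also shows why cardinality matters: each distinct value becomes its own bucket, so a high-cardinality column would produce a very large number of small files.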
Sorting
Data Reflections can be locally sorted on one or more columns. Sorting ensures that the records are sorted within each node and partition (if any).
Sorting is especially useful for range queries and filters. When sorting is enabled, Dremio skips over large blocks of records during query execution based on filters on the sorted columns.
Dremio recommends sorting on single fields only.
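As a rough illustration of why sorting helps range filters, the following Python sketch stores a minimum and maximum value per block of sorted records and skips blocks that cannot match. It assumes a simple min/max scheme for illustration and is not a description of Dremio internals. The names `build_blocks` and `range_query` and the block size are hypothetical.

```python
def build_blocks(sorted_values, block_size):
    """Split already-sorted values into blocks, recording each block's min and max."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks


def range_query(blocks, low, high):
    """Return values in [low, high] and the number of blocks skipped entirely."""
    result, skipped = [], 0
    for block in blocks:
        # A block whose range does not overlap the filter is skipped unread.
        if block["max"] < low or block["min"] > high:
            skipped += 1
            continue
        result.extend(v for v in block["values"] if low <= v <= high)
    return result, skipped


# Hypothetical single sorted field, split into blocks of four records.
values = sorted([5, 1, 9, 3, 7, 2, 8, 4, 6, 10, 12, 11])
blocks = build_blocks(values, block_size=4)
hits, skipped = range_query(blocks, low=6, high=7)
```

Because the values are sorted, each block covers a narrow, non-overlapping range, so the filter here reads one block and skips the other two. On unsorted data the per-block ranges would overlap heavily and few blocks could be skipped.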
Arrow Caching
Dremio 4.7+ users can improve query performance by caching their Data Reflections in the Apache Arrow format. Because the Apache Arrow format requires more space than the Parquet format, Dremio administrators should plan for increased disk usage.
Dremio users must enable the feature for each reflection.
Note:
Dremio automatically caches accelerated queries by using the Arrow format.
To enable Arrow Caching:
- In the Dataset Settings modal, click Reflections.
- Click Switch to Advanced.
- Click the gear icon next to either Raw Reflections or Aggregation Reflections.
- Toggle Arrow caching to the on position on the Settings: Raw Reflection or Settings: Aggregation Reflection modals.