Optimizing Data Reflections

Data Reflections can be further optimized by specifying partitioning, sorting, and distribution, based on query patterns and the properties of the underlying dataset. These options can be configured when defining accelerations in the UI or by using SQL commands.

These optimizations can be used with both Raw and Aggregation Reflections.

Partitioning

Data Reflections can be partitioned on one or more columns. When partitioning is specified, Dremio creates multiple files based on the partitioning configuration.

Low-cardinality fields are ideal for partitioning (e.g., Day-Month-Year). Ideally, the overall cardinality should be less than 10,000 values; fewer partitions are preferred.
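As a rough illustration of the cardinality guideline, the following Python sketch counts the distinct value combinations that a set of candidate partition columns would produce. The helper `partition_count`, the sample rows, and the column names are hypothetical and exist only for this example; this is not part of Dremio.

```python
# Hypothetical helper: estimate how many partitions a column choice would create.
def partition_count(rows, partition_columns):
    """Count the distinct combinations of values across the partition columns."""
    return len({tuple(row[col] for col in partition_columns) for row in rows})


# Illustrative sample data (assumed, not taken from any real dataset).
rows = [
    {"order_date": "2020-01-01", "region": "EU", "amount": 10},
    {"order_date": "2020-01-01", "region": "US", "amount": 25},
    {"order_date": "2020-01-02", "region": "EU", "amount": 7},
    {"order_date": "2020-01-02", "region": "EU", "amount": 3},
]

MAX_PARTITIONS = 10_000  # guideline stated in the text above

count = partition_count(rows, ["order_date", "region"])
within_guideline = count < MAX_PARTITIONS
print(count, within_guideline)
```

Running a check like this against a sample of the source data before choosing partition columns can flag high-cardinality choices early.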

Dremio optimizes performance by pruning partitions when a query has a filter on a partitioned column.
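The idea behind partition pruning can be sketched in a few lines of Python. This is a conceptual model only, not Dremio's implementation: the functions `write_partitions` and `scan_with_pruning` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def write_partitions(rows, column):
    """Group rows by the value of the partition column (one group per 'file')."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def scan_with_pruning(partitions, wanted_value):
    """Read only the partitions whose key matches the equality filter."""
    scanned = [key for key in partitions if key == wanted_value]
    matches = [row for key in scanned for row in partitions[key]]
    return matches, len(scanned)


# Illustrative sample data (assumed).
sales = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 25},
    {"region": "EU", "amount": 7},
    {"region": "APAC", "amount": 3},
]

by_region = write_partitions(sales, "region")
eu_rows, partitions_read = scan_with_pruning(by_region, "EU")
print(len(by_region), partitions_read, eu_rows)
```

In this model a filter on `region` touches one of the three partitions; a filter on a non-partitioned column such as `amount` would have to read all of them.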

Sorting

Data Reflections can be locally sorted on one or more columns. Sorting ensures that the records are sorted within each node and partition (if any).

Sorting is especially useful in scenarios with range queries and filters. When sorting is enabled, Dremio skips over large blocks of records during query execution based on filters on the sorted columns.
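The following Python sketch shows why sorted data allows large blocks to be skipped. It assumes each block records the minimum and maximum value it contains, which is a common technique in columnar storage; the functions `build_blocks` and `range_scan` are hypothetical and do not describe Dremio's internals.

```python
def build_blocks(sorted_values, block_size):
    """Split sorted values into fixed-size blocks, recording min/max per block."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks


def range_scan(blocks, low, high):
    """Return values in [low, high], skipping blocks that cannot match."""
    result = []
    blocks_read = 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # block lies entirely outside the filter range
        blocks_read += 1
        result.extend(v for v in block["values"] if low <= v <= high)
    return result, blocks_read


values = list(range(100))          # already sorted
blocks = build_blocks(values, 10)  # 10 blocks of 10 values each
hits, blocks_read = range_scan(blocks, 42, 57)
print(len(hits), blocks_read, len(blocks))
```

Because the values are sorted, each block covers a narrow, non-overlapping range, so the range filter reads only 2 of the 10 blocks. On unsorted data the per-block ranges would overlap heavily and few blocks could be skipped.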

Dremio recommends sorting on a single field only.

Arrow Caching

Dremio 4.7+ users can improve query performance by caching their Data Reflections in the Apache Arrow format. Because the Apache Arrow format requires more space than the Parquet format, Dremio administrators should plan for increased disk usage.

Dremio users must enable the feature for each reflection.

Note:
Dremio automatically caches accelerated queries by using the Arrow format.

To enable Arrow Caching:

  1. In the Dataset Settings modal, click Reflections.
  2. Click Switch to Advanced.
  3. Click the gear icon next to either Raw Reflections or Aggregation Reflections.
  4. In the Settings: Raw Reflection or Settings: Aggregation Reflection modal, toggle Arrow caching to the on position.