Dremio maintains physically optimized representations of source data known as Data Reflections. The query optimizer can accelerate a query by utilizing one or more Data Reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source.
Data Reflections are persisted in the Distributed Store, which can reside on HDFS, S3, ADLS, MapR-FS or NAS storage. They are maintained in a high-performance columnar representation based on Apache Parquet and Apache Arrow, utilizing advanced compression techniques such as dictionary encoding, run-length encoding, and delta encoding.
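To make the three encodings named above concrete, here is a minimal Python sketch of each one applied to a single column of values. It illustrates the general techniques only. It is not Dremio's or Parquet's actual implementation, and the function names and sample columns are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a table of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def delta_encode(numbers):
    """Store the first number, then the difference between each pair of neighbours."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running total."""
    decoded = []
    total = 0
    for delta in deltas:
        total += delta
        decoded.append(total)
    return decoded


# Hypothetical column values, chosen only for illustration.
cities = ["NYC", "NYC", "SF", "SF", "SF", "NYC"]
dictionary, codes = dictionary_encode(cities)   # ["NYC", "SF"], [0, 0, 1, 1, 1, 0]
runs = run_length_encode(codes)                 # [(0, 2), (1, 3), (0, 1)]

timestamps = [100, 102, 105, 105]
deltas = delta_encode(timestamps)               # [100, 2, 3, 0]
```

Dictionary encoding helps most when a column has few distinct values, run-length encoding when equal values sit next to each other (for example after sorting), and delta encoding when consecutive numbers are close together.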
A Data Reflection is always associated with a single dataset, also known as its anchor dataset. The anchor may be a physical dataset or a virtual dataset, so it may contain data from one or more data sources.
Data Reflections associated with one dataset can be utilized by the optimizer to accelerate a query on a different dataset.
For example, a Data Reflection whose anchor is foo.bar.business may be used to accelerate a query on foo.bar.restaurants, and vice versa.
Types of Data Reflections
There are various types of Data Reflections:
- Raw reflections – A raw reflection includes one or more fields from the anchor dataset, sorted, partitioned and distributed by specified fields.
- Aggregation reflections – An aggregation reflection includes one or more dimension and measure fields from the anchor dataset, sorted, partitioned and distributed by specified fields.
- External reflections – An external reflection is an unmanaged reflection, which allows users to leverage existing datasets and summary tables built in external systems as reflections in Dremio.
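The difference between raw and aggregation reflections can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not how Dremio builds or stores reflections: the sample rows, field names, and helper functions are all hypothetical, and partitioning and distribution are left out.

```python
from collections import defaultdict

# Hypothetical rows of an anchor dataset.
ROWS = [
    {"city": "Austin", "category": "cafe", "reviews": 10},
    {"city": "Austin", "category": "cafe", "reviews": 5},
    {"city": "Austin", "category": "bar", "reviews": 7},
    {"city": "Boston", "category": "cafe", "reviews": 3},
]


def build_raw_reflection(rows, fields, sort_by):
    """Raw reflection (toy): keep a subset of fields, sorted by one field."""
    projected = [{field: row[field] for field in fields} for row in rows]
    return sorted(projected, key=lambda row: row[sort_by])


def build_aggregation_reflection(rows, dimensions, measure):
    """Aggregation reflection (toy): pre-compute SUM(measure) per dimension combination."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dimension] for dimension in dimensions)
        totals[key] += row[measure]
    return dict(totals)


def total_by(dimension, dimensions, reflection):
    """Answer SUM(measure) grouped by one dimension by rolling up the pre-computed totals."""
    position = dimensions.index(dimension)
    result = defaultdict(int)
    for key, subtotal in reflection.items():
        result[key[position]] += subtotal
    return dict(result)


raw = build_raw_reflection(ROWS, ["city", "reviews"], sort_by="reviews")

dims = ["city", "category"]
agg = build_aggregation_reflection(ROWS, dims, "reviews")
by_city = total_by("city", dims, agg)   # {"Austin": 22, "Boston": 3}
```

In this sketch the grouped query is answered from three pre-aggregated entries instead of four source rows. On a large dataset the aggregated form is typically far smaller than the data it summarizes, which is where the speedup comes from.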
If a query is not being accelerated, make sure you are running the query rather than previewing it. Reflection matching and optimizer choices differ depending on whether the query is previewed or actually run.
Multiple Reflections for a Dataset
For any given dataset in the system, there may be zero or more raw reflections, and zero or more aggregation reflections. Dremio’s cost-based optimizer automatically chooses the best reflections for a given query when there are multiple options.
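As a rough intuition for how a cost-based choice among several reflections might work, the sketch below picks the cheapest candidate that covers every field a query needs. It is a deliberately simplified assumption, not Dremio's actual optimizer: the `Reflection` class, the `choose_reflection` function, and the use of row count as the cost estimate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    name: str
    fields: frozenset   # fields available in this reflection
    row_count: int      # hypothetical stand-in for a cost estimate


def choose_reflection(required_fields, candidates):
    """Return the cheapest candidate covering all required fields, or None."""
    covering = [c for c in candidates if required_fields <= c.fields]
    return min(covering, key=lambda c: c.row_count, default=None)


# Hypothetical reflections defined on one dataset.
candidates = [
    Reflection("raw_all", frozenset({"city", "category", "stars"}), 1_000_000),
    Reflection("agg_by_city", frozenset({"city", "stars"}), 50),
]

best = choose_reflection(frozenset({"city", "stars"}), candidates)
fallback = choose_reflection(frozenset({"category"}), candidates)
no_match = choose_reflection(frozenset({"zip_code"}), candidates)
```

Here `best` is the small aggregation reflection, `fallback` is the raw reflection because only it contains `category`, and `no_match` is `None`, which in this toy model corresponds to reading the underlying source directly.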