Dremio maintains physically optimized representations of source data known as Data Reflections. The query optimizer can accelerate a query by utilizing one or more Data Reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source.
Data Reflections are persisted in the Distributed Store, which can reside on HDFS, S3, or local storage. They are maintained in a high-performance columnar representation based on Apache Parquet and Apache Arrow, using compression techniques such as dictionary encoding, run-length encoding, and delta encoding.
A Data Reflection is always associated with a single dataset, also known as its anchor dataset. The anchor may be a physical dataset or a virtual dataset, so it may contain data from one or more data sources.
Data Reflections associated with one dataset can be used by the optimizer to accelerate a query on a different dataset. For example, a Data Reflection whose anchor is foo.bar.business may be used to accelerate a query on foo.bar.restaurants, and vice versa.
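One way to picture this cross-dataset matching is as a coverage check. The sketch below is a deliberate simplification and not Dremio's actual matching algorithm. It assumes both datasets are views over the same hypothetical physical table (src.places); the Reflection class, the can_accelerate function, and all column names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    anchor: str            # dataset the reflection is defined on
    columns: frozenset     # physical source columns it materializes (assumed lineage)


def can_accelerate(reflection, query_columns):
    """Return True if every physical column the query needs is materialized."""
    return set(query_columns) <= reflection.columns


# Hypothetical: foo.bar.business is a view over src.places, and so is
# foo.bar.restaurants, so they share underlying physical columns.
business_reflection = Reflection(
    anchor="foo.bar.business",
    columns=frozenset({
        "src.places.id",
        "src.places.name",
        "src.places.category",
        "src.places.city",
    }),
)

# Columns needed by a query on foo.bar.restaurants (a different dataset).
restaurants_query_columns = {"src.places.name", "src.places.city"}

print(can_accelerate(business_reflection, restaurants_query_columns))
```

In this toy model the anchor name never enters the check: what matters is whether the reflection holds the data the query needs, which is why a reflection anchored on one dataset can serve a query on another.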
Types of Data Reflections
There are three types of Data Reflections:
- Raw reflections. A raw reflection includes one or more fields from the anchor dataset, sorted, partitioned and distributed by specified fields.
- Aggregation reflections. An aggregation reflection includes one or more dimension and measure fields from the anchor dataset, sorted, partitioned and distributed by specified fields.
- External reflections. An external reflection is an unmanaged reflection, which allows users to leverage existing datasets and summary tables built in external systems as reflections in Dremio.
See Creating Data Reflections for use cases for each reflection type.
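To make the difference between raw and aggregation reflections concrete, here is a minimal sketch in plain Python. It only illustrates the shape of the data each type holds; it does not use any Dremio API, and the sample rows, field names, and both helper functions are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical rows of an anchor dataset.
rows = [
    {"city": "Austin", "category": "cafe", "reviews": 12},
    {"city": "Boston", "category": "bar", "reviews": 7},
    {"city": "Austin", "category": "cafe", "reviews": 5},
    {"city": "Austin", "category": "bar", "reviews": 3},
]


def raw_reflection(rows, fields, sort_by):
    """Keep the selected fields of every row, sorted by one field."""
    projected = [{f: r[f] for f in fields} for r in rows]
    return sorted(projected, key=lambda r: r[sort_by])


def aggregation_reflection(rows, dimensions, measure):
    """Pre-compute the sum of one measure for each combination of dimensions."""
    totals = defaultdict(int)
    for r in rows:
        key = tuple(r[d] for d in dimensions)
        totals[key] += r[measure]
    return dict(totals)


raw = raw_reflection(rows, fields=["city", "reviews"], sort_by="reviews")
agg = aggregation_reflection(rows, dimensions=["city"], measure="reviews")

print(raw[0])  # {'city': 'Austin', 'reviews': 3}
print(agg)     # {('Austin',): 20, ('Boston',): 7}
```

The raw version still has one entry per source row, so it suits queries that need row-level detail. The aggregation version has one entry per group, so it suits queries that group by the chosen dimensions.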
Multiple Reflections for a Dataset
For any given dataset in the system, there may be zero or more raw reflections, and zero or more aggregation reflections. Dremio's cost-based optimizer automatically chooses the best reflections for a given query when there are multiple options.
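The idea of choosing among several usable reflections can be sketched as follows. This is a toy cost model under stated assumptions, not the logic of Dremio's optimizer: the cost is simply an estimated row count, and the Candidate class, the choose_reflection function, and the example reflections are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    fields: frozenset      # fields the reflection can answer for
    estimated_rows: int    # stand-in for cost: fewer rows to scan is cheaper


def choose_reflection(candidates, needed_fields):
    """Pick the cheapest candidate that covers the query, or None if none does."""
    needed = set(needed_fields)
    usable = [c for c in candidates if needed <= c.fields]
    # None means: fall back to the underlying data source.
    return min(usable, key=lambda c: c.estimated_rows, default=None)


candidates = [
    Candidate("raw_all", frozenset({"city", "category", "reviews"}), 1_000_000),
    Candidate("agg_by_city", frozenset({"city", "reviews"}), 500),
]

best = choose_reflection(candidates, {"city", "reviews"})
print(best.name)  # agg_by_city
```

Both candidates cover the query here, and the smaller aggregation wins. A query that also needs category can only use raw_all, and a query needing a field that neither candidate holds gets None.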