Dremio maintains physically optimized representations of source data known as Data Reflections. The query optimizer can accelerate a query by utilizing one or more Data Reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source.
Data Reflections are persisted in the Distributed Store, which can reside on HDFS, S3, ADLS, MapR-FS, or NAS storage. They are maintained in a high-performance columnar representation based on Apache Parquet and Apache Arrow, utilizing advanced compression techniques such as dictionary encoding, run-length encoding, and delta encoding.
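To make the three compression techniques named above concrete, here is a minimal Python sketch of each applied to a small column of values. It illustrates the general ideas only; it is not how Parquet, Arrow, or Dremio implement them, and the function names and sample data are made up for this example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a table of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> position in `dictionary`
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical column data, chosen only for illustration.
cities = ["NYC", "NYC", "NYC", "SF", "SF", "NYC"]
dictionary, codes = dictionary_encode(cities)   # strings become small integers
runs = run_length_encode(codes)                 # repeated codes become runs
deltas = delta_encode([1000, 1001, 1003, 1006]) # increasing ids become small gaps
```

The encodings compose naturally in a columnar layout: dictionary encoding turns repeated strings into small integers, and run-length encoding then shrinks long runs of the same integer, which is common when a column is sorted or has low cardinality.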
A Data Reflection is always associated with a single dataset, also known as its anchor dataset. The anchor may be a physical dataset or a virtual dataset, so it may contain data from one or more data sources.
Data Reflections associated with one dataset can be utilized by the optimizer to accelerate a query on a different dataset. For example, a Data Reflection whose anchor is foo.bar.business may be used to accelerate a query on foo.bar.restaurants, and vice versa.
There are various types of Data Reflections, including raw reflections and aggregation reflections.
See Creating Data Reflections for use cases for each reflection type.
If the query is not being accelerated, make sure you are running the query rather than using preview. Reflection matching and optimizer choices are different depending on whether the query is being previewed or actually run.
For any given dataset in the system, there may be zero or more raw reflections, and zero or more aggregation reflections. Dremio’s cost-based optimizer automatically chooses the best reflections for a given query when there are multiple options.
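The idea of choosing among several candidate reflections can be sketched with a toy model. Everything in the Python below is an assumption made for illustration: the Reflection class, the choose_reflection function, the use of row count as a stand-in for cost, and the sample reflections are all hypothetical, and Dremio's actual optimizer considers far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    """Hypothetical description of a candidate reflection."""
    name: str
    kind: str             # "raw" or "aggregation"
    columns: frozenset    # columns the reflection can supply
    row_count: int        # assumed stand-in for the cost of scanning it


def choose_reflection(candidates, needed_columns, needs_row_level):
    """Return the cheapest usable candidate, or None if none can serve the query.

    A candidate is usable when it supplies every needed column and, for
    queries that need individual rows, is not a pre-aggregated reflection.
    """
    needed = frozenset(needed_columns)
    usable = [
        r for r in candidates
        if needed <= r.columns
        and not (needs_row_level and r.kind == "aggregation")
    ]
    return min(usable, key=lambda r: r.row_count, default=None)


# Hypothetical reflections on a single dataset.
candidates = [
    Reflection("raw_all", "raw", frozenset({"city", "state", "stars"}), 1_000_000),
    Reflection("agg_by_state", "aggregation", frozenset({"state", "stars"}), 50),
]

by_state = choose_reflection(candidates, {"state", "stars"}, needs_row_level=False)
row_scan = choose_reflection(candidates, {"city", "stars"}, needs_row_level=True)
no_match = choose_reflection(candidates, {"zip"}, needs_row_level=True)
```

In this toy model, an aggregate query by state picks the much smaller aggregation reflection, a row-level query falls back to the raw reflection, and a query on a column that no reflection covers gets no match, which corresponds to reading the underlying source directly.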