    Types of Reflections and the Benefits of Using Them

    There are two primary types of reflections: raw and aggregation.

    Raw Reflections

    A reflection of this type consists of all of the rows and one or more fields of the underlying dataset that it is created from. The most basic raw reflection is equivalent to a SELECT * FROM the corresponding dataset. You can customize raw reflections by vertically partitioning data (choosing a subset of fields), horizontally partitioning the data (by defining one or more columns to be partition keys), and sorting the data on one or more fields.
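    The three customizations can be pictured with a small in-memory model. This is only a sketch of the idea, not how Dremio stores reflections; the dataset, its field names, and the `build_raw_reflection` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical anchor dataset: every name and value is made up for illustration.
anchor_rows = [
    {"order_id": 3, "region": "EU", "amount": 40.0, "notes": "gift"},
    {"order_id": 1, "region": "US", "amount": 25.0, "notes": ""},
    {"order_id": 2, "region": "EU", "amount": 10.0, "notes": "rush"},
    {"order_id": 4, "region": "US", "amount": 5.0, "notes": ""},
]


def build_raw_reflection(rows, fields, partition_key, sort_key):
    """Keep every row, but only the chosen fields, partitioned and sorted."""
    partitions = defaultdict(list)
    for row in rows:
        # Vertical partitioning: keep only the selected fields.
        projected = {name: row[name] for name in fields}
        # Horizontal partitioning: route the row by its partition key value.
        partitions[row[partition_key]].append(projected)
    for part in partitions.values():
        # Sort the rows inside each partition on the sort field.
        part.sort(key=lambda r: r[sort_key])
    return dict(partitions)


reflection = build_raw_reflection(
    anchor_rows,
    fields=["order_id", "region", "amount"],
    partition_key="region",
    sort_key="order_id",
)
```

    The model keeps one entry per anchor row, drops the unused `notes` field, and groups the rows by `region` with each group ordered by `order_id`.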

    A raw reflection has the same number of records as its anchor dataset, but it is normally a small fraction of the size of the anchor dataset. In many cases, only a small subset of the columns are queried by users, and therefore, it makes sense to include only those columns in the raw reflection.

    Benefits of Using Raw Reflections

    Accelerate Queries on Unoptimized Data or Slow Storage

    Depending on the source, the underlying dataset may be suboptimal for scan-intensive workloads. In addition, the format of the data may be inefficient for scans (e.g., JSON, CSV), and the source may be accessible only through a slow network connection. If the source data is stored in a non-columnar format, using a raw reflection can dramatically improve the performance of queries.

    Accelerate “Needle-in-a-haystack” Queries

    Raw reflections preserve row-level data in a form that is optimized for scans. You can sort and partition the data on specific fields to allow Dremio’s query optimizer to make use of how the data is physically organized to improve query performance.
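    As a generic illustration of why sorted data helps point lookups (not a description of Dremio's internals), the sketch below finds all rows for one key with a binary search instead of reading every row. The rows, the `customer_id` field, and the `lookup` helper are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows, already sorted on "customer_id" as a sorted reflection would keep them.
sorted_rows = [
    {"customer_id": 101, "amount": 5.0},
    {"customer_id": 104, "amount": 12.5},
    {"customer_id": 104, "amount": 3.0},
    {"customer_id": 109, "amount": 8.0},
    {"customer_id": 115, "amount": 1.5},
]
# Sort keys extracted once so that bisect can search them directly.
keys = [row["customer_id"] for row in sorted_rows]


def lookup(customer_id):
    """Return all rows for one key using binary search on the sorted keys."""
    start = bisect_left(keys, customer_id)
    end = bisect_right(keys, customer_id)
    return sorted_rows[start:end]


matches = lookup(104)
```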

    Use Fewer Resources on Production Data Sources

    If a data source is deployed for operational workloads, there is a good chance it is not optimized for scan-intensive workloads. Raw reflections allow Dremio to execute most analytical queries without touching the data source.

    Accelerate Queries That Perform Expensive Transformations

    Transformations applied to data within the query can be expensive to compute, especially for elaborate CASE statements and functions. Rather than use system resources to calculate expensive transformations, queries can use raw reflections that store the results of transformations. Queries can achieve sub-second response times by using such pre-computed results.

    Accelerate Queries on Subsets of Columns in a Dataset

    When a dataset includes hundreds of fields, queries against it usually do not reference every field. If you create reflections on subsets of fields, Dremio’s query optimizer uses the fewest of them that satisfy a query against the dataset, which requires scanning far less data.

    Accelerate Queries on Subsets of Rows in a Dataset

    Predicates that filter the data to subsets can be expensive. In addition, the resulting subsets can be significantly smaller than the total dataset, meaning that far more data is scanned than necessary. When you select a field on which to partition a reflection, Dremio maintains physical partitions of the data. Dremio’s query optimizer prunes partitions when appropriate to optimize query execution.
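    Partition pruning can be sketched as skipping whole partitions whose key cannot match the predicate. The model below is a simplified assumption (a dictionary keyed by a hypothetical `region` partition field), not Dremio's implementation; it counts how many rows are actually read.

```python
# Hypothetical reflection partitioned by "region"; names and values are illustrative only.
partitions = {
    "EU": [{"order_id": 2, "amount": 10.0}, {"order_id": 3, "amount": 40.0}],
    "US": [{"order_id": 1, "amount": 25.0}, {"order_id": 4, "amount": 5.0}],
    "APAC": [{"order_id": 5, "amount": 7.5}],
}


def scan_with_pruning(partitions, wanted_regions):
    """Return the matching rows and the number of rows that had to be read."""
    rows_read = 0
    result = []
    for region, rows in partitions.items():
        if region not in wanted_regions:
            continue  # Pruned: this partition is never read.
        rows_read += len(rows)
        result.extend(rows)
    return result, rows_read


eu_rows, rows_read = scan_with_pruning(partitions, {"EU"})
total_rows = sum(len(rows) for rows in partitions.values())
```

    A filter on `region = 'EU'` reads only the rows in the EU partition (`rows_read`) rather than every row in the reflection (`total_rows`).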

    Accelerate Queries That Perform Complex Joins

    Joining data between datasets can be both CPU and memory intensive, especially when the datasets involved are larger than memory, or the datasets reside in different locations. Using a raw reflection to pre-join data from one or more sources can significantly improve performance.

    Accelerate Queries That Sort Large Datasets

    Sorting large, unsorted datasets can be memory intensive, especially when the datasets involved are larger than memory. You can create raw reflections for which the data is already sorted.

    Aggregation Reflections

    These reflections accelerate BI-style queries that involve aggregations (GROUP BY queries). They can also be configured to work on a subset of the fields of a data source.

    Aggregation reflections are summarizations of anchor datasets, and therefore should have fewer records. The total number of records in the aggregation reflection is at most the product of the number of unique values in each of the dimension columns, and it equals that product only when every combination of dimension values occurs in the data. When the number of unique values in the dimension columns is low, the aggregation will be relatively small, and when there are many unique values it will be larger. For an example, see the section on aggregation reflections above.
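    The size estimate can be checked on a toy dataset. In the sketch below, the rows and the `region` and `year` dimension columns are hypothetical. It compares the product of the per-column cardinalities with the number of dimension combinations that actually occur; the product is the maximum, because a combination that never appears in the data produces no record.

```python
from math import prod

# Hypothetical anchor rows; "region" and "year" act as the dimension columns.
rows = [
    {"region": "EU", "year": 2023, "amount": 10.0},
    {"region": "EU", "year": 2023, "amount": 40.0},
    {"region": "EU", "year": 2024, "amount": 5.0},
    {"region": "US", "year": 2023, "amount": 25.0},
    {"region": "US", "year": 2023, "amount": 7.0},
]
dimensions = ["region", "year"]

# Product of per-column cardinalities: 2 regions * 2 years.
upper_bound = prod(len({row[d] for row in rows}) for d in dimensions)

# Combinations that actually occur in the data (there are no US/2024 rows).
actual_groups = len({tuple(row[d] for d in dimensions) for row in rows})
```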

    While it is possible to define an aggregation reflection that has the same number of records as its anchor dataset (by selecting a dimension with the same cardinality as the dataset), this would defeat the purpose of the aggregation reflection and would be equivalent to using a raw reflection on that dataset.

    Benefits of Using Aggregation Reflections

    Use aggregation reflections to store pre-computed aggregations for combinations of dimensions in datasets. Doing so improves the efficiency of GROUP BY statements, as well as SQL aggregation functions such as SUM and AVG, in queries issued by your data consumers.

    By pre-aggregating data and pre-computing measures (sum, count, min, max, etc.), these expensive (CPU, RAM) operations can be bypassed entirely at query runtime.
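    A minimal sketch of the idea, under assumed names: the aggregation is built once, storing a sum and a count per combination of the hypothetical `region` and `year` dimensions, and a later AVG grouped by `region` is answered by rolling up those stored measures without touching the original rows. The helpers `build_aggregation` and `average_by_region` are illustrative, not part of any Dremio API.

```python
from collections import defaultdict

# Hypothetical anchor rows; names and values are made up for illustration.
rows = [
    {"region": "EU", "year": 2023, "amount": 10.0},
    {"region": "EU", "year": 2023, "amount": 40.0},
    {"region": "EU", "year": 2024, "amount": 5.0},
    {"region": "US", "year": 2023, "amount": 25.0},
    {"region": "US", "year": 2023, "amount": 7.0},
]


def build_aggregation(rows, dimensions, measure):
    """Pre-compute SUM and COUNT of one measure per combination of dimensions."""
    agg = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        agg[key]["sum"] += row[measure]
        agg[key]["count"] += 1
    return dict(agg)


def average_by_region(agg):
    """Answer AVG(amount) grouped by region from the stored sums and counts."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for (region, _year), cell in agg.items():
        totals[region]["sum"] += cell["sum"]
        totals[region]["count"] += cell["count"]
    return {region: t["sum"] / t["count"] for region, t in totals.items()}


aggregation = build_aggregation(rows, ["region", "year"], "amount")
averages = average_by_region(aggregation)
```

    Storing the sum and count rather than the average itself is what makes the roll-up to a coarser grouping possible, since averages of averages would weight the groups incorrectly.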