On this page

    Creating Data Reflections

    This topic describes how to create and manage reflections for tables or views. Different types of reflections help accelerate different types of workloads.


    If the query is not being accelerated, make sure you are running the query rather than using preview. Reflection matching and optimizer choices are different depending on whether the query is being previewed or actually run.

    Raw Reflections

    Raw reflections are typically used to accelerate range lookups, common joins, repetitive transformation patterns or just slow datasets (e.g. CSV files). These kinds of reflections are defined by providing a set of fields to be included in the reflection – these are called Display fields.

    Example: Raw reflection on a Sales

    Let’s assume we want to accelerate a Sales dashboard that displays individual transactions for a single customer that occurred on a given day across multiple years. The following query might drive this dashboard.

    SELECT *
    FROM mySource.sales
         -- e.g. all records for 4th of July across all years
         EXTRACT(day from tx_date) = 4 AND
         EXTRACT(month from tx_date) = 7 AND
         customer_id = 123456789

    The following raw reflection on the mySource.sales table accelerates the example query for all filter options on tx_date:

    • Display: All fields
    • (Optional) Partition: tx_date (low cardinality)
    • (Optional) Sort: customer_id (high cardinality)

    See Optimizing Data Reflections for sorting and partitioning options.

    Aggregation Reflections

    Aggregation reflections are typically used to accelerate BI style aggregation queries. These kinds of reflections are defined by providing a set of dimension and measure values. When building an aggregation reflection, Dremio automatically rolls-up measure values that are selected (default or user enabled) for all combinations of the specified dimension values. This way, at query time, Dremio can serve results from pre-rolled-up version of the data (aggregation reflection), instead of having to scan and roll-up the raw data.

    The following Selected Measures are available for inclusion into an aggregation reflection:

    Selected MeasureDefault Setting
    Approximate distinct countDISABLED
    AVGAutomatically available if SUM & COUNT selected

    Dimensions and measures can be determined by inspecting a query or a set of queries that have commonalities. The following is a high level overview of how to select candidate fields for dimensions and measures:

    Dimension Candidates

    • Aggregated fields (e.g. GROUP BY field1)
    • Filtered fields (e.g. WHERE field1 IN (x,y,z))
    • Fields used in CASE statement conditions (e.g. CASE WHEN field1 IN (x,y,z) THEN...)
    • Fields used in DISTINCT or COUNT DISTINCT statements

    Measure Candidates

    • Fields used in SUM, COUNT, MAX, MIN AVG statements (e.g. total sales, average number of products sold, largest sales opportunity)
    • Fields used in NDV statements (e.g. monthly active users, unique daily sessions)

    Example: Aggregation reflection on Sales

    Let’s assume we want to accelerate a Sales dashboard that shows metrics for each region for a set of product categories based on user input, driven by the following query:

         AVG(sales_amount) average_sales,
         SUM(sales_amount) total_sales,
         MAX(sales_amount) max_tx,
         COUNT(*) tx_count,
    FROM mySource.sales
    GROUP BY region
    WHERE product_category IN ('Electronics','Clothing')

    The following aggregation reflection on the mySource.sales table accelerates all combinations of the above query, including any different filter conditions on product_category:

    • Dimensions: product_category, region
    • Measures: sales_amount

    See Optimizing Data Reflections for sorting and partitioning options.

    External Reflections

    External reflections can be used to leverage datasets that were created outside of Dremio as reflections. For example, a common scenario is using aggregation tables created and maintained by an existing process outside of Dremio as a reflection, without having to replicate a similar pipeline in Dremio.

    Example: External reflection on Sales

    Let’s assume we have an aggregation table called sales_by_region and that this table is an aggregated version of table sales by region. Both of these tables are stored in the mySource source.

    To leverage sales_by_region as an external reflection on the sales table:

    1. Create and save a view that encapsulates the transformation logic to get from sales to sales_by_region. This is to inform Dremio of the relationship between sales and sales_by_region in mySource.
    CREATE VDS "myWorkspace"."sales_by_region" AS
        AVG(sales_amount) average_sales,
        SUM(sales_amount) total_sales,
        COUNT(*) sales_count,
    FROM mySource.sales
    GROUP BY region
    1. Create an external reflection that maps the view above to the aggregation table in mySource:
    ALTER DATASET "myWorkspace"."sales_by_region" 
    CREATE EXTERNAL REFLECTION "external_sales_by_region" 
    USING "mySource"."sales_by_region"

    External reflections do not appear in the UI. They can be inspected using the sys.reflections system dataset (e.g. select * from sys.reflections), along with information regarding their current status.

    Starflake Reflections

    When materializing reflections on datasets with joins, Dremio keeps statistics and detects relationships for each join (e.g. 1-1, many-1). If the joins are non-expanding, Dremio can leverage this property to accelerate a larger set of queries.

    If a user creates a reflection on a dataset that joins a fact table with three dimension tables, given that this reflection meets the above criteria, Dremio can accelerate queries that include any subset of these joins (e.g. fact table joined with just one of the dimension tables), without having to define multiple reflections.

    Both raw and aggregation reflections can support starflake acceleration.

    Example: Starflake reflection on Sales

    For example, let’s assume we have a view (e.g. Enriched Sales Dataset) that joins the following datasets:

    • Sales (fact) - sales_amount, location_id, employee_id, tx_timestamp
    • Locations(dimension) - location_id, zip, region_name
    • Employees (dimension) - employee_id, position, department

    Let’s also assume we have an aggregation reflection on the above Enriched Sales Dataset with the following configuration:

    • Dimensions: zip, region_name, position, department, employee_id, location_id, timestamp
    • Measures: sales_amount

    In this scenario, if a user is querying Enriched Sales Dataset, since there is an aggregation reflection, queries including all combinations of dimensions and measures would be accelerated:

         AVG(sales_amount) average_sales,
    FROM "Enriched Sales Dataset"
    GROUP BY region_name, position

    With starflake reflections, ad-hoc queries that just include any subset of the joins in Enriched Sales Dataset would also be accelerated:

         SUM(s.sales_amount) average_sales,
         Sales s, 
         Locations l
    WHERE s.location_id = l.location_id
    GROUP BY l.zip

    Queries that were accelerated via starflake reflections will feature the regular acceleration flame icon, with an added blue starflake icon in the Jobs page.