Creating Data Reflections

Administrators can create and manage reflections for datasets. Different types of reflections help accelerate different types of workloads.

Raw Reflections

Raw reflections are typically used to accelerate range lookups, common joins, repetitive transformation patterns or just slow datasets (e.g. CSV files). These kinds of reflections are defined by providing a set of fields to be included in the reflection -- these are called Display fields.

Example
Let's assume we want to accelerate a Sales dashboard that displays individual transactions for a single customer that occurred on a given day across multiple years. The following query might drive this dashboard.

SELECT *
FROM mySource.sales
WHERE 
     -- e.g. all records for 4th of July across all years
     EXTRACT(day from tx_date) = 4 AND
     EXTRACT(month from tx_date) = 7 AND
     customer_id = 123456789

The following raw reflection on the mySource.sales table accelerates the example query for all filter options on tx_date:

  • Display: All fields
  • (Optional) Partition: tx_date (low cardinality)
  • (Optional) Sort: customer_id (high cardinality)

See Optimizing Data Reflections for sorting and partitioning options.

Aggregation Reflections

Aggregation reflections are typically used to accelerate BI style aggregation queries. These kinds of reflections are defined by providing a set of dimension and measure values. When building an aggregation reflection, Dremio automatically rolls-up measure values that are selected (default or user enabled) for all combinations of the specified dimension values. This way, at query time, Dremio can serve results from pre-rolled-up version of the data (aggregation reflection), instead of having to scan and roll-up the raw data.

The following Selected Measures are available for inclusion into an aggregation reflection:

Selected Measure Default Setting
SUM ENABLED
COUNT ENABLED
MAX DISABLED
MIN DISABLED
Approximate distinct count DISABLED
AVG Automatically available if SUM & COUNT selected

Dimensions and measures can be determined by inspecting a query or a set of queries that have commonalities. The following is a high level overview of how to select candidate fields for dimensions and measures:

Dimension Candidates

  • Aggregated fields (e.g. GROUP BY field1)
  • Filtered fields (e.g. WHERE field1 IN (x,y,z))
  • Fields used in CASE statement conditions (e.g. CASE WHEN field1 IN (x,y,z) THEN...)
  • Fields used in DISTINCT or COUNT DISTINCT statements

Measure Candidates

  • Fields used in SUM, COUNT, MAX, MIN AVG statements (e.g. total sales, average number of products sold, largest sales opportunity)
  • Fields used in NDV statements (e.g. monthly active users, unique daily sessions)

Example
Let's assume we want to accelerate a Sales dashboard that shows metrics for each region for a set of product categories based on user input, driven by the following query:

SELECT 
     AVG(sales_amount) average_sales,
     SUM(sales_amount) total_sales,
     MAX(sales_amount) max_tx,
     COUNT(*) tx_count,
     region
FROM mySource.sales
GROUP BY region
WHERE product_category IN ('Electronics','Clothing')

The following aggregation reflection on the mySource.sales table accelerates all combinations of the above query, including any different filter conditions on product_category:

  • Dimensions: product_category, region
  • Measures: sales_amount

See Optimizing Data Reflections for sorting and partitioning options.

External Reflections

External reflections can be used to leverage datasets that were created outside of Dremio as reflections. For example, a common scenario is using aggregation tables created and maintained by an existing process outside of Dremio as a reflection, without having to replicate a similar pipeline in Dremio.

Example
Let's assume we have an aggregation table called sales_by_region and that this table is an aggregated version of table sales by region. Both of these tables are stored in the mySource source.

To leverage sales_by_region as an external reflection on the sales table:

  1. Create and save a virtual dataset that encapsulates the transformation logic to get from sales to sales_by_region. This is to inform Dremio of the relationship between sales and sales_by_region in mySource.
    CREATE VDS "myWorkspace"."sales_by_region" AS
    SELECT 
      AVG(sales_amount) average_sales,
      SUM(sales_amount) total_sales,
      COUNT(*) sales_count,
      region
    FROM mySource.sales
    GROUP BY region
    
  2. Create an external reflection that maps the virtual dataset above to the aggregation table in mySource:
    ALTER DATASET "myWorkspace"."sales_by_region" 
    CREATE EXTERNAL REFLECTION "external_sales_by_region" 
    USING "mySource"."sales_by_region"
    

External reflections do not appear in the UI. They can be inspected using the sys.reflections system dataset (e.g. select * from sys.reflections)

Starflake Reflections

When materializing reflections on datasets with joins, Dremio keeps statistics and detects relationships for each join (e.g. 1-1, many-1). If the joins are non-expanding, Dremio can leverage this property to accelerate a larger set of queries.

If a user creates a reflection on a dataset that joins a fact table with three dimension tables, given that this reflection meets the above criteria, Dremio can accelerate queries that include any subset of these joins (e.g. fact table joined with just one of the dimension tables), without having to define multiple reflections.

Both raw and aggregation reflections can support starflake acceleration.

Example
For example, let's assume we have a virtual dataset (e.g. Enriched Sales Dataset) that joins the following datasets:

  • Sales (fact) - sales_amount, location_id, employee_id, tx_timestamp
  • Locations(dimension) - location_id, zip, region_name
  • Employees (dimension) - employee_id, position, department

Let's also assume we have an aggregation reflection on the above Enriched Sales Dataset with the following configuration:

  • Dimensions: zip, region_name, position, department, employee_id, location_id, timestamp
  • Measures: sales_amount

In this scenario, if a user is querying Enriched Sales Dataset, since there is an aggregation reflection, queries including all combinations of dimensions and measures would be accelerated:

SELECT
     AVG(sales_amount) average_sales,
     region_name,
     position
FROM "Enriched Sales Dataset"
GROUP BY region_name, position

With starflake reflections, ad-hoc queries that just include any subset of the joins in Enriched Sales Dataset would also be accelerated:

SELECT
     SUM(s.sales_amount) average_sales,
     l.zip
FROM 
     Sales s, 
     Locations l
WHERE s.location_id = l.location_id
GROUP BY l.zip

Queries that were accelerated via starflake reflections will feature the regular acceleration flame icon, with an added blue starflake icon in the Jobs page.


results matching ""

    No results matching ""