Accelerating Queries with Reflections
A reflection is an optimized materialization of source data or a query, similar to a materialized view, that is derived from an existing virtual or physical dataset. A dataset that a reflection is derived from is that refection’s anchor dataset.
Dremio’s query optimizer can accelerate a query against datasets by using one or more reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source. Queries do not need to reference reflections directly. Instead, Dremio rewrites queries on the fly to use the reflections that satisfy them.
When Dremio receives a query, it determines first whether any reflections have at least one physical dataset in common with the datasets that the query references. If any reflections do, Dremio evaluates them to determine whether they satisfy the query. Then, if any reflections do satisfy the query, Dremio generates a query plan that uses them.
Dremio then compares the cost of the plan to the cost of executing the query directly against the physical datasets, and selects the plan with the lower cost. Finally, Dremio executes the selected query plan. Typically, plans that use one or more reflections are less expensive than plans that run against raw data. To get the best results from using reflections, see “Best Practices for Creating Raw and Aggregation Reflections”.
For example, suppose that three physical datasets named
items are created from data sources. A data engineer filling a requirement for allowing single queries across all three datasets creates a virtual dataset named
order_detail to join them:
SELECT * FROM ((order INNER JOIN customer ON order.cust_id = customer.cust_id INNER JOIN item on order.item_id = item.item_id))
A business analyst wants to get a summary of the orders for each customer in the United States. She creates a virtual dataset named
customer_summary_US that is defined by this query on
SELECT cust_name, city, COUNT(item_id), SUM(item_price) FROM order_detail WHERE cust_country = 'US' GROUP BY cust_name ORDER BY SUM(item_price) DESC
Because this query is computationally intensive, takes a bit of time to run, and is a query that she runs frequently, the business analyst, together with the data engineer, creates a reflection from her virtual dataset, following best practices. They give the reflection the same name as the virtual dataset.
The relationships between the reflection, datasets, and the data sources can be represented like this:
When the business analyst runs her query from her BI client application, Dremio’s query planner runs the query against the reflection, not against the virtual dataset
customer_summary_US, even though her query references the dataset. Her query runs much faster because the Dremio engine running it does not have to descend through the datasets to the raw data in the data source.