Best Practices for Creating Raw and Aggregation Reflections
Dremio recommends that you follow these best practices when creating reflections.
Design reflections for expensive query patterns
- Review the query history (jobs) to identify the most expensive and most frequently submitted queries.
- Look in the job profiles for these queries. Datasets referenced by multiple queries that perform expensive scans, joins, and aggregations are good candidates for reflections.
- Examine the SQL for the selected queries that reference the same dataset to find patterns that can help you define a reflection on that dataset that satisfies as many of those queries as possible.
Avoid the “more are always better” approach
Creating more reflections than your data consumers need can consume more resources than is optimal for your environment, both in system resources and in the time and attention required to maintain the reflections.
Establish criteria for when to create reflections
Create reflections only when data consumers are experiencing slow query responses or when reports are not meeting established SLAs.
Create reflections where they do the most work without duplicating the work of other reflections
You can do so by following the recommendation in the section “Layering Datasets” in “Datasets”.
Establish a routine for checking how often reflections are used by the query planner
At regular intervals, check for reflections that are no longer being used and evaluate whether they should be removed. Query patterns can change over time, and frequently-used reflections can gradually become less relevant.
Use Supporting Anchor Datasets
Anchor datasets for reflections are virtual datasets that data consumers have access to from their business-intelligence tools. As you develop a better understanding of query patterns, you might want to support those patterns by creating reflections from virtual datasets that perform expensive joins, transformations, filters, calculations, or a combination of those operations. You would probably not want data consumers to be able to access those virtual datasets directly in situations where the query optimizer did not use any of the reflections created from those datasets. Repeated and concurrent queries on such virtual datasets could put severe strain on system resources.
You can prevent queries run by data consumers from accessing those virtual datasets directly. Anchor datasets that perform expensive operations and to which access is restricted are called supporting anchor datasets.
For example, suppose that you find that three very large physical datasets, Customer, Orders, and Lineitem, are used in many queries.
You determine that there are a few common patterns in the user queries on these datasets:
- The queries frequently join the three datasets together.
- Queries always filter by commit_date < ship_date.
- Most of the queries include a calculated field: extended_price * (1 - discount) AS revenue.
You can create a virtual dataset that applies these common patterns, and then create a raw reflection to accelerate queries that follow these patterns.
First, you create a folder in the Dremio space that your data consumers have access to. Then, you configure this folder to be invisible and inaccessible to the data consumers.
Next, when you write the query to create the virtual dataset, follow these guidelines:
- Use SELECT * to include all fields, making it possible for the query optimizer to accelerate the broadest set of queries. Alternatively, if you know exactly which subset of fields is used from the three datasets, you can include just that subset in the virtual dataset.
- Add any calculated fields, which in this case is the revenue field.
- Apply the appropriate join on the three datasets.
- Apply any filters that are used by all queries, which in this case is only commit_date < ship_date.
- Always use the most generic predicate possible to maximize the number of queries that will match.
Next, you run the following query to create a new virtual dataset:
SELECT *, extended_price * (1 - discount) AS revenue
FROM customer AS c, orders AS o, lineitem AS l
WHERE c.c_custkey = o.o_custkey
  AND l.l_orderkey = o.o_orderkey
  AND o.commit_date < o.ship_date
Then, you save the virtual dataset in the folder that you created earlier.
Finally, you create one or more raw reflections on this new supporting anchor dataset. If most of the queries against the dataset were aggregation queries, you could instead create an aggregation reflection. In either case, you can select fields, as needed, to sort or partition on.
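As an illustrative sketch, reflections on the supporting anchor dataset might be defined with Dremio's reflection DDL. The exact syntax varies by Dremio edition and version, and the dataset path, reflection names, and field choices below are assumptions:

```sql
-- Hypothetical path to the supporting anchor dataset saved in the hidden folder.
-- Raw reflection: include the fields that queries use, partitioned and sorted as needed.
ALTER DATASET "Sales"."Supporting"."OrderRevenue"
CREATE RAW REFLECTION order_revenue_raw
USING DISPLAY (c_custkey, o_orderkey, commit_date, ship_date, revenue)
PARTITION BY (commit_date)
LOCALSORT BY (revenue);

-- Aggregation reflection: useful if most queries group and aggregate.
ALTER DATASET "Sales"."Supporting"."OrderRevenue"
CREATE AGGREGATE REFLECTION order_revenue_agg
USING DIMENSIONS (commit_date)
MEASURES (revenue);
```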
The result is that, even though the data consumers do not have access to the supporting anchor dataset, Dremio can accelerate their queries by using the new reflections, as long as the consumers have access to the physical datasets that the reflections are ultimately derived from: Customer, Orders, and Lineitem.
If the query optimizer determines that a query cannot be satisfied by any of the reflections, and no other virtual datasets can satisfy it, the query runs directly against the physical datasets, as is the case with any query.
Horizontally Partition Reflections that Have Many Rows
If you select a field for partitioning in a data reflection, Dremio physically groups records together into a common directory on the file system. For example, if you partition by the field Country, in which the values are two-letter abbreviations for the names of countries, such as US, UK, DE, and CA, Dremio stores the data for each country in separate directories named US, UK, DE, CA, and so on. This allows Dremio to scan only a subset of the directories based on the query, an optimization called partition pruning.
If a user queries on records for which the value of Country is US or UK, then Dremio can apply partition pruning to scan only the US and UK directories, significantly reducing the amount of data that is scanned for the query.
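For instance, assuming a dataset with a Country field backing a reflection partitioned on that field (the dataset and field names here are hypothetical), a query like this touches only the US and UK directories:

```sql
-- Partition pruning: only the US and UK partition directories are scanned.
SELECT customer_id, order_total
FROM sales
WHERE Country IN ('US', 'UK')
```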
When you are selecting a partitioning field for a data reflection, ask yourself these questions:
- Is the field used in many queries?
- Are there relatively few unique values in the field (low cardinality)?
To partition the data, Dremio must first sort all records, which consumes resources. Accordingly, partition data only on fields that can be used to optimize queries. In addition, the number of unique values for a field should be relatively small, so that Dremio creates only a relatively small number of partitions. If all values in a field are unique, the cost to partition outweighs the benefit.
In general, Dremio recommends that the total number of partitions for a reflection be less than 10,000.
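As a sketch, partitioning a raw reflection on a low-cardinality field such as Country might look like this in Dremio's reflection DDL (syntax varies by version; the dataset, reflection, and field names are assumptions):

```sql
ALTER DATASET "Sales"."sales"
CREATE RAW REFLECTION sales_by_country
USING DISPLAY (customer_id, order_total, Country)
-- Country has a small, fixed set of values, so only a few partitions are created.
PARTITION BY (Country);
```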
Sort Reflections on High-Cardinality Fields
The sort option is useful for optimizing queries that use filters or range predicates, especially on fields with high cardinality. If sorting is enabled, during query execution, Dremio skips over large blocks of records based on filters on sorted fields.
Dremio sorts data during the execution of a query if a reflection spans multiple nodes and is composed of multiple partitions.
Sorting on more than one field in a single data reflection typically does not improve read performance significantly and increases the costs of maintenance tasks.
For workloads that need sorting on more than one field, consider creating multiple reflections, each being sorted on a single field.
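A sketch of that approach, again using Dremio's reflection DDL (syntax varies by version; all names here are assumptions): two reflections over the same fields, each sorted on a single high-cardinality field.

```sql
-- Reflection optimized for filters and range predicates on order_date.
ALTER DATASET "Sales"."sales"
CREATE RAW REFLECTION sales_sorted_by_date
USING DISPLAY (order_id, order_date, order_total)
LOCALSORT BY (order_date);

-- Reflection optimized for filters and range predicates on order_total.
ALTER DATASET "Sales"."sales"
CREATE RAW REFLECTION sales_sorted_by_total
USING DISPLAY (order_id, order_date, order_total)
LOCALSORT BY (order_total);
```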
Create Reflections from Generalized Joins Based on Multiple Queries
Joins between datasets tend to be expensive. You can reduce the costs of joins by performing them only when building and refreshing reflections.
As an administrator, you can identify a group of queries that use similar joins. Then, you can create a general query that uses a join that is based on the similar joins, but does not include any additional predicates from the queries in the group. This generic query can serve as the basis of a raw reflection, an aggregation reflection, or both.
For example, consider the following three queries, which use similar joins on virtual datasets A, B, and C:

SELECT a1, b1, c1
FROM a, b, c
WHERE a3 > '2001-01-01' AND b3 IN ('red', 'blue', 'green')

SELECT a1, a2, c1, COUNT(b1)
FROM a, b, c
WHERE size = 'M' AND b2 < 10 AND c2 > 2
GROUP BY a1, a2, c1

SELECT a1, b2
FROM a, b, c
WHERE c1 = 123
You can write and run this generic query, which includes the joined fields but omits the query-specific predicates, to create a raw reflection that accelerates all three original queries:
SELECT a1, a2, a3, b1, b2, b3, c1, c2
FROM a, b, c
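To put the generic query to work, you might save it as a virtual dataset and anchor a raw reflection to it, sketched here with Dremio's CREATE VDS and reflection DDL (syntax varies by version; the space and names are assumptions):

```sql
-- Hypothetical: save the generic join as a virtual dataset.
CREATE VDS "Analytics"."abc_joined" AS
SELECT a1, a2, a3, b1, b2, b3, c1, c2
FROM a, b, c;

-- Anchor a raw reflection covering all fields of the generic join.
ALTER DATASET "Analytics"."abc_joined"
CREATE RAW REFLECTION abc_joined_raw
USING DISPLAY (a1, a2, a3, b1, b2, b3, c1, c2);
```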