Version: current [26.x Preview]

Accelerate Queries

In Dremio, queries can be accelerated with reflections and results caching.

Reflections

A reflection is a precomputed and optimized copy of source data or a query result, designed to speed up query performance. It is derived from an existing table or view, known as its anchor. Reflections can be:

  • Autonomous: automatically created and managed by Dremio.
  • Manual: created and managed by you.

Dremio's query optimizer uses reflections to accelerate queries by avoiding the need to scan the original data. Instead of querying the raw source, Dremio automatically rewrites queries to use reflections when they provide the necessary results, without requiring you to reference them directly.

When Dremio receives a query, it first determines whether any reflections have at least one table in common with the tables and views that the query references. If so, Dremio evaluates those reflections to determine whether they satisfy the query. If any of them do, Dremio generates a query plan that uses them.

Dremio then compares the cost of the plan to the cost of executing the query directly against the tables, and selects the plan with the lower cost. Finally, Dremio executes the selected query plan. Typically, plans that use one or more reflections are less expensive than plans that run against raw data.
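
The selection steps above can be sketched in a few lines of Python. This is a toy model, not Dremio's planner: the names Reflection, Query, and choose_plan, the cost numbers, and the "covers every requested column" rule for deciding whether a reflection satisfies a query are all assumptions made for illustration.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple


class Reflection(NamedTuple):
    name: str
    tables: FrozenSet[str]    # tables and views the reflection is derived from
    columns: FrozenSet[str]   # columns the reflection materializes
    scan_cost: float          # hypothetical cost of a plan that scans it


class Query(NamedTuple):
    tables: FrozenSet[str]    # tables and views the query references
    columns: FrozenSet[str]   # columns the query needs
    raw_cost: float           # hypothetical cost of running against raw tables


def choose_plan(query: Query, reflections: List[Reflection]) -> Tuple[str, Optional[str]]:
    # Step 1: keep reflections that share at least one table with the query.
    candidates = [r for r in reflections if r.tables & query.tables]
    # Step 2: keep candidates that satisfy the query (toy rule: column coverage).
    matching = [r for r in candidates if query.columns <= r.columns]
    # Step 3: take the cheapest reflection-based plan, if there is one.
    best = min(matching, key=lambda r: r.scan_cost, default=None)
    # Step 4: compare against the raw plan and pick the lower cost.
    if best is not None and best.scan_cost < query.raw_cost:
        return ("reflection", best.name)
    return ("raw", None)
```

In this sketch a reflection that shares no table with the query is dropped before any matching work is done, which mirrors the order of the checks described above.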

Types

There are different types of reflections tailored to specific workloads:

  • Raw Reflections: retain the same number of records as their anchor while allowing a subset of columns. They enhance query performance by materializing complex views, transforming data from non-performant sources into the Iceberg table format optimized for large-scale analytics, and utilizing partitioning and sorting for faster access. By precomputing and storing data in an optimized format, raw reflections reduce query latency and improve overall efficiency.

  • Aggregation Reflections: accelerate BI-style queries that involve aggregations (GROUP BY queries) by precomputing aggregate results (such as SUM, COUNT, and AVG) across selected dimensions and measures. Because the expensive computations are done ahead of time, queries have far less work to do at runtime. These reflections are ideal for analytical workloads with frequent aggregations on large datasets.

  • External Reflections: reference precomputed tables in external data sources instead of materializing reflections within Dremio, eliminating refresh overhead and storage costs. You can use an external reflection by defining a view in Dremio that matches the precomputed table and mapping the view to the external data source. The data in the precomputed table is not refreshed by Dremio. When you query the view, Dremio’s query planner uses the external reflection to generate optimal execution plans, improving query performance without additional storage consumption in Dremio.

  • Starflake Reflections: optimize multi-table joins by leveraging precomputed relationships between fact and dimension tables. When joins do not duplicate rows, Dremio can accelerate a query with a reflection even if the query includes only a subset of the joins in that reflection, reducing the need for separate reflections on different combinations of tables.
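
To make the aggregation reflection idea from the list above concrete, here is a minimal sketch of precomputing measures across dimensions and then answering a coarser GROUP BY from the precomputed rows instead of the raw ones. The sample rows and the function names build_aggregate and avg_by_region are hypothetical, and the in-memory dictionary is only a stand-in for a materialized reflection; it says nothing about how Dremio stores or matches reflections internally.

```python
from collections import defaultdict

# Raw fact rows: (region, day, amount). Sample data invented for this sketch.
rows = [
    ("east", "2024-01-01", 10.0),
    ("east", "2024-01-01", 30.0),
    ("east", "2024-01-02", 20.0),
    ("west", "2024-01-01", 5.0),
]


def build_aggregate(raw_rows):
    """Precompute SUM(amount) and COUNT(*) per (region, day)."""
    agg = defaultdict(lambda: [0.0, 0])
    for region, day, amount in raw_rows:
        cell = agg[(region, day)]
        cell[0] += amount
        cell[1] += 1
    return {key: (total, count) for key, (total, count) in agg.items()}


def avg_by_region(agg):
    """Answer AVG(amount) GROUP BY region by rolling up the precomputed cells."""
    totals = defaultdict(lambda: [0.0, 0])
    for (region, _day), (total, count) in agg.items():
        totals[region][0] += total
        totals[region][1] += count
    return {region: total / count for region, (total, count) in totals.items()}


aggregate = build_aggregate(rows)   # 3 precomputed cells instead of 4 raw rows
print(avg_by_region(aggregate))
```

The average is derived from the stored SUM and COUNT rather than from a stored AVG, because sums and counts can be combined across cells while averages cannot.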

Results Cache

Results caching improves query performance by reusing results from previous executions of the same deterministic query, provided that the underlying dataset remains unchanged. Whether a query runs with or without results caching, it always returns the same results.

Results caching is client-agnostic, meaning a query executed in the Dremio Console will result in a cache hit even if it is later re-run through JDBC or Arrow Flight. For a query to use the cache, its query plan must remain identical to the original cached version. Any changes to the schema or dataset generate a new query plan, invalidating the cache.

Results caching also supports seamless coordinator scale-out, allowing newly added coordinators to benefit immediately from previously cached results.

Conditions

For a query result to be cached and retrieved efficiently, specific conditions must be met:

  • The SQL statement must be a SELECT statement, and the query must read from an Iceberg, Delta (via Unity Catalog), or Parquet dataset.
  • Other data sources, such as relational databases, CSV, JSON, or TEXT, support results caching only if they are covered by a reflection.
  • Queries containing dynamic functions such as QUERY_USER, IS_MEMBER, RAND, CURRENT_DATE, or NOW are not eligible for results caching. Similarly, queries that reference SYS or INFORMATION_SCHEMA tables or involve external queries are not cached, and a result cached for one user is not reused when a different user runs the query.
  • The result set size, when stored in Arrow format, must not exceed 20 MB.
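
The conditions above can be read as a checklist. The sketch below encodes them as a Python predicate; QueryInfo, its fields, and is_cacheable are hypothetical names, the function detection is a crude token scan rather than real SQL parsing, and 20 MB is assumed to mean 20 × 1024 × 1024 bytes. The per-user condition is not modeled here.

```python
import re
from typing import FrozenSet, NamedTuple

CACHEABLE_FORMATS = frozenset({"iceberg", "delta", "parquet"})
DYNAMIC_FUNCTIONS = {"QUERY_USER", "IS_MEMBER", "RAND", "CURRENT_DATE", "NOW"}
SYSTEM_SCHEMAS = {"SYS", "INFORMATION_SCHEMA"}
MAX_RESULT_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as binary megabytes


class QueryInfo(NamedTuple):
    sql: str
    source_formats: FrozenSet[str]   # formats of the datasets the query reads
    covered_by_reflection: bool      # True if a reflection covers the sources
    uses_external_query: bool
    result_bytes: int                # size of the result set in Arrow format


def is_cacheable(q: QueryInfo) -> bool:
    sql_upper = q.sql.upper()
    # Crude tokenization: identifiers only, so "SNOW" does not match "NOW".
    tokens = set(re.findall(r"[A-Z_]+", sql_upper))
    if not sql_upper.lstrip().startswith("SELECT"):
        return False
    if tokens & DYNAMIC_FUNCTIONS or tokens & SYSTEM_SCHEMAS:
        return False
    if q.uses_external_query:
        return False
    # Non-native sources qualify only when a reflection covers them.
    if not (q.source_formats <= CACHEABLE_FORMATS or q.covered_by_reflection):
        return False
    return q.result_bytes <= MAX_RESULT_BYTES
```

For example, a SELECT over a CSV source fails the check unless covered_by_reflection is set, which matches the second condition in the list.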

Storage

Cached results are stored in the distributed storage configured in dremio.conf. This storage must be accessible to both coordinators and executors. Executors write cache entries as Arrow data files and read them when processing SELECT queries that result in a cache hit. Coordinators are responsible for deleting expired cache files.

Deletion

A background task running on one of the Dremio coordinators handles cache expiration. The task runs every hour, marks cache entries that have not been accessed in the past 24 hours as expired, and then deletes them along with their associated cache files.
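
The expiration rule can be illustrated with a single sweep function. This is a sketch under assumptions, not Dremio's implementation: the cache is modeled as a dictionary from a hypothetical cache key to its last access time, and file deletion is reduced to removing the dictionary entry.

```python
from datetime import datetime, timedelta
from typing import Dict, List

MAX_IDLE = timedelta(hours=24)  # entries idle longer than this are expired


def sweep(entries: Dict[str, datetime], now: datetime) -> List[str]:
    """Remove entries not accessed within MAX_IDLE; return the removed keys."""
    expired = sorted(key for key, last_access in entries.items()
                     if now - last_access > MAX_IDLE)
    for key in expired:
        # A real task would also delete the associated Arrow cache files here.
        del entries[key]
    return expired
```

Calling sweep once per hour with the current time would reproduce the schedule described above; an entry can therefore outlive its 24-hour idle limit by up to the sweep interval.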