Refresh Reflections
The data in a Reflection can become stale and may need to be refreshed. Refreshing a Reflection triggers two updates:
- The data stored in the Apache Iceberg table for the Reflection is updated.
- The metadata that stores details about the Reflection is updated.
Dremio does not refresh the data that external Reflections are mapped to.
Types of Reflection Refresh
How Reflections are refreshed depends on the format of the base table.
Apache Iceberg Tables, Filesystem Sources, AWS Glue Sources, and Hive Sources
There are two methods that can be used to refresh Reflections that are defined either on Iceberg tables or on these types of datasets in filesystem, AWS Glue, and Hive sources:
- Parquet datasets in Filesystem sources (on S3, Azure Storage, Google Cloud Storage, or HDFS)
- Parquet datasets, Avro datasets, or non-transactional ORC datasets on AWS Glue or Hive (Hive 2 or Hive 3) sources
Iceberg tables in all supported file-system sources (Amazon S3, Azure Storage, Google Cloud Storage, and HDFS) and non-file-system sources (AWS Glue, Hive, and Nessie) can be refreshed with either of these methods.
Incremental Refreshes
There are two types of incremental refreshes:
- Incremental refreshes when changes to an anchor table are only append operations
- Incremental refreshes when changes to an anchor table include non-append operations
- Whether an incremental refresh can be performed depends on the outcome of an algorithm.
- The initial refresh of a Reflection is always a full refresh.
Incremental Refreshes When Changes to an Anchor Table Are Only Append Operations
Optimize operations on Iceberg tables are also supported for this type of incremental refresh.
This type of incremental refresh is used only when the changes to the anchor table are appends and do not include updates or deletes. There are two cases to consider:
- When a Reflection is defined on one anchor table
  When a Reflection is defined on an anchor table or on a view that is defined on one anchor table, an incremental refresh is based on the differences between the current snapshot of the anchor table and the snapshot at the time of the last refresh.
- When a Reflection is defined on a view that joins two or more anchor tables
  When a Reflection is defined on a view that joins two or more anchor tables, whether an incremental refresh can be performed depends on how many anchor tables have changed since the last refresh of the Reflection:
  - If just one of the anchor tables has changed since the last refresh, an incremental refresh can be performed. It is based on the differences between the current snapshot of the one changed anchor table and the snapshot at the time of the last refresh.
  - If two or more anchor tables have changed since the last refresh, then a full refresh is used to refresh the Reflection.
Incremental Refreshes When Changes to an Anchor Table Include Non-append Operations
For Iceberg tables, this type of incremental refresh is used when the changes are DML operations that delete or modify the data (UPDATE, DELETE, etc.) made either through the Copy-on-Write (COW) or the Merge-on-Read (MOR) storage mechanism. For more information about COW and MOR, see Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg.
For sources in filesystems or AWS Glue, non-append operations can include, for example:
- In filesystem sources, files being deleted from Parquet datasets
- In AWS Glue sources, DML-equivalent operations being performed on Parquet datasets, Avro datasets, or non-transactional ORC datasets
Both the anchor table and the Reflection must be partitioned, and the partition transforms that they use must be compatible.
There are two cases to consider:
- When a Reflection is defined on one anchor table
  When a Reflection is defined on an anchor table or on a view that is defined on one anchor table, an incremental refresh is based on Iceberg metadata that is used to identify modified partitions and to restrict the scope of the refresh to only those partitions.
- When a Reflection is defined on a view that joins two or more anchor tables
  When a Reflection is defined on a view that joins two or more anchor tables, whether an incremental refresh can be performed depends on how many anchor tables have changed since the last refresh of the Reflection:
  - If just one of the anchor tables has changed since the last refresh, an incremental refresh can be performed. It is based on Iceberg metadata that is used to identify modified partitions and to restrict the scope of the refresh to only those partitions.
  - If two or more anchor tables have changed since the last refresh, then a full refresh is used to refresh the Reflection.
Dremio uses Iceberg tables to store metadata for filesystem and AWS Glue sources.
For information about partitioning Reflections and applying partition transforms, see the section Horizontally Partition Reflections that Have Many Rows.
For information about partitioning Reflections in ways that are compatible with the partitioning of anchor tables, see Partition Reflections to Allow for Partition-Based Incremental Refreshes.
Full Refreshes
In a full refresh, a Reflection is dropped, recreated, and loaded.
- Whether a full refresh is performed depends on the outcome of an algorithm.
- The initial refresh of a Reflection is always a full refresh.
Algorithm for Determining Whether an Incremental or a Full Refresh Is Used
The following algorithm determines which refresh method is used:
- If the Reflection has never been refreshed, then a full refresh is performed.
- If the Reflection is created from a view that uses nested group-bys, unions, window functions, or joins other than inner or cross joins, then a full refresh is performed.
- If the Reflection is created from a view that joins two or more anchor tables and more than one anchor table has changed since the previous refresh, then a full refresh is performed.
- If the Reflection is based on a view and the changed anchor table is used multiple times in that view, then a full refresh is performed.
- If the changes to the anchor table are only appends, then an incremental refresh based on table snapshots is performed.
- If the changes to the anchor table include non-append operations, then the compatibility of the partitions of the anchor table and the partitions of the Reflection is checked:
- If the partitions of the anchor table and the partitions of the Reflection are not compatible, or if either the anchor table or the Reflection is not partitioned, then a full refresh is performed.
- If the partition scheme of the anchor table has been changed since the last refresh to be incompatible with the partitioning scheme of a Reflection, and if changes have occurred to data belonging to a prior partition scheme or the new partition scheme, then a full refresh is performed.
To avoid a full refresh when these two conditions hold, update the partition scheme for the Reflection to match the partition scheme for the table. You do so in the Advanced view of the Reflection editor or through the ALTER DATASET SQL command.
- If the partitions of the anchor table and the partitions of the Reflection are compatible, then an incremental refresh is performed.
Because this algorithm is used to determine which type of refresh to perform, you do not select a type of refresh for Reflections in the settings of the anchor table.
However, no data is read in the REFRESH REFLECTION job for Reflections that are dependent only on Iceberg, Parquet, Avro, non-transactional ORC datasets, or other Reflections and that have no new data since the last refresh based on the table snapshots. Instead, a "no-op" Reflection refresh is planned and a materialization is created, eliminating redundancy and minimizing the cost of a full or incremental Reflection refresh.
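The decision steps listed above can be sketched as a short function. This is only an illustration of the documented order of checks, not Dremio's implementation: the `RefreshContext` class, its field names, and the returned labels are hypothetical, and the no-op case is left out because the text does not say where in the sequence it is evaluated.

```python
from dataclasses import dataclass


@dataclass
class RefreshContext:
    """Hypothetical summary of the facts that the documented checks look at."""
    never_refreshed: bool = False
    # Nested group-bys, unions, window functions, or joins other than inner/cross.
    view_uses_unsupported_operators: bool = False
    changed_anchor_tables: int = 1
    changed_table_used_multiple_times: bool = False
    changes_are_append_only: bool = True
    # False also covers the case where the table or the Reflection is unpartitioned.
    partitions_compatible: bool = False


def choose_refresh(ctx: RefreshContext) -> str:
    """Walk the documented checks in order and return the refresh type."""
    if ctx.never_refreshed:
        return "full"
    if ctx.view_uses_unsupported_operators:
        return "full"
    if ctx.changed_anchor_tables > 1:
        return "full"
    if ctx.changed_table_used_multiple_times:
        return "full"
    if ctx.changes_are_append_only:
        return "incremental (snapshot-based)"
    if ctx.partitions_compatible:
        return "incremental (partition-based)"
    return "full"


print(choose_refresh(RefreshContext()))
print(choose_refresh(RefreshContext(changes_are_append_only=False)))
```

With the default values, the first call describes an append-only change to a single anchor table, and the second describes a non-append change where the partitions are not compatible.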
Delta Lake tables
Only full refreshes are supported. In a full refresh, the Reflection being refreshed is dropped, recreated, and loaded.
All Other Tables
- Incremental refreshes
  Dremio appends data to the existing data for a Reflection. Incremental refreshes are faster than full refreshes for large Reflections, and are appropriate for Reflections that are defined on tables that are not partitioned.
There are two ways in which Dremio can identify new records:
- For directory datasets in file-based data sources like S3 and HDFS: Dremio can automatically identify new files in the directory that were added after the prior refresh.
- For all other datasets (such as datasets in relational or NoSQL databases): An administrator specifies a strictly monotonically increasing field, such as an auto-incrementing key, that must be of type BigInt, Int, Timestamp, Date, Varchar, Float, Double, or Decimal. This allows Dremio to find and fetch the records that have been created since the last time the acceleration was incrementally refreshed.
Caution: Use incremental refreshes only for Reflections that are based on tables and views that are appended to. If records can be updated or deleted in a table or view, use full refreshes for the Reflections that are based on that table or view.
- Full refreshes
  In a full refresh, the Reflection being refreshed is dropped, recreated, and loaded.
Full refreshes are always used in these three cases:
- A Reflection is partitioned on one or more fields.
- A Reflection is created on a table that was promoted from a file, rather than from a folder, or is created on a view that is based on such a table.
- A Reflection is created from a view that uses nested group-bys, joins, unions, or window functions.
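The incremental approach for datasets that rely on a strictly monotonically increasing field can be illustrated with a small sketch. It is a simplified model, not Dremio's code: the `fetch_new_records` helper and the sample rows are hypothetical. It also shows why the caution about updates applies, because a change to an already fetched row is never picked up.

```python
def fetch_new_records(rows, field, last_seen):
    """Return the rows whose `field` exceeds `last_seen`, plus the new high-water mark."""
    new_rows = [row for row in rows if last_seen is None or row[field] > last_seen]
    if new_rows:
        last_seen = max(row[field] for row in new_rows)
    return new_rows, last_seen


table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
first, mark = fetch_new_records(table, "id", None)   # initial load sees every row

table.append({"id": 3, "v": "c"})                     # an append is detected
table[0]["v"] = "changed"                             # an update to an old row is not
second, mark = fetch_new_records(table, "id", mark)

print([row["id"] for row in second], mark)
```

Only the appended row is returned by the second call, so the modified row with `id` 1 would stay stale until a full refresh runs.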
Specify the Reflection Refresh Policy
In the settings for a data source, you specify the refresh policy for refreshes of all Reflections that are on the tables in that data source. The default policy is period-based, with one hour between each refresh. If you select a schedule policy, the default is every day at 8:00 a.m. (UTC).
In the settings for a table that is not in the Iceberg or Delta Lake format, you can specify the type of refresh to use for all Reflections that are ultimately derived from the table. The default refresh type is Full refresh.
For tables in all supported table formats, you can specify a refresh policy for Reflection refreshes that overrides the policy specified in the settings for the table's data source. The default policy is the schedule set at the source of the table.
To set the refresh policy on a data source:
- In the Dremio console, right-click a data lake or external source.
- Select Edit Details.
- In the sidebar of the Edit Source window, select Reflection Refresh.
- When you are done making your selections, click Save. Your changes go into effect immediately.
To edit the refresh policy on a table:
- Locate the table.
- Hover over the row in which it appears and click the settings icon to the right.
- Select Reflection Refresh in the dataset settings sidebar.
- When you are done making your selections, click Save. Your changes go into effect immediately.
Types of Refresh Policies
Datasets and sources can set Reflections to refresh according to the following policy types:
| Refresh policy type | Description |
|---|---|
| Never | Reflections are not refreshed. |
| Period (default) | Reflections refresh at the specified number of hours, days, or weeks. The default refresh period is one hour. |
| Schedule | Reflections refresh at a specific time on the specified days of the week, in UTC. The default is every day at 8:00 a.m. (UTC). |
| Auto refresh when Iceberg table data changes | Reflections automatically refresh for underlying Iceberg tables whenever new updates occur. Reflections under this policy type are known as Live Reflections. Live Reflections are also updated based on the minimum refresh frequency defined by the source-level policy. This refresh policy is only available for data sources that support the Iceberg table format. |
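To make the Period and Schedule policy types concrete, the following sketch computes when the next refresh would be due under each. It is an assumption-laden illustration: the function names and parameters are hypothetical, and Dremio's actual scheduler may behave differently. The defaults mirror the ones in the table (one hour, and every day at 8:00 a.m. UTC).

```python
from datetime import datetime, time, timedelta, timezone


def next_period_refresh(last_refresh, period=timedelta(hours=1)):
    """Period policy: the next refresh is due one period after the last one."""
    return last_refresh + period


def next_schedule_refresh(now, at=time(8, 0), weekdays=frozenset(range(7))):
    """Schedule policy: next occurrence of `at` (UTC) on an allowed weekday.

    Weekdays follow datetime.weekday(): Monday is 0 and Sunday is 6.
    """
    now = now.astimezone(timezone.utc)
    for offset in range(8):  # today plus the following seven days
        day = now.date() + timedelta(days=offset)
        candidate = datetime.combine(day, at, tzinfo=timezone.utc)
        if day.weekday() in weekdays and candidate > now:
            return candidate
    raise ValueError("weekdays must not be empty")


now = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # a Monday, after 8:00 a.m.
print(next_period_refresh(now))
print(next_schedule_refresh(now))
print(next_schedule_refresh(now, weekdays=frozenset({0})))
```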
Set the Reflection Expiration Policy
Rather than delete a Reflection manually, you can specify how long you want Dremio to retain the Reflection before deleting it automatically.
Dremio does not allow expiration policies to be set on external Reflections or Reflections that automatically refresh when Iceberg data changes according to the refresh policy.
To set the expiration policy for all Reflections derived from tables in a data source:
- Right-click a data lake or external source.
- Select Edit Details.
- Select Reflection Refresh in the edit source sidebar.
- After making your changes, click Save. The changes take effect on the next refresh.
To set the expiration policy on Reflections derived from a particular table:
The table must be based on more than one file.
- Locate a table.
- Click the settings icon to its right.
- Select Reflection Refresh in the dataset settings sidebar.
- After making your changes, click Save. The changes take effect on the next refresh.
View the Reflection Refresh History
You can find out whether a refresh job for a Reflection has run, and how many times refresh jobs for a Reflection have been run.
To view the refresh history:
- In the Dremio console, go to the catalog or folder that lists the table or view from which the Reflection was created.
- Hover over the row for the table or view.
- In the Actions field, click the settings icon.
- Select Reflections in the dataset settings sidebar.
- Click History in the heading for the Reflection.
The Jobs page is opened with the ID of the Reflection in the search box, and only jobs related to that ID are listed.
When a Reflection is refreshed, Dremio runs a single job with two steps:
- The first step writes the query results as a materialization to the distributed acceleration storage by running a REFRESH REFLECTION command.
- The second step registers the materialization table and its metadata with the catalog so that the query optimizer can find the Reflection's definition and structure.
The following screenshot shows the REFRESH REFLECTION command used to refresh the Reflection named Super-duper reflection:
The Reflection refresh is listed as a single job on the Jobs page, as shown in the example below:
To find out which type of refresh was performed:
- Click the ID of the job that ran the REFRESH REFLECTION command.
- Click the Raw Profile tab.
- Click the Planning tab.
- Scroll down to the Refresh Decision section.
Retry a Reflection Refresh Policy
When a Reflection refresh job fails, Dremio retries the refresh according to a uniform policy. This policy is designed to balance resource consumption with the need to keep Reflection data up to date. It prioritizes newly failed Reflections to reduce excessive retries on persistent failures and helps ensure that Reflection data does not become overly stale.
After a refresh failure, Dremio's default is to repeat the refresh attempt at increasing intervals of up to 4 hours: 1 minute, 2 minutes, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, and 4 hours. Then, Dremio continues trying to refresh the Reflection every 4 hours.
There are two optimizations for special cases:
- Long-running refresh jobs: The backoff interval is never shorter than the duration of the last successful refresh.
- Small maximum retry attempts: At least one 4-hour backoff attempt is guaranteed to ensure meaningful coverage of the retry policy.
Dremio stops retrying after 24 attempts, which typically takes about 71 hours and 52 minutes, or when the 72-hour retry window is reached, whichever comes first.
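The retry schedule described above can be modeled roughly as follows. This is a sketch under stated assumptions, not Dremio's implementation: the `retry_delays` function and its parameters are hypothetical, the way the 72-hour window cuts off the sequence is an assumption, and the guarantee of at least one 4-hour backoff for small maximums is not modeled.

```python
# Documented backoff steps in minutes; the last value repeats afterwards.
BACKOFF_MINUTES = [1, 2, 5, 15, 30, 60, 120, 240]
RETRY_WINDOW_MINUTES = 72 * 60


def retry_delays(max_attempts=24, last_success_minutes=0):
    """Return the wait, in minutes, before each retry attempt."""
    delays = []
    elapsed = 0
    for attempt in range(max_attempts):
        base = BACKOFF_MINUTES[min(attempt, len(BACKOFF_MINUTES) - 1)]
        # The backoff is never shorter than the last successful refresh.
        delay = max(base, last_success_minutes)
        # Assumption: stop once the next wait would exceed the retry window.
        if elapsed + delay > RETRY_WINDOW_MINUTES:
            break
        delays.append(delay)
        elapsed += delay
    return delays


print(retry_delays()[:10])
print(len(retry_delays(last_success_minutes=300)))
```

In this model, a Reflection whose last successful refresh took 5 hours gets fewer than 24 retries, because the longer waits use up the retry window first.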
To configure a different maximum number of retry attempts for Reflection refreshes than Dremio's default of 24 retries:
- Click the settings icon in the left navbar.
- Select Reflections in the left sidebar.
- On the Reflections page, click the menu icon in the top-right corner and select Acceleration Settings.
- In the field next to Maximum attempts for Reflection job failures, specify the maximum number of retries.
- Click Save. The change goes into effect immediately.
Dremio applies the retry policy after a refresh failure for all types of Reflection refreshes, regardless of whether the refresh was triggered manually or started by a refresh policy.
Trigger Reflection Refreshes
You can click a button to start the refresh of all of the Reflections that are defined on a table or on views derived from that table.
To trigger a refresh manually:
- Locate the table.
- Hover over the row in which it appears and click the settings icon to the right.
- In the sidebar of the Dataset Settings window, click Reflection Refresh.
- Click Refresh Now. The message "All dependent Reflections will be refreshed." appears at the top of the screen.
- Click Save.
You can refresh Reflections by using the Reflection API, the Catalog API, and the SQL commands ALTER TABLE and ALTER VIEW.
- With the Reflection API, you specify the ID of a Reflection. See Refresh a Reflection.
- With the Catalog API, you specify the ID of a table or view that the Reflections are defined on. See Refresh the Reflections on a Table and Refresh the Reflections on a View.
- With the ALTER TABLE and ALTER VIEW commands, you specify the path and name of the table or view that the Reflections are defined on.
The refresh action follows this logic for the Reflection API:
- If the Reflection is defined on a view, the action refreshes all Reflections that are defined on the tables that the anchor view is itself defined on and on the downstream/dependent views of those tables.
- If the Reflection is defined on a table, the action refreshes the Reflections that are defined on the table and all Reflections that are defined on the downstream/dependent views of the anchor table.
The refresh action follows similar logic for the Catalog API and the SQL commands:
- If the action is started on a view, it refreshes all Reflections that are defined on the tables that the view is itself defined on and on the downstream/dependent views of those tables.
- If the action is started on a table, it refreshes the Reflections that are defined on the table and all Reflections that are defined on the downstream/dependent views of the table.
For example, suppose that you had the following tables and views, with Reflections R1 through R5 defined on them:
            View2(R5)
            /       \
      View1(R3)   Table3(R4)
       /     \
Table1(R1)  Table2(R2)
- Refreshing Reflection R5 through the API also refreshes R1, R2, R3, and R4.
- Refreshing Reflection R4 through the API also refreshes R5.
- Refreshing Reflection R3 through the API also refreshes R1, R2, and R5.
- Refreshing Reflection R2 through the API also refreshes R3 and R5.
- Refreshing Reflection R1 through the API also refreshes R3 and R5.
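The propagation in this example can be reproduced with a small graph walk. The sketch below is a model of the documented behavior rather than Dremio's code: the dictionaries and function names are hypothetical, and it assumes that a refresh reaches every Reflection whose dataset shares at least one base table with the dataset of the refreshed Reflection.

```python
# Each dataset maps to the datasets it is directly defined on.
DEFINED_ON = {
    "View2": ["View1", "Table3"],
    "View1": ["Table1", "Table2"],
    "Table1": [],
    "Table2": [],
    "Table3": [],
}
# Each Reflection maps to the dataset it is defined on.
REFLECTION_OF = {"R1": "Table1", "R2": "Table2", "R3": "View1", "R4": "Table3", "R5": "View2"}


def base_tables(dataset):
    """Return the set of tables that a table or view ultimately reads from."""
    parents = DEFINED_ON[dataset]
    if not parents:
        return {dataset}
    tables = set()
    for parent in parents:
        tables |= base_tables(parent)
    return tables


def also_refreshed(reflection_id):
    """List the other Reflections refreshed along with `reflection_id`."""
    tables = base_tables(REFLECTION_OF[reflection_id])
    hit = {r for r, ds in REFLECTION_OF.items() if base_tables(ds) & tables}
    return sorted(hit - {reflection_id})


for rid in ["R5", "R4", "R3", "R2", "R1"]:
    print(rid, "->", also_refreshed(rid))
```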
Obtain Reflection IDs
You will need one or more Reflection IDs for some of the Reflection hints. Reflection IDs can be found in three places: the Acceleration section of the raw profile of the job that ran a query using the Reflection, the SYS.PROJECT.REFLECTIONS system table, and the Reflection summary objects that you retrieve with the Reflection API.
To find the ID of a Reflection in the Acceleration section of the raw profile of the job that ran a query that used the Reflection:
- In the Dremio console, click the icon for the Jobs page in the side navigation bar.
- In the list of jobs, locate the job that ran the query. If the query was satisfied by a Reflection, an icon appears after the name of the user who ran the query.
- Click the ID of the job.
- Click Raw Profile at the top of the page.
- Click the Acceleration tab.
- In the Reflection Outcome section, locate the ID of the Reflection.
To find the ID of a Reflection in the SYS.PROJECT.REFLECTIONS system table:
- In the Dremio console, click the icon for the SQL editor in the left navbar.
- Copy this query and paste it into the SQL editor:
  Query for listing info about all existing Reflections:
      SELECT * FROM SYS.PROJECT.REFLECTIONS
- Sort the results on the dataset_name column.
- In the dataset_name column, locate the name of the dataset that the Reflection was defined on.
- Scroll the table to the right to look through the display columns, dimensions, measures, sort columns, and partition columns to find the combination of attributes that define the Reflection.
- Scroll the table all the way to the left to find the ID of the Reflection.
To find the ID of a Reflection by using REST APIs: