Skip to main content
Version: 24.3.x

Refreshing Reflections

The data in a reflection can become stale and need to be refreshed. The refresh of a reflection causes two updates:

  • The data stored in the Apache Iceberg table for the reflection is updated.
  • The metadata that stores details about the reflection is updated.

Both of these updates are implied in the term "reflection refresh".

note

Dremio does not refresh the data that external reflections are mapped to.

Types of Reflection Refresh

How reflections are refreshed depend on the format of the base table.

Types of Refresh for Reflections on Apache Iceberg Tables and on Certain Types of Datasets in Filesystem Sources, Glue Sources, and Hive Sources

There are two methods that can be used to refresh reflections that are defined either on Iceberg tables or on these types of datasets in filesystem, Glue, and Hive sources:

  • Parquet datasets in Filesystem sources (on S3, ADLS, GCS, or HDFS)
  • Parquet datasets, Avro datasets, or non-transactional ORC datasets on Glue or Hive (Hive 2 or Hive 3) sources

Iceberg tables in all supported file-system sources (Amazon S3, Azure Data Lake Storage, Google Cloud Storage, and HDFS) and non-file-system sources (AWS Glue, Hive, and Nessie) can be refreshed with either of these methods.

Incremental refreshes

There are two types of incremental refreshes:

  • Incremental refreshes when changes to the base table are only append operations

    This type of incremental refresh is used only when the changes to the base table are appends.

    Refreshes are based on the differences between the current snapshot of the base table and the snapshot at the time of the last reflection refresh. The changes to the base table must be appends only.

  • Incremental refreshes when changes to the base table include non-append operations

    For Iceberg tables, this type of incremental refresh is used when the changes are DML operations (INSERT, UPDATE, DELETE, etc.) made either through the Copy-on-Write (COW) or the Merge-on-Read (MOR) storage mechanism. For more information about COW and MOR, see Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg.

    For sources in filesystems, Glue, or Hive, non-append operations can include, for example:

    • In filesystem sources, files being added to or deleted from Parquet datasets
    • In Glue and Hive sources, DML-equivalent operations being performed on Parquet datasets, Avro datasets, or non-transactional ORC datasets

    note
    • If a reflection is defined on a view, partition-based incremental refreshes are possible only if the view is defined on a single, partitioned table.
    • The initial refresh of a reflection is always a full refresh.

    Refreshes are based on Iceberg metadata that is used to identify modified partitions and to restrict the scope of the refresh to only those partitions.

    note

    Dremio uses Iceberg tables to store metadata for filesystem, Glue, and Hive sources.

    Both the base table and the reflection must be partitioned, and the partition transforms that they use must be compatible.

    For information about partitioning reflections and applying partition transforms, see the section "Horizontally Partition Reflections that Have Many Rows" in Best Practices for Creating Raw and Aggregation Reflections.

    For information about partitioning reflections in ways that are compatible with the partitioning of base tables, see "Partition Reflections to Allow for Partition-Based Incremental Refreshes" in Best Practices for Creating Raw and Aggregation Reflections.

Full refreshes

In a full refresh, the reflection being refreshed is dropped, recreated, and loaded.

Algorithm for Determining Whether an Incremental or a Full Refresh is Used

The following algorithm determines which refresh method is used:

  1. The initial refresh of a reflection is always a full refresh.
  2. If the reflection is created from a view that uses nested group-bys, joins, unions, or window functions, then a full refresh is performed.
  3. If the changes to the base table are only appends, then an incremental refresh based on table snapshots is performed.
  4. If the changes to the base table include non-append operations, then a partition-based incremental refresh is attempted.
  5. If the partitions of the base table and the partitions of the reflection are not compatible, or if either the base table or the reflection is not partitioned, then a full refresh is performed.

Because of this algorithm, you do not select a type of refresh for reflections in the settings of the base table.

note

If the partition scheme of the anchor table has been changed since the last refresh to be incompatible with the partitioning scheme of a reflection, and if changes have occurred to data belonging to a prior partition scheme or the new partition scheme, the reflection will be refreshed in full. To avoid this, update the partition scheme for reflection to match the partition scheme for the table. You do so in the Advanced reflection editor or through the ALTER DATASET SQL command.

Type of Refresh for Reflections on Delta Lake tables

Only full refreshes are supported. In a full refresh, the reflection being refreshed is dropped, recreated, and loaded.

Types of Refresh for Reflections on all other tables

  • Incremental refreshes

    Dremio appends data to the existing data for a reflection. Incremental refreshes are faster than full refreshes for large reflections, and are appropriate for reflections that are defined on tables that are not partitioned.

    There are two ways in which Dremio can identify new records:

    • For directory datasets in file-based data sources like S3 and HDFS: Dremio can automatically identify new files in the directory that were added after the prior refresh.
    • For all other datasets (such as datasets in relational or NoSQL databases): An administrator specifies a strictly monotonically increasing field, such as an auto-incrementing key, that must be of type BigInt, Int, Timestamp, Date, Varchar, Float, Double, or Decimal. This allows Dremio to find and fetch the records that have been created since the last time the acceleration was incrementally refreshed.
    caution

    Use incremental refreshes only for reflections that are based on tables and views that are appended to. If records can be updated or deleted in a table or view, use full refreshes for the reflections that are based on that table or view.

  • Full refreshes

    In a full refresh, the reflection being refreshed is dropped, recreated, and loaded.

    Full refreshes are always used in these three cases:

    • A reflection is partitioned on one or more fields.
    • A reflection is created on a table that was promoted from a file, rather than from a folder, or is created on a view that is based on such a table.
    • A reflection is created from a view that uses nested group-bys, joins, unions, or window functions.

Best practice: Time reflection refreshes to occur after metadata refreshes of tables

Time your refresh reflections to occur only after the metadata for their underlying tables is refreshed. Otherwise, reflection refreshes do not include data from any files that were added to a table since the last metadata refresh, if any files were added.

For example, suppose a data source that is promoted to a table consists of 10,000 files, and that the metadata refresh for the table is set to happen every three hours. Subsequently, reflections are created from views on that table, and the refresh of reflections on the table is set to occur every hour.

Now, one thousand files are added to the table. Before the next metadata refresh, the reflections are refreshed twice, yet the refreshes do not add data from those one thousand files. Only on the third refresh of the reflections does data from those files get added to the reflections.

Setting the Refresh Policy for Reflections

In the settings for a data source, you specify the refresh policy for refreshes of all reflections that are on the tables in that data source. The default policy is period-based, with one hour between each refresh. If you select a schedule policy, the default is every day at 8:00 a.m. (UTC).

In the settings for a table that is not in the Iceberg or Delta Lake format, you can specify the type of refresh to use for all reflections that are ultimately derived from the table. The default refresh type is Full refresh.

For tables in all supported table formats, you can specify a refresh policy for reflection refreshes that overrides the policy specified in the settings for the table's data source. The default policy is the schedule set at the source of the table.

Types of Refresh Policies

Datasets and sources can set reflections to refresh according to the following policy types:

  • Period (default): Reflections refresh at the specified number of hours, days, or weeks. The default period refresh is one hour.
  • Schedule: Reflections refresh at a specific time on the specified days of the week, in UTC. The default is every day at 8:00 a.m. UTC.
  • Never: Reflections never refresh.

Procedures for Setting and Editing the Refresh Policy

To set the refresh policy on a data source:

  1. Right-click a data lake or external source.
  2. Select Edit Details.
  3. In the sidebar of the Edit Source window, select Reflection Refresh.
  4. When you are done making your selections, click Save. Your changes go into effect immediately.

To edit the refresh policy on a table:

  1. Locate the table.
  2. Hover over the row in which it appears and click The Settings icon to the right.
  3. In the sidebar of the Dataset Settings window, click Reflection Refresh.
  4. When you are done making your selections, click Save. Your changes go into effect immediately.

Manually Triggering a Refresh

You can click a button to start the refresh of all of the reflections that are defined on a table or on views derived from that table.

To trigger a refresh:

  1. Locate the table.
  2. Hover over the row in which it appears and click The Settings icon to the right.
  3. In the sidebar of the Dataset Settings window, click Reflection Refresh.
  4. Click Refresh Now. The message "All dependent reflections will be refreshed." appears at the top of the screen.
  5. Click Save.

Viewing the Refresh History for Reflections

You can find out whether a refresh job for a reflection has run, and how many times refresh jobs for a reflection have been run.

Procedure

  1. Go to the space that lists the table or view from which the reflection was created.
  2. Hover over the row for the table or view.
  3. In the Actions field, click The Settings icon.
  4. In the sidebar of the Dataset Settings window, select Reflections.
  5. Click History in the heading for the reflection.

Result

The Jobs page is opened with the ID of the reflection in the search box and only jobs related to that ID listed.

When a reflection is created or refreshed, Dremio runs two jobs by default:

  • The first writes the query results as a materialization to the distributed acceleration storage by running a REFRESH REFLECTION command.
  • The second runs a LOAD MATERIALIZATION METADATA command to create metadata that the query optimizer can use to find out the definition and structure of the reflection.

To find out which type of refresh was performed:

  1. Click the ID of the job that ran the REFRESH REFLECTION command.
  2. Click the Raw Profile tab.
  3. Click the Planning tab.
  4. Scroll down to the Refresh Decision section.

Setting the Maximum Number of Times to Reattempt Reflection Refreshes

You can specify how many times a job should retry refreshing a reflection after an attempt fails. Doing so can help keep reflection-refresh jobs moving at an acceptable rate through an engine queue, so that the data in the corresponding reflections does not become too stale.

  • After a failure to refresh a reflection that has never been refreshed before, further attempts are made every 30 minutes for a maximum of n retries, where n is the number that you specified.

  • After a failure to refresh a reflection that has been refreshed before at least once, the retry behavior depends on the Reflection Refresh settings on the underlying table:

    • If Never refresh is selected, then the reflection is refreshed only when the Refresh Now button is clicked.

    • If Never refresh is not selected, then the reflection is set to be refreshed automatically at specified intervals. After the current interval elapses, a refresh is attempted again. If this second attempt fails, a third attempt is made at the end of the next interval. After every failure, another attempt is made, up to n retries, where n is the number that you specified.

Procedure

  1. Click The Settings icon in the left navbar.
  2. Select Reflections in the left sidebar.
  3. On the Reflections page, click The Settings icon in the top-right corner and select Acceleration Settings.
  4. In the Maximum attempts for reflection job failures field, specify the number of retries to allow.
  5. Click Save. The change goes into effect immediately.

Routing Refresh Jobs to Particular Queues

You can use an SQL command to route jobs for refreshing reflections directly to specified queues. See Queue Routing in the SQL reference.