Refreshing Data Reflections
Refresh Policy: Refresh Interval and Expiration
The system periodically updates the reflections in the Reflection Store to keep Data Reflections fresh. An administrator can specify the desired Refresh Policy for any physical dataset or data source – determining the refresh interval and expiration of reflections. All reflections based on a physical dataset or source will be refreshed accordingly. Refresh Policy options for a physical dataset will override the value for the source.
Dremio will refresh Data Reflections at the provided refresh interval and serve them until the provided expiration.
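The following is a minimal sketch of how the override rule and the two policy settings interact. It is not Dremio's implementation: the `RefreshPolicy` class and the `effective_policy` and `reflection_state` functions are hypothetical names introduced for illustration, and the sketch assumes that a dataset-level policy replaces the source-level policy as a whole.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RefreshPolicy:
    """Hypothetical container for the two Refresh Policy settings."""
    refresh_interval: timedelta
    expiration: timedelta


def effective_policy(source: RefreshPolicy,
                     dataset: Optional[RefreshPolicy]) -> RefreshPolicy:
    # Assumption: a policy set on the physical dataset overrides the
    # policy inherited from its source.
    return dataset if dataset is not None else source


def reflection_state(last_refresh: datetime, now: datetime,
                     policy: RefreshPolicy) -> str:
    """Classify a reflection by its age relative to the policy."""
    age = now - last_refresh
    if age >= policy.expiration:
        return "expired"        # no longer served
    if age >= policy.refresh_interval:
        return "refresh_due"    # still served, but a refresh is due
    return "fresh"
```

For example, with a dataset policy of a 30-minute refresh interval and a 2-hour expiration, a reflection last refreshed 45 minutes ago would be classified as `refresh_due` by this sketch.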
Manual Refresh: Disabling and then re-enabling reflections for a dataset in the Dremio UI causes those reflections to refresh. In addition, all reflections that depend on a given physical dataset can be refreshed manually.
Full and Incremental Refresh
Dremio’s default behavior is to perform a full update of the Data Reflection on each update. However, for larger datasets it is better to enable incremental updates. There are two ways in which the system can identify new records:
- Directory datasets in file-based data sources like S3 and HDFS. The system can automatically identify new files in the directory.
- All other datasets (physical and virtual). An administrator specifies a monotonically increasing field, such as an auto-incrementing key, so that the system can fetch the records created since the reflection was last updated. The data types supported for this field depend on the Dremio release, as listed below.
As of Dremio 3.2, incremental refresh is supported for datasets with fields of the BigInt, Int, Timestamp, Date, Varchar, Float, Double, and Decimal data types.
In releases prior to Dremio 3.2, incremental refresh is supported for datasets with BigInt columns only.
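The two ways of identifying new records can be sketched as follows. This is an illustration of the idea only, not Dremio's internal logic: `new_files` and `new_records` are hypothetical helper names, records are modeled as plain dictionaries, and the "last value" is assumed to be the highest value of the monotonically increasing field seen by the previous refresh.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple

Record = Dict[str, Any]


def new_files(current_files: Iterable[str], seen_files: Set[str]) -> List[str]:
    """Directory datasets: files not covered by the previous refresh."""
    return sorted(f for f in current_files if f not in seen_files)


def new_records(records: Iterable[Record], field: str,
                last_value: Optional[Any]) -> Tuple[List[Record], Optional[Any]]:
    """Other datasets: records whose field value exceeds the last value seen.

    Returns the new records and the updated high-water mark.
    """
    if last_value is None:
        # First refresh: every record is new.
        fresh = list(records)
    else:
        fresh = [r for r in records if r[field] > last_value]
    # Keep the old mark when nothing new arrived.
    high = max((r[field] for r in fresh), default=last_value)
    return fresh, high
```

Because only values greater than the stored mark are fetched, a record that is updated or deleted in place is never picked up by this approach, which is why it suits append-only data.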
To specify incremental refresh for your dataset:
- Go to your source’s promoted folder.
- Click on the settings icon for the promoted folder.
- Select Reflection Refresh.
- Select Incremental Update.
Only append-only datasets are supported for Incremental Update Mode. Updates and deletions of underlying files lead to incorrect results. Dremio recommends using Full Refresh in this case.
Reflections on virtual datasets that include joins cannot be incrementally updated. Dremio falls back to using full refresh for these datasets.
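The constraints above can be summarized as a small decision function. This is a hedged sketch rather than Dremio's actual selection logic: `choose_refresh_method` and its parameters are hypothetical, and treating non-append-only data as a forced full refresh encodes the recommendation above, not an automatic behavior of the system.

```python
def choose_refresh_method(incremental_requested: bool,
                          append_only: bool,
                          has_joins: bool) -> str:
    """Return "incremental" or "full" for a reflection refresh."""
    if not incremental_requested:
        return "full"   # full refresh is the default behavior
    if has_joins:
        return "full"   # virtual datasets with joins fall back to full refresh
    if not append_only:
        return "full"   # recommended: updates and deletions break incremental
    return "incremental"
```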
Routing Refresh Jobs to Particular Queues
You can use an SQL command to route jobs for refreshing reflections directly to specified queues. See Queue Routing in the SQL reference.
Changes to Anchor and Upstream Datasets
Changes in definitions of anchor and/or upstream (i.e. parents, parents of parents) datasets require administrators to re-create affected reflections (including reflections on downstream datasets) to ensure that they are up-to-date.
Dremio guarantees data correctness without any modifications. However, if affected reflections are not re-created when dataset definitions change, queries may not be able to use those reflections.
Updating a reflection definition causes a full refresh of that reflection.
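To decide which reflections need to be re-created after a definition change, it can help to walk the dataset dependency graph downstream from the changed dataset. The sketch below assumes you maintain that graph yourself: the `children` and `reflections` mappings and the `affected_reflections` function are hypothetical and do not correspond to a Dremio API.

```python
from collections import deque
from typing import Dict, List, Set


def affected_reflections(children: Dict[str, List[str]],
                         reflections: Dict[str, List[str]],
                         changed: str) -> Set[str]:
    """Collect reflections on the changed dataset and everything downstream.

    children:    dataset -> datasets defined directly on top of it
    reflections: dataset -> names of reflections anchored on it
    """
    seen = {changed}
    queue = deque([changed])
    affected: Set[str] = set()
    while queue:
        dataset = queue.popleft()
        affected.update(reflections.get(dataset, []))
        for child in children.get(dataset, []):
            if child not in seen:   # guard against visiting a dataset twice
                seen.add(child)
                queue.append(child)
    return affected
```

A breadth-first walk is used here only for simplicity; any traversal that visits every downstream dataset once yields the same set.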