This topic describes how to configure the cache settings for data source metadata. Various caching options are available for individual data sources.
To configure caching settings for data source metadata:
For more information about metadata settings for specific data sources, see the documentation for each data source. See HDFS for a list of data sources.
This enhancement is only available when using instances of Dremio v18.0+.
Metadata refreshes now take place in near-real-time for any dataset size on the following sources and datasets:
In previous versions of Dremio, every metadata refresh was performed in its entirety each time a refresh was scheduled or requested. With near-real-time refreshes, refreshes are performed incrementally, allowing Dremio to identify more rapidly which data splits need to be read as part of a query.
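To make the difference between a full and an incremental refresh concrete, the following is a minimal conceptual sketch. It is not Dremio's implementation. The `FileEntry` record, the `incremental_changes` function, and the file paths are hypothetical names introduced only for illustration. The idea is that an incremental refresh compares the previous file listing with the current one and reprocesses only what changed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical record of one data file: its path and last-modified time."""
    path: str
    mtime: int


def incremental_changes(previous, current):
    """Return (added, removed, modified) paths between two file listings."""
    prev = {f.path: f.mtime for f in previous}
    curr = {f.path: f.mtime for f in current}
    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    # A file counts as modified when it is in both listings with a different mtime.
    modified = sorted(p for p in curr.keys() & prev.keys() if curr[p] != prev[p])
    return added, removed, modified


# Illustrative listings (made-up paths and timestamps).
previous = [
    FileEntry("sales/2021-01.parquet", 100),
    FileEntry("sales/2021-02.parquet", 200),
]
current = [
    FileEntry("sales/2021-02.parquet", 250),
    FileEntry("sales/2021-03.parquet", 300),
]

added, removed, modified = incremental_changes(previous, current)
print("added:", added)
print("removed:", removed)
print("modified:", modified)
```

A full refresh would reprocess every file in `current`. The incremental approach touches only the three paths reported here, which is why its cost depends on the amount of change rather than on the dataset size.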
To activate this functionality, use the dremio.execution.support_unlimited_splits flags. This enables near-real-time metadata refreshes and removes the limitation on the number of data splits scanned. Enabling support flags is done from the Support Settings page.
Using these support keys enables new functionality in Dremio that may cause unexpected behavior with your existing datasets. We recommend testing this functionality in a test environment first.
To perform manual metadata refreshes, use the ALTER TABLE SQL command in the SQL editor for near-instant results.
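As a rough illustration, the sketch below builds the kind of statement you might paste into the SQL editor. It assumes the REFRESH METADATA clause of ALTER TABLE and double-quoted identifiers. The helper names and the dataset path `my_hive.sales.orders` are hypothetical. Check the SQL reference for your Dremio version for the exact syntax and any optional clauses.

```python
def quote_identifier(part: str) -> str:
    """Wrap one path component in double quotes, doubling any embedded quotes."""
    return '"' + part.replace('"', '""') + '"'


def refresh_metadata_sql(path_parts):
    """Build an ALTER TABLE ... REFRESH METADATA statement for a dataset path.

    Assumption: the dataset is addressed as source.folder.table with
    double-quoted identifiers.
    """
    if not path_parts:
        raise ValueError("dataset path must have at least one component")
    path = ".".join(quote_identifier(p) for p in path_parts)
    return f"ALTER TABLE {path} REFRESH METADATA"


# Hypothetical dataset path used only for illustration.
statement = refresh_metadata_sql(["my_hive", "sales", "orders"])
print(statement)
```

Quoting every path component keeps the statement valid when a source or folder name contains spaces or periods.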
First Time Refreshing:
The first time Dremio performs a metadata refresh using this new functionality, the refresh does not complete in near-real-time, because Dremio is preparing the existing metadata for future incremental refreshes. Once this initial "setup" refresh is complete, subsequent refreshes occur in near-real-time.
This is supported on Hive sources (Parquet, Avro, non-ACID ORC), AWS Glue sources (Parquet, Avro, non-ACID ORC), and FileSystem sources (Parquet).
We recommend also enabling Near-Real-Time Metadata Refreshes for Reflections to increase refresh speeds.
This enhancement is only available when using instances of Dremio v18.0+.
Enabling near-real-time metadata refreshes also removes the limitation on the total number of splits supported by Dremio.
To activate this functionality, enable the dremio.execution.support_unlimited_splits flags. This is done from the Support Settings page.
Currently, Arrow caching is not supported with unlimited splits.
This section describes the configurable caching settings.
The Dataset Discovery option determines the refresh interval for top-level source object names, such as the names of databases, tables, and indexes. The default is one hour. This refresh is a lightweight operation. The Dataset Discovery option is not available for file-system sources such as HDFS, MapR-FS, or NAS.
Dataset Details is the metadata Dremio needs for query planning, such as fields, types, shards, statistics, and locality information.
The following fetch modes are available:
Only Queried Datasets: Dremio updates details for previously queried objects in a source. This mode increases query performance as less work needs to be done at query time for these datasets.
All Datasets: (Deprecated as of 3.3) Dremio updates details for all datasets in a source. This mode increases query performance as less work needs to be done at query time.
As Needed: (Not Available as of 3.3) Dremio updates details for a dataset at query time. This mode minimizes metadata queries on a source when not used, but might lead to longer planning times.
Dremio expires the metadata it knows about datasets after the provided Expire after value.
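The expiry rule can be pictured with a small sketch. This is a conceptual model only, not Dremio's internal logic. The `is_expired` function, the three-hour interval, and the timestamps are hypothetical, and whether the boundary instant itself counts as expired is an assumption made here for simplicity.

```python
from datetime import datetime, timedelta


def is_expired(last_refreshed: datetime, expire_after: timedelta, now: datetime) -> bool:
    """Return True once at least `expire_after` has passed since the last refresh."""
    return now - last_refreshed >= expire_after


# Hypothetical values: metadata last refreshed at 12:00 with a 3-hour expiry.
last_refreshed = datetime(2021, 9, 1, 12, 0)
expire_after = timedelta(hours=3)

still_valid = is_expired(last_refreshed, expire_after, datetime(2021, 9, 1, 14, 0))
now_stale = is_expired(last_refreshed, expire_after, datetime(2021, 9, 1, 15, 30))
print(still_valid)  # False: only 2 hours have elapsed
print(now_stale)    # True: 3.5 hours have elapsed
```

In this model, a shorter Expire after value gives fresher metadata at the cost of more frequent refresh work, and a longer value does the opposite.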