Caching Source Metadata

This topic describes how to configure the cache settings for data source metadata. Various caching options are available for individual data sources.

To configure caching settings for data source metadata:

  1. Open the settings for the data source. Data source configuration settings can be set either when adding the data source or after the data source has been added.
  2. Navigate to Metadata > Metadata Refresh.
  3. Modify the settings for the following:
    • Dataset Discovery
    • Dataset Details

For more information about Metadata settings for specific data sources, see the documentation for each data source. See HDFS for a list of data sources.

Near-Real-Time Metadata Refreshes

Version Requirement:

This enhancement is only available when using instances of Dremio v18.0+.

Metadata refreshes now take place in near-real-time for any dataset size on the following sources and datasets:

  • Hive sources using:
    • Parquet (excluding Hudi) datasets
    • Avro datasets
    • Non-transactional ORC datasets
  • FileSystem sources using:
    • Parquet datasets

In previous versions of Dremio, every metadata refresh was performed in its entirety each time a refresh was scheduled or requested. With near-real-time refreshes, metadata is refreshed incrementally, which lets Dremio identify more quickly which data splits need to be read as part of a query.

To activate this functionality, enable the dremio.iceberg.enabled and dremio.execution.support_unlimited_splits flags. Together, these flags enable near-real-time metadata refreshes and remove the limitation on the number of data splits scanned. Support flags are enabled from the Support Settings page.
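The Support Settings page is the route described in this topic. Purely as an illustration, Dremio also accepts ALTER SYSTEM statements for system options; whether these two particular keys can be set this way in your version is an assumption, so verify it in a test environment before relying on it.

```sql
-- Assumption: these support keys can be set as system options through SQL.
-- Run each statement separately in the SQL editor.
ALTER SYSTEM SET "dremio.iceberg.enabled" = true

ALTER SYSTEM SET "dremio.execution.support_unlimited_splits" = true
```

Both keys are needed; neither statement alone is described here as enabling the feature.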

Warning:

Using these support keys enables new functionality in Dremio that may cause unexpected behavior with your existing datasets. We recommend testing this functionality first in a test environment as described here.

To perform a manual metadata refresh, run the ALTER TABLE SQL command in the SQL editor for near-instant results.
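A minimal sketch of a manual refresh follows. The dataset path hive_source.sales.orders is hypothetical; substitute the path of a dataset in one of your own sources.

```sql
-- Hypothetical dataset path: replace hive_source.sales.orders with your own.
-- Requests a metadata refresh for this single dataset.
ALTER TABLE hive_source.sales.orders REFRESH METADATA
```

The statement targets one dataset at a time, so a source with many datasets needs one statement per dataset you want refreshed on demand.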

First Time Refreshing:

The first time Dremio performs a metadata refresh using this new functionality, the refresh does not complete in near-real-time. During this initial refresh, Dremio prepares the existing metadata for future incremental refreshes. Once this initial "setup" refresh is completed, subsequent refreshes complete in near-real-time.

Source Requirement:

This is supported on Hive sources (Parquet, Avro, non-ACID ORC), AWS Glue sources (Parquet, Avro, non-ACID ORC), and FileSystem sources (Parquet).

We recommend also enabling Near-Real-Time Metadata Refreshes for Reflections to increase refresh speeds.

Unlimited Splits

Version Requirement:

This enhancement is only available when using instances of Dremio v18.0+.

Enabling near-real-time metadata refreshes also removes the limitation on the total number of splits supported by Dremio.

To activate this functionality, enable the dremio.iceberg.enabled and dremio.execution.support_unlimited_splits flags. This is done from the Support Settings page.

Currently, Arrow caching is not supported with unlimited splits.

Metadata Refresh Settings

This section describes the configurable caching settings.

Metadata Settings

Dataset Discovery

The Dataset Discovery option determines the refresh interval for top-level source object names, such as the names of databases, tables, and indexes. The default is one hour. This refresh is a lightweight operation. The Dataset Discovery option is not available for file-system sources such as HDFS, MapR-FS, or NAS.

Dataset Details

Dataset Details is the metadata Dremio needs for query planning, such as information on fields, types, shards, statistics, and locality.

The following fetch modes are available:

  • Only Queried Datasets - Dremio updates details for previously queried objects in a source. This mode increases query performance as less work needs to be done at query time for these datasets.
  • All Datasets - (Deprecated as of 3.3) Dremio updates details for all datasets in a source. This mode increases query performance as less work needs to be done at query time.
  • As Needed - (Not Available as of 3.3) Dremio updates details for a dataset at query time. This mode minimizes metadata queries on a source when not used, but might lead to longer planning times.

Dremio expires the metadata it holds about a dataset once the configured Expire after interval has elapsed.

Limitations

  • The Dataset Discovery option is not available for file-system sources such as HDFS, MapR-FS or NAS.
  • Datasets are limited to a maximum width of 800 columns (as of Dremio version 3.1.3). Datasets that already exceed the limit are not queryable after their metadata is refreshed.
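
To find datasets that may be affected by the column limit before a refresh, a query along the following lines can help. It is a sketch, not an official check: it counts only the columns that INFORMATION_SCHEMA reports for datasets whose metadata Dremio already knows, and how the limit treats nested fields is not covered here.

```sql
-- Lists datasets whose reported column count exceeds the 800-column limit.
-- Only datasets with metadata already known to Dremio appear in the result.
SELECT TABLE_SCHEMA, TABLE_NAME, COUNT(*) AS column_count
FROM INFORMATION_SCHEMA."COLUMNS"
GROUP BY TABLE_SCHEMA, TABLE_NAME
HAVING COUNT(*) > 800
ORDER BY column_count DESC
```

Lowering the threshold in the HAVING clause shows datasets that are approaching the limit rather than already over it.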