    Metadata Refresh

    This topic describes how Dremio gathers dataset metadata from external storage systems, and the configuration options available for customizing that behavior on a per-source or per-table basis.

    Dremio gathers metadata information for physical datasets from external sources at regular intervals in order to accelerate end-user queries. During query operation, Dremio uses stored metadata to immediately start SQL planning and processing functions. Information gathered for each table includes:

    • The dataset’s table schema, including columns and data types
    • The dataset’s table partition layout
    • The list of files that are a part of the dataset table (Data Lake sources only)

    Stored metadata refreshes may occur during the following events:

    • Scheduled Refresh
      • Metadata is automatically refreshed at fixed time intervals, such as once every hour. The interval is set on the Advanced Options tab of the Settings dialog for the desired source.
    • Inline Refresh (Automatic)
      • During query runtime, if Dremio discovers that the metadata has either expired or become invalid, a metadata refresh is triggered and the query restarts after the refresh.
    • Manual Refresh
      • Users can manually trigger an immediate metadata refresh by executing the SQL command ALTER TABLE <table> REFRESH METADATA.

    Improved Metadata Refreshes (Preview)

    note:

    Version Requirement: This enhancement is available only in Dremio v18.0+.

    In Dremio v18.0+, the service supports preview access to faster metadata refreshes and to queries that scan an unlimited number of data splits.

    To activate this functionality, enable the dremio.iceberg.enabled and dremio.execution.support_unlimited_splits support keys. Together they enable fast metadata refreshes and remove the limit on the number of data splits scanned. If you are using Hive datasets, we recommend also enabling the store.accurate.partition_stats support key, which provides more accurate partition statistics. These options can be enabled from the Support Settings page.

    warning:

    Using these support keys enables new functionality in Dremio that may cause unexpected behavior with your existing datasets. We recommend testing this functionality first in a test environment as described here.

    Limitations

    • The Dataset Discovery option is not available for FileSystem sources, such as HDFS, MapR-FS, and NAS.
    • Datasets are limited to a maximum width of 800 columns (as of Dremio v3.1.3). Datasets exceeding the limit may not be queryable once their metadata is refreshed.

    Improved Metadata Refreshes

    Dremio utilizes an improved metadata refresh process to more efficiently capture metadata changes, particularly for tables with large numbers of files.

    When activated, the metadata refresh process runs in parallel across the execution engine’s nodes for faster refreshes, and operates incrementally where possible for more efficient refreshes.
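    To illustrate what "incremental" can mean here, the sketch below compares two file listings and reports only what changed between them. It is a generic illustration, not Dremio's internal algorithm: the function name diff_file_listing and the {path: modification_time} listing shape are assumptions made for this example.

```python
def diff_file_listing(previous: dict, current: dict):
    """Compare two {path: modification_time} listings (hypothetical shape).

    Returns three sorted lists: paths added, paths removed, and paths whose
    modification time differs between the two listings.
    """
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return added, removed, changed


if __name__ == "__main__":
    before = {"a.parquet": 1, "b.parquet": 1}
    after = {"b.parquet": 2, "c.parquet": 1}
    print(diff_file_listing(before, after))
```

    An incremental refresh in this sense only needs to reprocess the added, removed, and changed entries rather than every file in the table.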

    Unlimited Splits

    note:

    In Dremio v21.0+, this functionality is enabled by default. Organizations using Dremio v18.X-20.X must enable it manually using the associated support key.

    With unlimited splits, users may run queries in which Dremio processes any number of data splits. Split limitations are removed for the following source types:

    • FileSystem sources (S3, ADLS, GCS, HDFS) using:
      • Parquet formatted tables
      • Iceberg formatted tables
      • Delta Lake formatted tables
    • Hive sources (Hive 2 and Hive 3) using:
      • Parquet formatted tables
      • Avro formatted tables
      • ORC formatted tables (non-transactional only)

    Refreshing Metadata

    Metadata refreshes pick up the latest metadata changes and occur in one of three ways: on a schedule, automatically, or manually.

    Setting Up Scheduled Refreshes

    Scheduling metadata refreshes is done from the Advanced Options tab of the Settings page for a desired source. You need only select the scope of the refresh (all datasets, a single dataset, or as-needed) and when the refresh should occur or expire.

    For more information about Metadata settings for specific data sources, see each data source’s help page.

    Setting Up Automatic Refreshes

    By default, Dremio performs an inline metadata refresh automatically when it discovers that the metadata has expired, or when the metadata is identified as invalid during a query’s execution.
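    The decision described above can be sketched as a small model. This is an illustrative sketch only, not Dremio code: the CachedMetadata record, its fields, and the step names are hypothetical and exist just to show the "expired or invalid, then refresh and restart" flow.

```python
from dataclasses import dataclass


@dataclass
class CachedMetadata:
    """Hypothetical record of the metadata stored for one table."""
    refreshed_at: float   # time of the last refresh, in seconds
    expire_after: float   # allowed age in seconds before the metadata expires
    valid: bool = True    # False once a query finds the metadata invalid


def needs_inline_refresh(meta: CachedMetadata, now: float) -> bool:
    """Return True when the metadata has expired or was found invalid."""
    expired = (now - meta.refreshed_at) > meta.expire_after
    return expired or not meta.valid


def run_query(meta: CachedMetadata, now: float) -> list:
    """Return the ordered steps taken for one query in this toy model."""
    steps = []
    if needs_inline_refresh(meta, now):
        # Refresh first, then restart the query against fresh metadata.
        steps.append("refresh metadata")
        meta.refreshed_at = now
        meta.valid = True
        steps.append("restart query")
    steps.append("execute query")
    return steps


if __name__ == "__main__":
    stale = CachedMetadata(refreshed_at=0.0, expire_after=3600.0)
    print(run_query(stale, now=7200.0))
```

    In this model, a query against fresh and valid metadata goes straight to execution, while an expired or invalid entry adds the refresh and restart steps first.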

    Performing Manual Refreshes

    To ensure that recently written metadata is identified by Dremio, run the ALTER TABLE SQL command on demand, which guarantees that the most recent metadata changes are picked up. Using the syntax specified here, you may also refresh metadata for individual partitions.

    For example, refreshing metadata for a table could be done manually with the following command:

    ALTER TABLE table1 REFRESH METADATA;
    

    This forces a manual refresh of the table1 table’s metadata.

    note:

    First Time Refreshing: The first time Dremio performs a metadata refresh using this new functionality, it may run slowly, because Dremio is preparing the existing metadata for future refreshes by incrementing existing data. Once this initial refresh “setup” is completed, subsequent refreshes run significantly faster.

    Refreshing Partition Metadata

    When a table is partitioned and you know which partitions have been updated since the last refresh, the metadata refresh can be accelerated further by refreshing only the changed partitions. This is particularly useful on large datasets containing numerous files, since it reduces the number of subdirectories that need to be scanned.

    To refresh individual partitions, add the FOR PARTITIONS clause to the ALTER TABLE statement described here. The example below limits the refresh to just the partition requested.

    ALTER TABLE <tableName> REFRESH METADATA FOR PARTITIONS ( "<partitionName>" = '<value>' );
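    When several partition values have changed, one way to produce the statements is to generate one per value from the template above. The helper below is a minimal sketch under stated assumptions: the function name partition_refresh_statements is hypothetical, the table name is assumed to be passed already qualified and quoted as needed, and names or values containing quote characters are rejected rather than escaped, because the escaping rules are not covered in this topic. It only builds strings; submitting them to Dremio is left to whichever client you use.

```python
# Mirrors the single-partition syntax shown in this topic.
TEMPLATE = (
    "ALTER TABLE {table} REFRESH METADATA "
    "FOR PARTITIONS ( \"{name}\" = '{value}' );"
)


def partition_refresh_statements(table: str, partition_name: str, values) -> list:
    """Build one refresh statement per changed partition value.

    Hypothetical helper: it performs no escaping and instead refuses
    inputs containing quote characters.
    """
    if '"' in partition_name:
        raise ValueError("partition name must not contain double quotes")
    statements = []
    for value in values:
        if "'" in value:
            raise ValueError("partition value must not contain single quotes")
        statements.append(
            TEMPLATE.format(table=table, name=partition_name, value=value)
        )
    return statements


if __name__ == "__main__":
    # "dt" and the dates are made-up example values.
    for stmt in partition_refresh_statements("table1", "dt", ["2024-01-01", "2024-01-02"]):
        print(stmt)
```

    Generating one statement per value keeps each command within the single-partition form documented above, at the cost of issuing several commands.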