    Querying Files and Directories

    This topic describes how to query file system data and directories.

    In order to query a file or directory:

    • The file or directory must be configured as a dataset.
    • The Dremio cluster must be connected to Amazon S3, HDFS, or your NAS.
    • The file must be in a supported format, such as:
      • Parquet
      • JSON
      • Delimited files
      • Excel

    Configuring Files as Datasets

    To configure individual files as datasets:

    1. Click on the dataset configuration button.

    2. Hover over the file you want to configure.

    3. Click the configuration button on the right that shows a directory pointing to a directory with a table icon.
      A dialog displays the dataset configuration. The options available in this dialog depend on the format of the file. For a TXT file, for example, you would configure the delimiters and other options.

    4. Click Save, then view the newly created dataset.

    To view the new dataset, navigate back to the directory where the file is stored. The file is now listed as a physical dataset.

    Configuring Directories as Datasets

    Groups of files that share the same structure and sit in a common directory can be queried together as if they were a single table.

    To configure a directory as a dataset:

    1. Navigate into the filesystem data source you have set up in Dremio, such as HDFS. You will see a list of directories.

    2. Click a directory to see the files it contains.

    3. (Optional) Configure each of these files as a dataset that Dremio can query. Alternatively, if all the files share a common structure, you can configure the directory as a dataset, and all the files will be queried together as if they were a single table.

    To configure the directory:

    1. Hover over the directory to view the configuration button.

    2. Click the button on the right that shows a directory pointing to a directory with a table icon.
      Next you will see the dialog for configuring the data in the directory, similar to the dialog for configuring a single file.

      Dremio will sample several files in the directory to guide you through the setup. The options presented here will depend on the format of the files in the directory.

    3. Click Save, then view the contents of the directory. The directory contents now appear as a single dataset.

    To view the new dataset, navigate back to the data source. The directory is now listed as a physical dataset instead of a directory.

    Partitioned Datasets

    When working with partitioned datasets, Dremio automatically discovers partition directory structures and makes the partition values available as additional fields for that dataset.


    By default, Hive/Glue sources may contain a maximum of 300,000 partitions.

    Example Directory Structure

    For the following directory structure, Dremio includes three (3) additional fields (named dirN) that represent the values for the partitions.

    myTable / 2018 / February / 15

    The top-level directory (year) is exposed as dir0, the second (month) as dir1, and the third (day) as dir2.
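    The mapping from directory levels to dirN fields can be sketched in a few lines of Python. This is an illustration only, not Dremio's implementation: the function name partition_fields and the file name data.parquet are assumptions made for the example.

```python
from pathlib import PurePosixPath


def partition_fields(dataset_root: str, file_path: str) -> dict[str, str]:
    """Map each directory level below the dataset root to a dirN field."""
    relative = PurePosixPath(file_path).relative_to(dataset_root)
    # Every path component except the file name is a partition directory.
    directories = relative.parts[:-1]
    return {f"dir{level}": name for level, name in enumerate(directories)}


# Hypothetical file stored inside the example directory structure.
fields = partition_fields("myTable", "myTable/2018/February/15/data.parquet")
print(fields)
```

    For the example path, the sketch yields dir0 for the year, dir1 for the month, and dir2 for the day, all as string values.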


    Querying Partitioned Datasets

    When querying these datasets, filters on the partition columns restrict Dremio to accessing and scanning only the relevant partitions, which can greatly improve query performance.
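    The effect of a partition filter can be illustrated with a small simulation. The sketch below is not how Dremio plans queries; it only shows the idea that equality filters on dirN fields narrow the set of directories to scan. The partition list and the function prune_partitions are hypothetical.

```python
# Hypothetical partition directories under myTable, as (dir0, dir1, dir2).
PARTITIONS = [
    ("2017", "December", "31"),
    ("2018", "February", "14"),
    ("2018", "February", "15"),
    ("2018", "March", "01"),
]


def prune_partitions(partitions, filters):
    """Keep only partitions whose dirN values match every equality filter."""
    selected = []
    for values in partitions:
        row = {f"dir{i}": value for i, value in enumerate(values)}
        if all(row.get(column) == wanted for column, wanted in filters.items()):
            selected.append(values)
    return selected


# Comparable to filtering on dir0 = '2018' AND dir1 = 'February'.
to_scan = prune_partitions(PARTITIONS, {"dir0": "2018", "dir1": "February"})
print(to_scan)
```

    Of the four directories in this example, only the two under 2018 / February remain to be scanned.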

    Running Queries on Parquet-based Datasets

    When running queries with filters on Parquet-based datasets, if some files contain only a single value for a field used in the filter condition, Dremio accesses and scans only the relevant files, even if there is no explicit directory structure for partitioning. Dremio achieves this by inspecting Parquet file footers and using that information for partition pruning at query time.
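    The footer-based pruning described above can be sketched as follows. This is a simplified model under stated assumptions: the file names, the region column, and the ColumnStats class are hypothetical, and the statistics are hard-coded rather than read from real Parquet footers.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Min/max values for one column, as a Parquet footer might record them."""
    minimum: str
    maximum: str


# Hypothetical footer statistics for a "region" column in three files.
FOOTERS = {
    "part-0.parquet": ColumnStats("EU", "EU"),    # single value
    "part-1.parquet": ColumnStats("US", "US"),    # single value
    "part-2.parquet": ColumnStats("APAC", "US"),  # mixed values
}


def files_to_scan(footers, wanted):
    """Skip files that hold a single value different from the wanted one."""
    selected = []
    for name, stats in footers.items():
        single_value = stats.minimum == stats.maximum
        if single_value and stats.minimum != wanted:
            continue  # the file holds one value and it is not the wanted one
        selected.append(name)
    return selected


eu_files = files_to_scan(FOOTERS, "EU")
print(eu_files)
```

    In this model, a file is skipped only when its minimum and maximum are equal and differ from the filter value; files with mixed values are still scanned.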