On this page

    Querying Files and Directories

    This topic describes how to query file system data and directories.

    In order to query a file or directory:

    • The file or directory must be configured as a dataset.
    • The Dremio cluster must be connected to Amazon S3, HDFS, or your NAS.
    • The file formats must be one of the following:
      • Parquet
      • JSON
      • Delimited text files
      • Excel and other supported formats

    note:

    Dremio can read delimited text files and JSON files that are compressed with GZIP (.gz). You can promote such files just as you would promote uncompressed ones. Dremio does not recognize file formats automatically when files are compressed, so you must select the format manually when promoting.

    Configuring Files as Datasets

    To configure individual files as datasets:

    1. Navigate to the file in the data source.

    2. Hover over the file that you want to configure.

    3. Click the configuration button on the right, which shows a folder icon with an arrow pointing to a table icon.
      A dialog displays the dataset configuration options. The available options depend on the format of the file. For a delimited text file, for example, you would configure the delimiter and other options.

    4. Click Save and view the newly created dataset.

    To view the new dataset, navigate back to the directory where the file is stored. The file is now listed as a physical dataset.

    Configuring Directories as Datasets

    Groups of files with the same structure in a common directory can be queried together as if they were a single table.

    To configure a directory as a dataset:

    1. Navigate into the filesystem data source you have set up in Dremio, such as HDFS. You will see a list of directories.

    2. Click a directory to see the files that it contains.

    3. (Optional) Configure each of these files individually as a dataset that Dremio can query. Alternatively, if all the files share a common structure, you can configure the directory itself as a dataset, and all the files are queried together as if they were a single table.

    To configure the directory:

    1. Hover over the directory to view the configuration button.

    2. Click the button on the right that shows a folder icon with an arrow pointing to a table icon.
      A dialog for configuring the data in the directory appears, similar to the dialog for configuring a single file.

      Dremio will sample several files in the directory to guide you through the setup. The options presented here will depend on the format of the files in the directory.

    3. Click Save and view the contents of the directory. The directory contents are now a single dataset.

    To view the new dataset, navigate back to the data source. The directory is now listed as a physical dataset instead of a directory.

    Partitioned Data

    The data in a source dataset might be partitioned into one or more levels of subfolders, one level for each partition column. In such cases, when you format the source directory as a physical dataset, Dremio appends to the physical dataset one column per partition. The data type of the appended columns is varchar.

    Examples

    Example 1

    The source dataset orders is partitioned on the column state. Each subfolder is named state=<abbreviation>, where <abbreviation> is the two-letter abbreviation of the name of a US state.

    When you format orders as a physical dataset, all of the columns from the Parquet files, except state, are included, and Dremio appends the column dir0, which has the data type varchar. The values in that column are state=AK for the rows from the file 0.parquet, state=AL for the rows from the file 1.parquet, state=AR for the rows from the file 2.parquet, and so on.
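    A query on the promoted dataset can filter on the appended dir0 column so that only rows from the matching subfolder are returned. A minimal sketch, assuming the dataset was promoted at the hypothetical path Samples.orders:

    Query filtered on a partition directory
    SELECT *
    FROM Samples.orders
    WHERE dir0 = 'state=AK'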

    Example 2

    The source dataset orders is partitioned on the columns state and zipCode. Each first-level subfolder is named state=<abbreviation>, where <abbreviation> is the two-letter abbreviation of the name of a US state. Each second-level subfolder is named zipCode=<zip code>.

    When you format orders as a physical dataset, all of the columns from the Parquet files, except state and zipCode, are included, and Dremio appends the columns dir0 and dir1, which both have the data type varchar.

    The values in dir0 are state=AK for all rows in which the value in dir1 is zipCode=<zip code in AK>, state=AL for all rows in which the value in dir1 is zipCode=<zip code in AL>, and so on.

    The values in dir1 are zipCode=99502 for the rows from 0.parquet, zipCode=99503 for the rows from 1.parquet, and so on.
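    Both appended columns can be combined in a single filter. A sketch under the same hypothetical path Samples.orders:

    Query filtered on both partition directories
    SELECT *
    FROM Samples.orders
    WHERE dir0 = 'state=AK'
      AND dir1 = 'zipCode=99502'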

    Partition Column Inference

    By default, when a source dataset uses Parquet files and the data is partitioned on one or more columns, Dremio behaves as described in Partitioned Data. However, if you select the option Enable partition column inference in the advanced options for a data source, Dremio handles partition columns differently.

    In addition to appending a column named dir<n> for each partition level and using subfolder names for values in those columns, Dremio detects the name of the partition column, appends a column that uses that name, detects values in the names of subfolders, and uses those values in the appended column.

    Appended columns still use the varchar data type.

    Examples

    Example 1

    The source dataset orders is partitioned on the column state. Each subfolder is named state=<abbreviation>, where <abbreviation> is the two-letter abbreviation of the name of a US state.

    When you format orders as a physical dataset, all of the columns from the Parquet files are included, and Dremio appends the columns dir0 and state, both of which use the varchar data type.

    The values in dir0 are state=AK for the rows from the file 0.parquet, state=AL for the rows from the file 1.parquet, state=AR for the rows from the file 2.parquet, and so on.

    The values in state are AK for the rows from the file 0.parquet, AL for the rows from the file 1.parquet, AR for the rows from the file 2.parquet, and so on.
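    With partition column inference enabled, queries can filter on the inferred state column directly, without matching the state= prefix in dir0. A sketch, again assuming the hypothetical path Samples.orders:

    Query filtered on an inferred partition column
    SELECT *
    FROM Samples.orders
    WHERE state = 'AK'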

    Example 2

    The source dataset orders is partitioned on the columns state and zipCode. Each first-level subfolder is named state=<abbreviation>, where <abbreviation> is the two-letter abbreviation of the name of a US state. Each second-level subfolder is named zipCode=<zip code>.

    When you format orders as a physical dataset, all of the columns from the Parquet files are included, and Dremio appends the columns dir0, dir1, state, and zipCode, all of which use the varchar data type.

    The values in dir0 are state=AK for all rows in which the value in dir1 is zipCode=<zip code in AK>, state=AL for all rows in which the value in dir1 is zipCode=<zip code in AL>, and so on.

    The values in dir1 are zipCode=99502 for the rows from 0.parquet, zipCode=99503 for the rows from 1.parquet, and so on.

    The values in state are AK for all rows in which the value in zipCode is <zip code in AK>, AL for all rows in which the value in zipCode is <zip code in AL>, and so on.

    The values in zipCode are 99502 for the rows from 0.parquet, 99503 for the rows from 1.parquet, and so on.
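    Because the inferred columns still use the varchar data type, numeric partition values such as zip codes can be cast when a numeric comparison is needed. A sketch with the hypothetical path Samples.orders:

    Casting an inferred partition column
    SELECT orderID, CAST(zipCode AS INTEGER) AS zip
    FROM Samples.orders
    WHERE CAST(zipCode AS INTEGER) BETWEEN 99500 AND 99600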

    Requirements

    For the Enable partition column inference option to work correctly, ensure that the names of your subfolders meet these requirements:

    • Names must be in the format colName=<colValue>. The form colName= with an empty value is also valid.
    • Names must meet Dremio’s naming conventions for columns.
    • Names must be unique within and across directory levels.
    • Names must not be present in data files.
    • All Parquet files in the source dataset must be in leaf subfolders.

    How Dremio Handles Existing Physical Datasets

    If you enable the Enable partition column inference option and already have one or more physical datasets that are based on partitioned sources that use Parquet files, those existing physical datasets remain as they are until you run the ALTER command twice on each of them. The first time, you run the command to cause Dremio to forget the metadata for the physical dataset. The second time, you run the command to cause Dremio to refresh the metadata. The commands are listed in ALTER Commands to Cause Dremio to Forget and to Refresh Metadata.

    For example, before you enable the Enable partition column inference option, your orders physical dataset might have these columns:

    orderID   (other columns)   dir0
    000001    …                 state=CA
    000002    …                 state=WA

    Suppose that you enable the Enable partition column inference option. The columns in the physical dataset remain the same. If you want to take advantage of partition column inference, you run these two SQL commands:

    SQL commands for partition column inference
    ALTER TABLE path.to.orders FORGET METADATA
    ALTER TABLE path.to.orders REFRESH METADATA
    

    As a result of the second ALTER command, Dremio adds the column state:

    orderID   (other columns)   dir0       state
    000001    …                 state=CA   CA
    000002    …                 state=WA   WA

    Because Dremio appends the new column, any virtual datasets that are defined on the physical dataset and that use the dir0 column are still valid. When you define new virtual datasets, you can use the appended column.
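    A new virtual dataset can select the appended column directly. A sketch, assuming a hypothetical space named mySpace and the dataset path path.to.orders from the example above:

    Virtual dataset that uses the appended column
    CREATE VDS mySpace.ordersByState AS
    SELECT orderID, state
    FROM path.to.orders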

    Enabling Partition Column Inference

    After you follow the steps in either of these procedures, Dremio uses partition column inference for all source datasets that you promote to physical datasets.

    To enable partition column inference for a new source:

    1. On the Datasets page, click Add Source below the list of sources.
    2. Select Advanced Options.
    3. Select Enable partition column inference.
    4. Specify any other settings that you want for your new data source.
    5. Click Save.

    To enable partition column inference for an existing source:

    1. On the Datasets page, click the name of the data source.
    2. In the top-right corner of the page, click the gear icon.
    3. In the Edit Source dialog, select Advanced Options.
    4. Select Enable partition column inference.
    5. Click Save.
    6. If there are existing physical datasets that are based on datasets in the current data source, run the two ALTER commands listed in ALTER Commands to Cause Dremio to Forget and to Refresh Metadata on each of those physical datasets.

    note:

    If you change the partitioning schema of a source dataset after enabling partition column inference, metadata refreshes of all physical datasets defined on the source dataset fail. To resolve this problem, run the two ALTER commands listed in ALTER Commands to Cause Dremio to Forget and to Refresh Metadata on each of the affected physical datasets.

    ALTER Commands to Cause Dremio to Forget and to Refresh Metadata

    When you enable partition column inference on a source, you might have one or more existing physical datasets in Dremio that are based on datasets in that source. You might also enable partition column inference on a source and then change the partition schema of a source dataset that is the basis of one or more physical datasets in Dremio.

    In both cases, you must run these two ALTER commands on each affected physical dataset:

    Forget and refresh metadata
    ALTER TABLE <dataset_path> FORGET METADATA
    ALTER TABLE <dataset_path> REFRESH METADATA