Querying Files and Directories

This topic describes how to query file system data and directories.

To query a file or directory:

  • The file or directory must be configured as a dataset.
  • The Dremio cluster must be connected to Amazon S3, HDFS, or your NAS.
  • The file must be in a supported format. Supported formats include, among others:
    • Parquet
    • JSON
    • Delimited files
    • Excel

[info] Dremio can query compressed files directly.

Configuring Files as Datasets

To configure individual files as datasets:

  1. Click on the dataset configuration button.
  2. Hover over the file you want to configure.

  3. Click the configuration button on the right, which shows a directory pointing to a directory with a table icon.
    A dialog displays the dataset configuration. The options available in this dialog depend on the format of the file. For a TXT file, for example, you configure the delimiters and other options.

  4. Click Save to view the newly created dataset.

To view the new dataset later, navigate back to the directory where the file is stored. The file is now listed as a physical dataset.

Configuring Directories as Datasets

Groups of files that share the same structure and reside in a common directory can be queried together as if they were a single table.

To configure a directory as a dataset:

  1. Navigate into the file system data source you have set up in Dremio, such as HDFS. Dremio displays a list of the directories in that source.

  2. Click a directory to see the files it contains.

  3. (Optional) Configure each of these files individually to make each one a dataset that Dremio can query. Alternatively, if all the files share a common structure, you can configure the directory as a dataset, and all the files are queried together as if they were a single table.

To configure the directory:

  1. Hover over the directory to view the configuration button.

  2. Click the button on the right that shows a directory pointing to a directory with a table icon.
    A dialog for configuring the data in the directory appears, similar to the dialog for configuring a single file.

    Dremio samples several files in the directory to guide you through the setup. The options presented depend on the format of the files in the directory.

  3. Click Save to view the contents of the directory. The directory contents are now a single dataset.

To view the new dataset later, navigate back to the data source. The directory is now listed as a physical dataset instead of a directory.
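
The idea of treating same-structured files in one directory as a single table can be illustrated outside Dremio with a minimal Python sketch. It is only an analogy, not how Dremio reads data: the function name `read_directory_as_table` and the file names `part1.csv` and `part2.csv` are hypothetical, and the sketch assumes every file is a comma-delimited file with the same header row.

```python
import csv
import tempfile
from pathlib import Path


def read_directory_as_table(directory):
    """Concatenate the rows of every CSV file in a directory into one list."""
    rows = []
    # Sort the paths so that the row order is deterministic.
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Hypothetical sample data: two files that share the same columns.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part1.csv").write_text("id,city\n1,Oslo\n")
    Path(tmp, "part2.csv").write_text("id,city\n2,Lima\n")
    table = read_directory_as_table(tmp)

print(table)
```

Each row carries the same column names regardless of which file it came from, which is the property that lets a directory behave like one table.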

Partitioned Datasets

When working with partitioned datasets, Dremio automatically discovers partition directory structures and makes the partition values available as additional fields for the dataset.

Example Directory Structure

For the following directory structure, Dremio includes three (3) additional fields (named dirN) that represent the values for the partitions.

<DATASET_NAME> / <YEAR_VALUE> / <MONTH_VALUE> / <DAY_VALUE>
-- (e.g. myTable / 2018 / February / 15)

The top-level directory (year) is exposed as dir0, the second level (month) as dir1, and the third level (day) as dir2.

dir0    dir1        dir2
2018    February    15
2018    February    14
2018    February    13
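
The mapping from directory levels to dirN fields can be sketched in a few lines of Python. This is an illustration of the naming scheme only, not Dremio's implementation. The function name `partition_fields` and the file name `data.parquet` are assumptions made for the example.

```python
from pathlib import PurePosixPath


def partition_fields(dataset_root, file_path):
    """Map each directory level between the dataset root and the file to dirN."""
    relative = PurePosixPath(file_path).relative_to(dataset_root)
    # Every path component except the last one (the file name) is a partition directory.
    return {f"dir{i}": part for i, part in enumerate(relative.parts[:-1])}


# "data.parquet" is a hypothetical file name used only for illustration.
fields = partition_fields("myTable", "myTable/2018/February/15/data.parquet")
print(fields)
```

For the example path, the year, month, and day directories end up under the keys dir0, dir1, and dir2, matching the table above.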

Querying Partitioned Datasets

When you query these datasets, filters on the partition columns restrict Dremio to accessing and scanning only the relevant partitions, which can greatly improve query performance.
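
The effect of filtering on partition columns can be sketched as follows. This is a simplified model that assumes equality filters only (comparable to a condition such as dir0 = '2018' AND dir2 = '15'). The function name `prune_partitions` is hypothetical, and the sketch does not describe how Dremio plans queries internally.

```python
def prune_partitions(partitions, filters):
    """Keep only the partitions whose dirN values satisfy every equality filter."""
    return [
        partition
        for partition in partitions
        if all(partition.get(column) == value for column, value in filters.items())
    ]


# Partition values taken from the example directory structure.
partitions = [
    {"dir0": "2018", "dir1": "February", "dir2": "15"},
    {"dir0": "2018", "dir1": "February", "dir2": "14"},
    {"dir0": "2018", "dir1": "February", "dir2": "13"},
]

# Only the directories that match the filter would need to be scanned.
to_scan = prune_partitions(partitions, {"dir0": "2018", "dir2": "15"})
print(to_scan)
```

Of the three example partitions, only one satisfies the filter, so the files in the other two directories never need to be read.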

Running Queries on Parquet-based Datasets

When you run queries with filters on Parquet-based datasets, and some files contain only a single value for a field used in the filter condition, Dremio accesses and scans only the relevant files, even if there is no explicit directory structure for partitioning. Dremio achieves this by inspecting Parquet file footers and using that information for partition pruning at query time.
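
A rough sketch of footer-based pruning is shown below. It assumes that each file's footer exposes a minimum and maximum value for the filtered field. The `FooterStats` class, the `files_to_scan` function, and the file names are hypothetical, and the logic is deliberately simplified to the single-value case described above rather than reproducing Dremio's behavior.

```python
from dataclasses import dataclass


@dataclass
class FooterStats:
    """Hypothetical per-file statistics for one field, as read from a Parquet footer."""
    path: str
    min_value: str
    max_value: str


def files_to_scan(footers, wanted):
    """Skip files that hold a single value for the field when that value is not wanted."""
    selected = []
    for footer in footers:
        single_valued = footer.min_value == footer.max_value
        if single_valued and footer.min_value != wanted:
            continue  # The file cannot contain matching rows, so it is pruned.
        selected.append(footer.path)
    return selected


# Hypothetical footer statistics for a "region" field in three files.
footers = [
    FooterStats("a.parquet", "EU", "EU"),
    FooterStats("b.parquet", "US", "US"),
    FooterStats("c.parquet", "AP", "US"),  # Mixed values: must still be scanned.
]

print(files_to_scan(footers, "EU"))
```

Files with mixed values are kept because the single-value check cannot rule them out. Only files known to contain one non-matching value are skipped.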

