Querying Files and Directories
This topic describes how to query file system data and directories.
To query a file or directory:
- The file or directory must be configured as a dataset.
- The Dremio cluster must be connected to Amazon S3, HDFS, or your NAS.
- The file format must be one of the supported formats, including:
  - Delimited files
  - Excel
Configuring Files as Datasets
To configure individual files as datasets:
Hover over the file you want to configure.
Click the dataset configuration button on the right. Its icon shows a directory pointing to a directory with a table.
A dialog displays the dataset configuration options. The available options depend on the format of the file. For a TXT file, for example, you would configure the delimiters and other options.
Click Save and view the newly created dataset.
To view the new dataset, navigate back to the directory where the file is stored. The file is now listed as a physical dataset.
Configuring Directories as Datasets
Groups of files with the same structure in a common directory can be queried together as if they were a single table.
To configure a directory as a dataset:
Navigate into the filesystem data source you have set up in Dremio, such as HDFS. You will see a list of directories.
Click a directory to see its files.
(Optional) Configure each of these files individually to make each one a dataset that Dremio can query. Alternatively, if all the files share a common structure, you can configure the directory as a dataset, and all the files will be queried together as if they were a single table.
To configure the directory:
Hover over the directory to view the configuration button.
Click the button on the right that shows a directory pointing to a directory with a table icon.
Next you will see the dialog for configuring the data in the directory, similar to the dialog for configuring a single file.
Dremio will sample several files in the directory to guide you through the setup. The options presented here will depend on the format of the files in the directory.
Click Save and view the contents of the directory. The directory contents now appear as a single dataset.
To view the new dataset, navigate back to the data source. The directory is now listed as a physical dataset instead of a directory.
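As a rough mental model of what "queried together as a single table" means, the sketch below reads every delimited file in a directory and yields the rows as one combined table. It is only an illustration in plain Python, not how Dremio reads files; the function name `read_directory_as_table` and the `*.csv` file pattern are assumptions made for this example.

```python
import csv
from pathlib import Path


def read_directory_as_table(directory: str, delimiter: str = ","):
    """Yield rows from every CSV file in a directory as one combined table.

    Assumes all files share the same header row (a common structure).
    """
    for file_path in sorted(Path(directory).glob("*.csv")):
        with file_path.open(newline="") as handle:
            # Each file contributes its rows to the same logical table.
            for row in csv.DictReader(handle, delimiter=delimiter):
                yield row
```

Calling `list(read_directory_as_table("some/directory"))` on a directory of files with identical headers returns the rows of all files in file-name order.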
When working with a partitioned dataset, Dremio automatically discovers partition directory structures and makes partition values available as additional fields for that dataset.
By default, Hive/Glue sources may contain a maximum of 300,000 partitions.
Example Directory Structure
For the following directory structure, Dremio includes three (3) additional fields (named dirN) that represent the values for the partitions.
<DATASET_NAME> / <YEAR_VALUE> / <MONTH_VALUE> / <DAY_VALUE> -- (e.g. myTable / 2018 / February / 15)
The top directory (year) is called dir0, the second (month) dir1, and the third (day) dir2.
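The mapping from directory levels to dirN fields can be sketched in a few lines of Python. This is an illustration of the naming scheme only, not Dremio's implementation; the helper `partition_fields` and the file name `part-0.csv` are hypothetical.

```python
from pathlib import PurePosixPath


def partition_fields(dataset_root: str, file_path: str) -> dict:
    """Map each directory level below the dataset root to a dirN field."""
    relative = PurePosixPath(file_path).relative_to(dataset_root)
    # Every path component except the file name is a partition directory.
    levels = relative.parts[:-1]
    return {f"dir{i}": value for i, value in enumerate(levels)}


# Uses the myTable / 2018 / February / 15 example from the text.
fields = partition_fields("myTable", "myTable/2018/February/15/part-0.csv")
print(fields)
```

For the example path, `fields` maps dir0 to the year, dir1 to the month, and dir2 to the day.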
Querying Partitioned Datasets
When querying these datasets, having filters on the partition columns restricts Dremio to only accessing and scanning relevant partitions, greatly enhancing query performance.
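To illustrate why partition filters help, consider a query along the lines of `SELECT * FROM myTable WHERE dir0 = '2018' AND dir1 = 'February'` (the exact table path depends on your source). The sketch below mimics the pruning idea in plain Python: paths whose directory values do not match the filters are dropped before any file is read. The function `prune_partitions` and the sample file names are assumptions for this example, not part of Dremio.

```python
def prune_partitions(paths, filters):
    """Keep only the paths whose dirN values match every filter."""
    kept = []
    for path in paths:
        # Skip the dataset name (first part) and the file name (last part).
        levels = path.split("/")[1:-1]
        values = {f"dir{i}": value for i, value in enumerate(levels)}
        if all(values.get(field) == wanted for field, wanted in filters.items()):
            kept.append(path)
    return kept


paths = [
    "myTable/2018/February/15/data.csv",
    "myTable/2018/March/01/data.csv",
    "myTable/2017/February/15/data.csv",
]
to_scan = prune_partitions(paths, {"dir0": "2018", "dir1": "February"})
print(to_scan)
```

Only one of the three sample paths survives the filter, so only that partition would need to be scanned.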
Running Queries on Parquet-based Datasets
When running queries with filters on Parquet-based datasets, if there are files that only include a single value for a field included in the filter condition, Dremio accesses and scans only the relevant files – even if there isn’t any explicit directory structure for partitioning. This is achieved by inspecting Parquet file footers and using this information for partition pruning at query time.
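The footer-based pruning idea can be sketched as follows. Parquet footers carry per-column statistics such as minimum and maximum values, so a file whose range cannot contain the filter value can be skipped without reading its data. This is a simplified model under stated assumptions: `FooterStats` is a hypothetical stand-in for real footer metadata, the field values are invented, and Dremio's actual logic is not shown here.

```python
from dataclasses import dataclass


@dataclass
class FooterStats:
    """Hypothetical stand-in for one column's min/max statistics in a Parquet footer."""
    path: str
    min_value: str
    max_value: str


def files_to_scan(footers, wanted):
    """Return the files whose [min, max] range could contain the wanted value."""
    return [f.path for f in footers if f.min_value <= wanted <= f.max_value]


footers = [
    FooterStats("part-0.parquet", "DE", "DE"),  # single value: min == max
    FooterStats("part-1.parquet", "FR", "FR"),  # single value: min == max
    FooterStats("part-2.parquet", "AT", "US"),  # mixed values, cannot be skipped
]
selected = files_to_scan(footers, "FR")
print(selected)
```

Files that hold a single value for the filtered field behave like implicit partitions: they are either fully relevant or can be skipped entirely.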