On this page

    Formatting Data to a Physical Dataset

    This topic covers how to query the data from your data lake sources by creating physical datasets.

    Overview

    Dremio Cloud allows you to query the data from your data lake sources without ingesting or copying it. After configuring your data source, you can format the data in your source to a physical dataset, which will create a table-like representation of your data so that it can be queried in Dremio Cloud using SQL. You can format individual files or a folder of files, which will create one dataset with the data from the folder. You can query the file or folder without creating a physical dataset but performance may be impacted.

    This functionality is currently only supported with Amazon S3.

    Supported File Formats

    • Apache Iceberg
    • Delta Lake
    • Excel (.xlsx)
    • JSON
    • Parquet
    • Text

    note:

    Promoting folders that contain a mix of file formats is not supported. (All the files in a folder must be of the same format.)

    Formatting a File or Folder as a Physical Dataset

    To format a file or folder as a physical dataset:

    1. In the Datasets UI, navigate to the file or folder that you want to format.

    2. To format a file or folder:

      File: In the Action column, click on the Format File button for the file that you want to format.

      Folder: In the Action column, click on the Format Folder button for the folder that you want to format.

    3. In the Dataset Settings dialog, for Format, verify that the correct format has been detected. The following formats are supported:

    1. After entering specifying your parameters, click Save. The parameter values will be auto-detected but can be altered. When you click Save, your physical dataset will appear in the Dataset UI (represented by a purple dataset icon).

    Supported Formats and Settings

    Excel

    For Excel, you can specify the following parameters:

    • Extract Field Names: Check this box to extract the column names from the first line of the file.
    • Expand Merged Cells: Check this box to expand out cells that have been merged in the Excel sheet.
    • Sheet Name: Specify the sheet name if there are multiple sheets within the file.

    JSON

    There are no parameters to specify for JSON format.

    Parquet

    There are no parameters to specify for parquet format.

    Text (delimited)

    For Text (delimited), you can specify the following parameters:

    • Field Delimiter: Select the delimiter in your text file - Comma, Tab, Pipe, or Custom. For Custom, enter the characters used for a delimiter in the text box.
    • Quote: Select the character that is used for quotes in your file - Single Quote, Double Quote, Custom. For Custom, enter the characters used for quotes in the text box.
    • Comment: Select the character that is used for comments in your file - Number Sign, Double Slash, Custom. For Custom, enter the characters used for comments in the text box.
    • Line Delimiter: Select the character that is used for a line delimiter in your file - CRLF, LF, Custom. For Custom, enter the characters used for a line delimiter in the text box.
    • Escape: Select the character that is used for to escape in your file - Double Quote, Back Quote, Backslack, Custom. For Custom, enter the characters used to escape in the text box.
    • Extract Field Names: Check this box to extract the column names from the first line of the file.
    • Skip First Line: Check this box to skip the first line of the file.
    • Trim Field Names: Check this box to trim the names of the columns (How many characters are they trimmed at??). This option is checked by default.

    Partitioned Data

    When data within is a source is partitioned using folders, Dremio Cloud automatically discovers the structure and creates additional fields (or columns) in the dataset for each partition. When querying partitioned datasets, having filters on the partitioned columns restricts Dremio Cloud to access and scan only the relevant partitions, which greatly enhances query performance.

    Examples

    If the path is the following, after formatting, the physical dataset called monthly_consumption will be created with all the fields in monthly_consumption.parquet with an additional column dir0 that will contain the name of the year and month folder for each row.

    -- <dataset_name>/<year and month>/<name of individual file>
    monthly_consumption/2021-01/monthly_consumption.parquet
    

    If the path is the following, after formatting, the physical dataset called monthly_consumption will be created with all the fields in monthly_consumption.parquet with additional columns dir0 containing the name of the year folder and dir1 containing the name of the month folder for each row.

    -- <dataset_name>/<year>/<month>/<name of individual file>
    monthly_consumption/2020/April/monthly_consumption.parquet
    

    Enabling Automatic Formatting of Data

    You can configure a source to automatically format the data located in the source to physical datasets when a user triggers a query on the data for the first time.

    To configure a data lake source to automatically format data for a source:

    1. In the Data Lakes window, locate the data lake source for which you want to enable this property.

    2. In the Actions column, click the Settings button (gear) and select Edit Details.

    3. In the Edit Source dialog, navigate to the Metadata tab.

    4. In the Metadata tab, click the checkbox to Automatically format files into physical datasets when users issue queries.

    5. Click Save.

    Removing Formatting on Data

    Removing the formatting on a physical dataset will revert the dataset to the folder or file format that it was originally in.

    To remove the formatting on data:

    1. On the Datasets page, locate the physical dataset for which you want to remove formatting.
    2. In the Action column, click the ellipse icon and select Remove Format.
    3. In the Remove Format dialog, confirm that you want to remove formatting for the selected dataset. Any virtual datasets that have been created from this physical dataset will be disconnected from the parent.