Amazon S3

Amazon S3 is an object storage service from AWS.

Supported Formats

Dremio can query data stored in S3 in file formats (including delimited, Excel (XLSX), JSON, and Parquet) and table formats (including Apache Iceberg and Delta Lake).

Connect to Amazon S3

  1. In the Dremio console, click Add Data on the Home page.
  2. In the Add Data dialog, select Amazon S3.
  3. Configure the connection using the sections below, then click Save.

General

  • Name – In the Name field, enter a name for the Amazon S3 connection. The name cannot include the following special characters: /, :, [, or ].
  • Authentication – Provide the ARN of the IAM role that Dremio will assume to gain access to S3. See Create an AWS IAM Role.
  • Public Buckets – (Optional) Click Add bucket and enter the public S3 bucket URL. You can add multiple public S3 buckets. AWS credentials are not necessary if you are accessing only public S3 buckets.
  • Encrypt Connection – (Optional) To secure the connections between the S3 buckets and Dremio, select the Encrypt connection checkbox.
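
The restriction on the Name field can be checked before you open the dialog. The following is a minimal sketch, assuming only the four forbidden characters listed above; the function name is hypothetical and Dremio may apply further validation that is not described here.

```python
# Characters the connection name cannot include, per the Name field description.
FORBIDDEN_NAME_CHARS = set("/:[]")


def invalid_name_chars(name: str) -> list[str]:
    """Return the forbidden characters found in `name`, in order of first appearance."""
    found = []
    for ch in name:
        if ch in FORBIDDEN_NAME_CHARS and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    print(invalid_name_chars("sales-data"))     # no forbidden characters
    print(invalid_name_chars("s3:/raw[2024]"))  # contains all four
```

An empty result means the name passes this particular check.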

Advanced Options

  • Apply requester-pays to S3 requests – The requester (instead of the bucket owner) pays the cost of the S3 request and the data downloaded from the S3 bucket.
  • Enable file status check – Enabled by default; uncheck the box to disable. When enabled, Dremio checks whether a file exists in the S3 bucket before proceeding, so that it can handle errors gracefully. Disable this option when there are no files missing from the S3 bucket or when the file's access permissions have not changed. Disabling this option reduces the amount of communication with the S3 bucket.
  • Enable partition column inference – If a dataset uses Parquet files and the data is partitioned on one or more columns, enabling this option will append a column named dir<n> for each partition level and use subfolder names for values in those columns. Dremio detects the name of the partition column, appends a column that uses that name, detects values in the names of subfolders, and uses those values in the appended column.
  • Root Path – The root path for the Amazon S3 bucket. The default root path is /.
    • VPC-restricted S3 buckets are not supported.
  • Server-side encryption key ARN – Add the ARN of the key created in AWS Key Management Service (KMS) if you want to store passwords in AWS KMS. Ensure that the AWS credentials you share with Dremio have access to this key.
  • Default CTAS Format – Choose the default format for tables you create in Dremio: either Parquet or Iceberg (default).
  • Allowlisted buckets – Add an approved S3 bucket in the text field. You can add multiple S3 buckets. When using this option to add specific S3 buckets, you will only be able to see those buckets and not all the buckets that may be available. Buckets entered must be valid. Misspelled or nonexistent buckets will be ignored.
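
To illustrate the partition column inference option described above, the sketch below maps the subfolders of an object path to appended columns. It is not Dremio's implementation. It assumes, for illustration only, that a subfolder named `name=value` yields a column called `name`, that any other subfolder yields a positional column `dir<n>`, and that numbering starts at 0. The function name is hypothetical.

```python
def infer_partition_columns(relative_path: str) -> dict[str, str]:
    """Map each subfolder level of `relative_path` to a column name and value.

    Assumption for illustration: a subfolder named `name=value` yields a column
    called `name`; any other subfolder yields a positional column `dir<n>`.
    """
    parts = [p for p in relative_path.split("/") if p]
    folders = parts[:-1]  # the last element is the file name, not a partition
    columns = {}
    for level, folder in enumerate(folders):
        name, sep, value = folder.partition("=")
        if sep and name and value:
            columns[name] = value  # column name detected from the folder name
        else:
            columns[f"dir{level}"] = folder  # positional column for this level
    return columns


if __name__ == "__main__":
    print(infer_partition_columns("year=2024/month=01/part-0.parquet"))
    print(infer_partition_columns("2024/01/part-0.parquet"))
```

Under these assumptions, every row read from a given file carries the same values in the appended columns, because the values come from the path and not from the file contents.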

Under Cache Options:

  • Enable local caching when possible – Selected by default, along with asynchronous access for cloud caching. Uncheck the checkbox to disable this option.
  • Max percent of total available cache space to use when possible – Specifies the disk quota, as a percentage, available for this connection on any single executor node when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either manually enter a percentage in the value field or use the arrows to the far right to adjust the percentage.

Reflection Refresh

  • Never refresh: Select to prevent automatic Reflection refresh; otherwise, the default is to refresh automatically.
  • Refresh every: Define how often to refresh Reflections, specified in hours, days, or weeks. This option is ignored if Never refresh is selected.
  • Set refresh schedule: Specify the daily or weekly schedule.
  • Never expire: Select to prevent Reflections from expiring; otherwise, the default is to expire automatically after the time limit specified in Expire after.
  • Expire after: The time limit after which Reflections expire and are removed from Dremio, specified in hours, days, or weeks. This option is ignored if Never expire is selected.

Metadata

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable – By default, Dremio removes dataset definitions if underlying data is unavailable. This option is intended for scenarios in which files are temporarily deleted and then added back in the same location as new sets of files.
  • Automatically format files into tables when users issue queries – Enable this option to allow Dremio to automatically format files into tables when you run queries. This option is intended for scenarios in which the data contains CSV files with non-default options.

Metadata Refresh

  • Dataset Discovery – The refresh interval for fetching top-level object names such as databases and tables. You can set the interval in minutes, hours, days, or weeks. The default is one hour.
  • Dataset Details – The metadata that Dremio needs for query planning, such as information required for fields, types, shards, statistics, and locality. The following parameters control how this metadata is fetched:
    • Fetch mode – Fetch only from queried datasets, which is the default. Dremio updates details for previously queried objects. Fetching from all datasets is deprecated.
    • Fetch every – How often to fetch dataset details, in minutes, hours, days, or weeks. The default is one hour.
    • Expire after – How long dataset details are kept before they expire, in minutes, hours, days, or weeks. The default is three hours.

Privileges

This connection inherits privileges from Project settings. To grant specific users or roles additional privileges in this connection:

  1. Enter the username or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
  2. For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
  3. Click Save after setting the configuration.

See Privileges for additional information about privileges.

Edit an Amazon S3 Connection

  1. On the Open Catalog page, under Connections, right-click the connection and select Settings.
  2. Update the connection configuration as needed.
  3. Click Save.

Delete an Amazon S3 Connection

  1. On the Open Catalog page, under Connections, right-click the connection and select Delete.
  2. Click Delete to confirm.

Create an AWS IAM Role

To create an AWS IAM role for this connection:

  1. Sign in to the AWS Identity and Access Management (IAM) console.

  2. From the left menu pane, under Access management, select Roles.

  3. On the Roles page, click Create role.

  4. On the Select trusted entity page:

    • Under Trusted entity type, select the radio button for Custom Trust Policy.

    • Delete the current JSON policy and paste in the custom trust policy template:

      Custom Trust Policy Template

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "AllowAssumeRoleWithExternalId",
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::<dremio_trust_account>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
              "StringEquals": {
                "sts:ExternalId": "<project_id>"
              }
            }
          },
          {
            "Sid": "AllowTagSessionFromCallerRole",
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::<dremio_trust_account>:root"
            },
            "Action": "sts:TagSession"
          }
        ]
      }

    • Replace <dremio_trust_account> with your Dremio Trust Account ID.

    • In the Dremio console, click Settings in the side navigation bar and choose Project settings to copy your Project ID. Replace <project_id> with your Project ID.

  5. Click Next to go to the Add permissions page. No edits are needed on this page.

  6. Click Next to go to the Name, review, and create page.

  7. In the Role details section, in the Role name field, enter a name for this role.

  8. Click Create role.
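
If you prefer to generate the custom trust policy from step 4 instead of editing it by hand, it can be built with the Python standard library. The following is a minimal sketch: the function name is hypothetical, and the account ID and Project ID in the usage lines are made-up placeholders, not real values.

```python
import json


def build_trust_policy(dremio_trust_account: str, project_id: str) -> str:
    """Return the custom trust policy template with both placeholders filled in."""
    principal = {"AWS": f"arn:aws:iam::{dremio_trust_account}:root"}
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAssumeRoleWithExternalId",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
                # The external ID condition ties the role to one Dremio project.
                "Condition": {"StringEquals": {"sts:ExternalId": project_id}},
            },
            {
                "Sid": "AllowTagSessionFromCallerRole",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
            },
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_trust_policy("123456789012", "example-project-id"))
```

The printed JSON can be pasted into the Custom Trust Policy editor. If you script role creation with boto3 instead of using the console, the same string is the kind of value the IAM client's `create_role` call accepts as `AssumeRolePolicyDocument`.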

Add an S3 Access Policy to a Custom Role

To add the required S3 access policy to your custom role:

  1. On the Roles page, click the role name. Use the Search field to locate the role if needed.
  2. On the page for the role, in the Permissions section, click Add permissions > Create inline policy.
  3. On the Create policy page, click the JSON tab.
  4. Delete the current JSON policy and paste in the IAM Policy Template for S3. Replace <bucket-name> with the name of your S3 bucket.
  5. Click Next.
  6. On the Review policy page, in the Name field, enter a name for the policy.
  7. Click Create policy. The policy is created and you are returned to the Roles page.
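
The placeholder replacement in step 4 can also be scripted. The sketch below is illustrative only: `TOY_TEMPLATE` is a stand-in that is not the IAM Policy Template for S3, the function name is hypothetical, and the bucket name check is a simplified version of the S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit).

```python
import json
import re

# Simplified check of S3 bucket naming rules; it does not cover every rule.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

# Stand-in for illustration only; NOT the IAM Policy Template for S3.
TOY_TEMPLATE = (
    '{"Resource": ["arn:aws:s3:::<bucket-name>", "arn:aws:s3:::<bucket-name>/*"]}'
)


def fill_bucket_name(template: str, bucket_name: str) -> str:
    """Replace every <bucket-name> placeholder and confirm the result is valid JSON."""
    if not _BUCKET_NAME.match(bucket_name):
        raise ValueError(f"not a valid S3 bucket name: {bucket_name!r}")
    if "<bucket-name>" not in template:
        raise ValueError("template contains no <bucket-name> placeholder")
    filled = template.replace("<bucket-name>", bucket_name)
    json.loads(filled)  # raises ValueError if the filled template is not valid JSON
    return filled


if __name__ == "__main__":
    print(fill_bucket_name(TOY_TEMPLATE, "my-data-bucket"))
```

Validating the name before substitution catches typos such as uppercase letters or underscores before the policy reaches the IAM console.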