Amazon S3
Amazon S3 is an object storage service from AWS.
Supported Formats
Dremio can query data stored in S3 in file formats (including delimited, Excel (XLSX), JSON, and Parquet) and table formats (including Apache Iceberg and Delta Lake).
Add an Amazon S3 Source
To add an S3 source:
- From the Datasets page, next to Sources, click the icon to add a source.
- In the Add Data Source dialog, under Object Storage, click Amazon S3.
General
To configure an S3 source:
- Name – In the Name field, enter a name for the Amazon S3 source. The name cannot include the following special characters: /, :, [, or ].
- Authentication – Provide the role that Dremio will assume to gain access to the source:
  - Create an AWS IAM role in your AWS account that trusts Dremio.
  - Add an S3 Access Policy to your custom role that provides access to your S3 source.
  - Add the Role ARN to the source configuration.
- Public Buckets – (Optional) Click Add bucket and enter the public S3 bucket URL. You can add multiple public S3 buckets. AWS credentials are not necessary if you are accessing only public S3 buckets.
- Encrypt Connection – (Optional) To secure the connections between the S3 buckets and Dremio, select the Encrypt connection checkbox.
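The naming rule for the Name field can be checked before you fill in the dialog. The following is a minimal sketch in Python; the helper names `invalid_chars` and `is_valid_source_name` are hypothetical and are not part of any Dremio API, and the non-empty check is an added assumption rather than a documented rule.

```python
# Characters the source name may not contain, per the Name rule above.
FORBIDDEN_CHARS = set("/:[]")


def invalid_chars(name):
    """Return the forbidden characters found in `name`, in order of first appearance."""
    found = []
    for ch in name:
        if ch in FORBIDDEN_CHARS and ch not in found:
            found.append(ch)
    return found


def is_valid_source_name(name):
    """True when `name` is non-empty (an assumption) and has no forbidden characters."""
    return bool(name) and not invalid_chars(name)


if __name__ == "__main__":
    for candidate in ("sales_s3", "sales/s3", "logs[2024]"):
        print(candidate, is_valid_source_name(candidate), invalid_chars(candidate))
```

Returning the offending characters, rather than only a boolean, makes it easier to tell a user exactly what to remove from the name.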
Advanced Options
Click Advanced Options in the left menu sidebar.
- Apply requester-pays to S3 requests – The requester (instead of the bucket owner) pays the cost of the S3 request and the data downloaded from the S3 bucket.
- Enable file status check – Enabled by default; clear the checkbox to disable. When enabled, Dremio checks whether a file exists in the S3 bucket before proceeding, so that errors are handled gracefully. Disable this option when there are no files missing from the S3 bucket or when the files' access permissions have not changed. Disabling this option reduces the amount of communication with the S3 bucket.
- Root Path – The root path for the Amazon S3 bucket. The default root path is /.
- VPC-restricted S3 buckets are not supported.
- Server-side encryption key ARN – Add the ARN of the key created in AWS Key Management Service (KMS) if you want to store passwords in AWS KMS. Ensure that the AWS credentials you share with Dremio have access to this key.
- Default CTAS Format – Choose the default format for tables you create in Dremio: either Parquet or Iceberg (default).
- Connection Properties – Provide custom key-value pairs for the connection relevant to the source. Click Add Property. For Name, enter a connection property. For Value, enter the corresponding connection property value.
- Allowlisted buckets – Add an approved S3 bucket in the text field. You can add multiple S3 buckets. When using this option to add specific S3 buckets, you will only be able to see those buckets and not all the buckets that may be available in the source. Buckets entered must be valid. Misspelled or nonexistent buckets will not appear in the resulting source.
Under Cache Options:
- Enable local caching when possible – Selected by default, along with asynchronous access for cloud caching. Uncheck the checkbox to disable this option.
- Max percent of total available cache space to use when possible – Specifies the disk quota, as a percentage, that a source can use on any single executor node only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either manually enter a percentage in the value field or use the arrows to the far right to adjust the percentage.
Reflection Refresh
Click Reflection Refresh in the left menu sidebar. This section allows you to manage how often reflections are refreshed and how long data can be served before expiration. To learn more about reflections, refer to Manual Reflections. All settings are optional.
You can set the following refresh policies for reflections:
- Refresh period – Manage the refresh period by either enabling the option to never refresh or setting a refresh frequency in hours, days, or weeks. The default frequency to refresh reflections is every hour.
- Expiration period – Set the expiration period for the length of time that data can be served by either enabling the option to never expire or setting an expiration time in hours, days, or weeks. The default expiration time is three hours.
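How the two periods interact can be illustrated with a small Python sketch. It only models the settings as described above (refresh every hour and expire after three hours by default); it is not Dremio's implementation, and the function name `reflection_status` and its status labels are hypothetical.

```python
from datetime import datetime, timedelta

# Defaults stated above: refresh every hour, expire after three hours.
DEFAULT_REFRESH_PERIOD = timedelta(hours=1)
DEFAULT_EXPIRATION_PERIOD = timedelta(hours=3)


def reflection_status(last_refresh, now,
                      refresh_period=DEFAULT_REFRESH_PERIOD,
                      expiration_period=DEFAULT_EXPIRATION_PERIOD):
    """Classify a reflection as 'fresh', 'refresh_due', or 'expired'.

    Passing None for a period models the "never refresh" or
    "never expire" option (an assumption made for this sketch).
    """
    age = now - last_refresh
    if expiration_period is not None and age >= expiration_period:
        return "expired"
    if refresh_period is not None and age >= refresh_period:
        return "refresh_due"
    return "fresh"


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    for minutes in (30, 120, 240):
        print(minutes, reflection_status(t0, t0 + timedelta(minutes=minutes)))
```

With the defaults, data that was refreshed two hours ago is due for a refresh but can still be served, while data older than three hours can no longer be served.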
Metadata
Click Metadata in the left menu sidebar. This section allows you to configure settings to refresh metadata and enable other dataset options.
You can configure Dataset Handling and Metadata Refresh parameters.
Dataset Handling
Select from the following options. All settings are optional.
- Remove dataset definitions if underlying data is unavailable – By default, Dremio removes dataset definitions if underlying data is unavailable. This option is for scenarios when files are temporarily deleted and added back in the same location with new sets of files.
- Automatically format files into tables when users issue queries – Enable this option to allow Dremio to automatically format files into tables when you run queries. This option is for scenarios when the data contains CSV files with non-default options.
Metadata Refresh
The Metadata Refresh parameters include Dataset Discovery and Dataset Details.
- Dataset Discovery – The refresh interval for fetching top-level source object names such as databases and tables. Use this parameter to set the time interval. You can choose to set the frequency to fetch object names in minutes, hours, days, or weeks. The default frequency to fetch object names is one hour.
- Dataset Details – The metadata that Dremio needs for query planning, such as information required for fields, types, shards, statistics, and locality. The following describes the parameters that fetch the dataset information.
  - Fetch mode – You can choose to fetch only from queried datasets, which is set by default. Dremio updates details for previously queried objects in a source. Fetching from all datasets is deprecated.
  - Fetch every – You can choose to set the frequency to fetch dataset details in minutes, hours, days, or weeks. The default frequency to fetch dataset details is one hour.
  - Expire after – You can choose to set the expiry time of dataset details in minutes, hours, days, or weeks. The default expiry time of dataset details is three hours.
Privileges
Click Privileges in the left menu sidebar. This section allows you to grant privileges to specific users or roles. To learn more about how Dremio allows for the implementation of granular-level privileges, see Privileges. All settings are optional.
To add a privilege for a user or role:
- In the Add User/Role field, enter the user or role name to which you want to apply privileges.
- Click Add to Privileges. The user or role is added to the Users table.
To set privileges for a user or role:
- In the Users table, identify the user or role for which you want to set privileges and click under the appropriate column (Select, Alter, Create Table, etc.) to either enable or disable that privilege. A green checkmark indicates that the privilege is enabled.
- Click Save.
Edit an Amazon S3 Source
To edit an S3 source:
- From the Datasets page, right-click on the source to edit and select Settings.
- In the Edit Source dialog box, make changes as needed. For information about the settings in each category, see Add an Amazon S3 Source.
- Click Save.
Remove an Amazon S3 Source
To remove an S3 source:
- From the Datasets page, right-click on the source to be removed and select Delete.
- Confirm that you want to remove the source.
Create an AWS IAM Role
To create an AWS IAM role that provides Dremio with access to your source:
- Sign in to the AWS Identity and Access Management (IAM) console.
- From the left menu pane, under Access management, select Roles.
- On the Roles page, click Create role.
- On the Select trusted entity page:
  - Under Trusted entity type, select the radio button for Custom Trust Policy.
  - Delete the current JSON policy and paste in the custom trust policy template:
    Custom Trust Policy Template
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowAssumeRoleWithExternalId",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<dremio_trust_account>:root"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<project_id>"
            }
          }
        },
        {
          "Sid": "AllowTagSessionFromCallerRole",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<dremio_trust_account>:root"
          },
          "Action": "sts:TagSession"
        }
      ]
    }
  - Replace <dremio_trust_account> with your Dremio Trust Account ID.
  - From the side navigation bar, choose Project settings and copy your Project ID. Replace <project_id> with your Project ID.
- Click Next to go to the Add permissions page. No edits are needed on this page.
- Click Next to go to the Name, review, and create page.
- In the Role details section, in the Role name field, enter a name for this role.
- Click Create role.
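If you prefer to generate the trust policy rather than edit the template by hand, the substitution of the two placeholders can be sketched in Python as follows. The function name `build_trust_policy` is hypothetical, and the account ID and project ID in the usage example are made-up placeholders; the policy structure itself mirrors the custom trust policy template in this section.

```python
import json


def build_trust_policy(dremio_trust_account, project_id):
    """Return the custom trust policy with both placeholders filled in."""
    principal_arn = f"arn:aws:iam::{dremio_trust_account}:root"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allows the trusted account to assume the role, but only
                # when it presents your Project ID as the external ID.
                "Sid": "AllowAssumeRoleWithExternalId",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": project_id}
                },
            },
            {
                # Allows the same principal to tag the assumed-role session.
                "Sid": "AllowTagSessionFromCallerRole",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:TagSession",
            },
        ],
    }


if __name__ == "__main__":
    # Illustrative values only; substitute your own IDs.
    policy = build_trust_policy("123456789012", "example-project-id")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Custom Trust Policy editor in place of the hand-edited template.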
Add an S3 Access Policy to a Custom Role
To add the required S3 access policy to your custom role:
- On the Roles page, click the role name. Use the Search field to locate the role if needed.
- From the Roles page, in the Permissions section, click Add permissions > Create inline policy.
- On the Create policy page, click the JSON tab.
- Delete the current JSON policy and paste in the IAM Policy Template for S3. Replace <bucket-name> with the name of your S3 bucket.
- Click Next.
- On the Review policy page, in the Name field, enter a name for the policy.
- Click Create policy. The policy is created and you are returned to the Roles page.
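The `<bucket-name>` substitution in the steps above can also be scripted, which helps catch a placeholder that was missed. The following Python sketch assumes a hypothetical helper `fill_bucket_name`. The `EXAMPLE_TEMPLATE` string is an illustrative stand-in built from standard S3 actions, not the IAM Policy Template for S3 itself; paste the real template in its place.

```python
import json
import re

PLACEHOLDER = "<bucket-name>"


def fill_bucket_name(template_text, bucket_name):
    """Substitute the bucket name into a policy template and parse it as JSON."""
    if PLACEHOLDER not in template_text:
        raise ValueError(f"template does not contain {PLACEHOLDER}")
    filled = template_text.replace(PLACEHOLDER, bucket_name)
    # Any remaining <placeholder> token means the template was not fully filled in.
    leftover = re.findall(r"<[a-z_-]+>", filled)
    if leftover:
        raise ValueError(f"unreplaced placeholders: {leftover}")
    return json.loads(filled)  # raises json.JSONDecodeError if malformed


# Illustrative stand-in only (an assumption); use the real template here.
EXAMPLE_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<bucket-name>"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::<bucket-name>/*"]
    }
  ]
}
"""

if __name__ == "__main__":
    policy = fill_bucket_name(EXAMPLE_TEMPLATE, "my-example-bucket")
    print(json.dumps(policy, indent=2))
```

Parsing the filled template with `json.loads` confirms that the result is still valid JSON before you paste it into the JSON tab.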