Amazon S3
Amazon S3 is an object storage service that stores and retrieves large amounts of data. In order to connect to Amazon S3 as a source, you must configure both your AWS and Dremio accounts.
If you are connecting Dremio to public S3 buckets, then AWS credentials are not required. In this case, you can skip Configuring Your AWS Account and proceed directly to Adding an S3 Source.
Supported Formats
Dremio can query the data stored in S3 in file formats (including delimited, Excel (XLSX), JSON, and Parquet) and table formats (including Apache Iceberg and Delta Lake).
Configuring Your AWS Account
To allow Dremio to access the data in your Amazon S3 source, you must configure your AWS account using one of the following authentication methods: project data credentials or data source credentials. Additionally, to enable Dremio to read and query the S3 source and/or write to it, you need to modify IAM permissions during configuration by attaching the appropriate IAM policies. These IAM policy templates are provided after the authentication instructions.
Review each authentication method below and choose the one that best meets your needs.
Authentication Using Project Data Credentials
Write access to a bucket is required if you want to use Dremio to perform CRUD operations on Apache Iceberg tables or to run CTAS (CREATE TABLE AS) statements.
Use project data credentials to enable Dremio to access Amazon S3 using the IAM role that is associated with your Dremio project. This IAM role was created when you signed up for Dremio and is the default credential that is used to access all the sources in your project.
For this option, you will attach the necessary IAM policies to your Dremio project's IAM role. The following policies are available:
- Enable Dremio to read and query the S3 source.
- Enable Dremio to write to the S3 source.
For instructions to set up these policies and attach them to an IAM role, see Set up AWS IAM Permissions.
Authentication Using Data Source Credentials
Use data source credentials to enable Dremio to access Amazon S3 using either a source-specific access key or an IAM role. This method provides you the flexibility to assign either an access key or an IAM role to each source that is in your Amazon S3 account.
The following IAM policies are available:
- Enable Dremio to read and query the S3 source.
- Enable Dremio to write to the S3 source.
Choose one of the following authentication methods to access the data source. During this setup, you will attach the IAM policies that provide Dremio read and/or write access to the data source.
- Use an access key: Create an IAM user in your AWS account. The access key is generated during the setup process. For the steps, see Creating an IAM user.
- Use a new IAM role: For the steps to create a new role, see Creating an IAM role.
- Use the Dremio project IAM role: To attach these policy templates to the Dremio project's IAM role, see Set up AWS IAM Permissions.
IAM Policy Template for Read and Query Access to S3
The following IAM policy template contains the minimum policy requirements to allow Dremio to read and query your S3 source. You can copy this policy and use it when you configure your AWS account using one of the authentication methods.
Make the following edits before using the policy in your IAM console:
- Replace `<bucket-name>` with the name of your S3 bucket.
- Remove all the comments contained in the IAM policy template.

Note: To add multiple buckets, edit the `Resource` attribute value by adding an array of S3 buckets. For example, `"Resource" : ["arn:aws:s3:::<bucket-name1>", "arn:aws:s3:::<bucket-name2>"]`.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    # Allow Dremio to enumerate S3 buckets and their locations within the account.
    {
      "Sid": "Stmt1554422960000",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ]
    },
    # Allow Dremio Cloud to list the content of the Project Store bucket.
    {
      "Sid": "Stmt1554423012000",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>"
      ]
    },
    # Allow Dremio Cloud to retrieve a file from the Project Store bucket.
    {
      "Sid": "Stmt1554423050000",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>/*"
      ]
    }
  ]
}
```
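Because the template's `#` comment lines are not valid JSON, they must be removed and the `<bucket-name>` placeholders filled in before the policy will paste cleanly into the IAM console. As a sketch of automating those two edits (the helper name `prepare_policy` is my own, not part of Dremio or AWS), a small script can strip the comments and expand the placeholders into one ARN per bucket:

```python
import json

def prepare_policy(template: str, buckets: list[str]) -> str:
    """Strip full-line '#' comments and fill in bucket names, returning valid JSON."""
    # Drop the annotation lines used in the policy template.
    lines = [ln for ln in template.splitlines()
             if not ln.lstrip().startswith("#")]
    policy = json.loads("\n".join(lines))
    for stmt in policy["Statement"]:
        # Expand each <bucket-name> placeholder into one ARN per bucket.
        new_resources = []
        for arn in stmt["Resource"]:
            if "<bucket-name>" in arn:
                new_resources.extend(arn.replace("<bucket-name>", b) for b in buckets)
            else:
                new_resources.append(arn)
        stmt["Resource"] = new_resources
    return json.dumps(policy, indent=2)
```

Passing several bucket names produces the array form shown in the note above, for example `["arn:aws:s3:::<bucket-name1>", "arn:aws:s3:::<bucket-name2>"]`.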
IAM Policy Template for Write Access to S3
The following IAM policy template includes the minimum policy requirements to allow Dremio to write to S3 (for example, to create tables). You can copy this policy and use it when you configure your AWS account using one of the authentication methods.
Make the following edits before using the policy in your IAM console:
- Replace `<bucket-name>` with the name of your S3 bucket.
- Remove all the comments contained in the IAM policy template.

Note: To add multiple buckets, edit the `Resource` attribute value by adding an array of S3 buckets. For example, `"Resource" : ["arn:aws:s3:::<bucket-name1>", "arn:aws:s3:::<bucket-name2>"]`.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    # Allow Dremio Cloud to read and write files in a bucket.
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    },
    # Allow Dremio Cloud to list the buckets.
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets"
      ],
      "Resource": "*"
    }
  ]
}
```
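The multiple-bucket note applies to the write policy as well, with one wrinkle: the first statement needs two ARNs per bucket, one for the bucket itself (for `s3:ListBucket`) and one for its objects (for `s3:GetObject`, `s3:PutObject`, and `s3:DeleteObject`). A minimal sketch of generating that `Resource` array (the helper name is hypothetical, not a Dremio or AWS API):

```python
def write_policy_resources(buckets):
    """Build the Resource array for the write-access statement:
    a bucket ARN plus an object ARN for each bucket."""
    arns = []
    for bucket in buckets:
        arns.append(f"arn:aws:s3:::{bucket}")      # bucket-level: s3:ListBucket
        arns.append(f"arn:aws:s3:::{bucket}/*")    # object-level: Get/Put/DeleteObject
    return arns
```

For two buckets, this yields a four-element array that can replace the `Resource` value in the `VisualEditor0` statement.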
Adding an S3 Source
To add an S3 source to your project:
- From the Datasets page, in the Data Lakes section, click (+) Add Source.
Alternatively, click Data Lakes to display all data lake sources. From the top-right of the page, click the Add Data Lake button.
- In the Add Data Source dialog, under Object Storage, click Amazon S3.
The New Amazon S3 Source dialog box appears, which contains the following sections:
- General settings: required fields to configure an S3 source
- Optional settings: Advanced Options, Reflection Refresh, Metadata, Privileges
Refer to the following for guidance on how to complete each section.
Sources containing a large number of files or tables may take longer to be added. During this time, the source name is grayed out and shows a spinner icon, indicating the source is being added. Once complete, the source becomes accessible.
General
To configure an S3 source:
- In the Name field, enter a name for the Amazon S3 source you are connecting to. The name cannot include the following special characters: `/`, `:`, `[`, or `]`.
- Under Authentication, select your preferred authentication method:
- Choose Project Data Credentials if you prefer to use the default credentials that allow access to all sources in your project. These credentials were added when you originally signed up for Dremio and created your project. For setup instructions using this method, see Project Data Credentials with Access Key/IAM Role.
- Choose Data Source Credentials if you prefer to use credentials to access a specific source. These credentials are created as part of the source configuration set up. You can choose to set up these credentials using either an access key or with an IAM role.
- To set up the credentials using an access key, see Data Source Credentials with Access Key
- To set up the credentials using an IAM role, see Data Source Credentials with IAM Role
Optionally, you can set up secure connections to public S3 buckets.
- (Optional) Under Public Buckets, click Add bucket and enter the public S3 bucket URL. You can add multiple public S3 buckets.
- (Optional) To secure the connections between the S3 buckets and Dremio, tick the Encrypt connection checkbox.
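The naming rule in step 1 excludes four characters. If you script source creation (for example, against the Dremio REST API), it can help to validate names before submitting them; this check is a sketch based only on the characters listed above, and the function name is my own:

```python
# Characters that an Amazon S3 source name may not contain, per the rule above.
FORBIDDEN_CHARS = set("/:[]")

def is_valid_source_name(name: str) -> bool:
    """Return True if the name is non-empty and contains none of / : [ ]."""
    return bool(name) and not (set(name) & FORBIDDEN_CHARS)
```

For example, `is_valid_source_name("sales-data")` passes, while a name such as `raw/logs` is rejected.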
Advanced Options
Click Advanced Options in the left menu sidebar.
All advanced options are optional.
Review each option provided in the following table to set up the advanced options to meet your needs.
Advanced Option | Description |
---|---|
Enable asynchronous access when possible | Enabled by default; uncheck the box to disable. Enables cloud caching for the S3 bucket to support simultaneous actions, such as adding and editing a new S3 source. |
Enable compatibility mode | Enables you to use S3-compatible storage, such as MinIO, as an S3 source. |
Apply requester-pays to S3 requests | The requester (instead of the bucket owner) pays the cost of the S3 request and the data downloaded from the S3 bucket. |
Enable file status check | Enabled by default; uncheck the box to disable. Enables Dremio to check whether a file exists in the S3 bucket before proceeding, so that errors are handled gracefully. Disable this option when no files are missing from the S3 bucket or when file access permissions have not changed; disabling it reduces the amount of communication with the S3 bucket. |
Root Path | The root path for the Amazon S3 bucket. The default root path is /. |
Server side encryption key ARN | Add the ARN key created in AWS Key Management Service (KMS) if you want to store passwords in AWS KMS. Ensure that the AWS credentials that you share with Dremio have access to this ARN key. |
Default CTAS Format | Choose the default format for tables you create in Dremio, either Parquet or Iceberg. |
Connection Properties | Provide the custom key-value pairs for the connection relevant to the source. |
Allowlisted buckets | Add an approved S3 bucket in the text field. You can add multiple S3 buckets this way. When using this option to add specific S3 buckets, you will only be able to see those buckets and not all the buckets that may be available in the source. Buckets entered must be valid. Misspelled or non-existent buckets will not appear in the resulting source. |
To configure your S3 source to use server-side encryption based on a provided key (SSE-C) or KMS (SSE-KMS), set the following connection properties:

- SSE-C:
  - `fs.s3a.server-side-encryption-algorithm` set to `SSE-C`
  - `fs.s3a.server-side-encryption.key` set to the key used on the objects in S3
- SSE-KMS:
  - `fs.s3a.server-side-encryption-algorithm` set to `SSE-KMS`
  - `fs.s3a.server-side-encryption.key` set to the ARN used on the objects in S3
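For the SSE-KMS case, the two entries would be added as rows under Connection Properties; shown here in properties form, where the key ARN is a placeholder for your own KMS key, not a real value:

```properties
fs.s3a.server-side-encryption-algorithm=SSE-KMS
fs.s3a.server-side-encryption.key=arn:aws:kms:<region>:<account-id>:key/<key-id>
```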
Under Cache Options, review the following table and edit the options to meet your needs.
Cache Options | Description |
---|---|
Enable local caching when possible | Selected by default, along with asynchronous access for cloud caching. Uncheck the checkbox to disable this option. For more information about local caching, see Columnar Cloud Cache. |
Max percent of total available cache space to use when possible | Specifies the disk quota, as a percentage, that a source can use on any single executor node when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either manually enter a percentage in the value field or use the arrows to the far right to adjust the percentage. |
If your S3 datasets include large Parquet files with 100 or more columns, then you must edit the number of maximum connections to S3 that each processing unit of Dremio is allowed to spawn. To change the maximum connections:
- Under Connection Properties, click Add Property.
- For Name, enter `fs.s3a.connection.maximum`.
- For Value, enter a custom value greater than the default of 100.
Reflection Refresh
Click Reflection Refresh in the left menu sidebar. This section lets you manage how often reflections are refreshed and how long data can be served before expiration. To learn more about reflections, refer to Accelerating Queries with Reflections.
All reflection parameters are optional.
You can set the following refresh policies for reflections:
- Refresh period: Manage the refresh period by either enabling the option to never refresh or setting a refresh frequency in hours, days, or weeks. The default frequency to refresh reflections is every hour.
- Expiration period: Set the expiration period for the length of time that data can be served by either enabling the option to never expire or setting an expiration time in hours, days, or weeks. The default expiration time is set to three hours.
Metadata
Click Metadata in the left menu sidebar. This section lets you configure settings to refresh metadata and enable other dataset options.
All metadata parameters are optional.
You can configure Dataset Handling and Metadata Refresh parameters.
Dataset Handling
You can review each option provided in the following table to set up the dataset handling options to meet your needs.
Parameter | Description |
---|---|
Remove dataset definitions if underlying data is unavailable | By default, Dremio removes dataset definitions if the underlying data is unavailable. This option is useful for scenarios in which files are temporarily deleted and then added back in the same location with new sets of files. |
Automatically format files into tables when users issue queries | Enable this option to allow Dremio to automatically format files into tables when you run queries. This option is for scenarios when the data contains CSV files with non-default options. |
Metadata Refresh
The Metadata Refresh parameters include Dataset Discovery and Dataset Details.
- Dataset Discovery: The refresh interval for fetching top-level source object names such as databases and tables. Use this parameter to set the time interval. You can set the frequency to fetch object names in minutes, hours, days, or weeks. The default frequency is one hour.
- Dataset Details: The metadata that Dremio needs for query planning, such as information about fields, types, shards, statistics, and locality. The following table describes the parameters that fetch the dataset information.

Parameter | Description |
---|---|
Fetch mode | By default, Dremio fetches details only from queried datasets, updating details for previously queried objects in a source. Fetching from all datasets is deprecated. |
Fetch every | Set the frequency to fetch dataset details in minutes, hours, days, or weeks. The default frequency is one hour. |
Expire after | Set the expiry time of dataset details in minutes, hours, days, or weeks. The default expiry time is three hours. |
Privileges
Click Privileges in the left menu sidebar. This section lets you grant privileges to specific users or roles. To learn more about how Dremio allows for the implementation of granular-level privileges, see Privileges.
All privileges parameters are optional.
To add a privilege for a user or to a role:
- In the Add User/Role field, enter the user or role name that you want to apply privileges to.
- Click Add to Privileges. The user or role is added to the Users table.
To set privileges for a user or to a role:
- In the Users table, identify the user to set privileges for and click under the appropriate column (Select, Alter, Create Table, etc.) to either enable or disable that privilege. A green checkmark indicates that the privilege is enabled.
- Click Save.
Editing an S3 Source
Once an S3 source has been created, you can edit it as needed. To edit an S3 source:
- From the Datasets page, on the bottom-left of the page, click Data Lakes. A list of data lakes displays.
- In the All Data Lakes section, under the Actions column, hover over a data lake to display the hidden Settings (gear) icon, and click the icon > Edit Details.
Alternatively, you can click the name of the data lake, and, from the resulting data lake page, on the upper-right of the page, click the Settings (gear) icon.
- In the Edit Source dialog box, you can make changes in the General settings, updating the Authentication credentials or the connections to public S3 buckets, as needed.
You cannot change the name of the Amazon S3 source.
You can also make changes to any of the optional settings, including Advanced Options, Reflection Refresh, Metadata, and Privileges. For information about these settings and guidance on the changes you can make, see Adding an S3 Source.
- Click Save.
Removing an S3 Source
To remove an S3 source:
- From the Datasets page, on the bottom-left of the page, click Data Lakes. A list of data lakes displays.
- In the All Data Lakes section, under the Actions column, hover over a data lake to display the hidden Settings (gear) icon, and click the icon > Remove Source.
- Confirm that you want to remove the source.
Sources containing a large number of files or tables may take longer to be removed. During this time, the source name is grayed out and shows a spinner icon, indicating the source is being removed. Once complete, the source disappears.
Limitations
- You cannot set connection properties for a proxy server when you configure the Amazon S3 bucket.
- VPC-restricted S3 buckets are not supported.