Amazon S3

This topic provides information for configuring the Amazon S3 data source.

Working with files stored in S3

You can query files and directories stored in your S3 buckets. Dremio supports a number of different file formats. See Files and Directories for more information.

Amazon Configuration

Amazon configuration involves:

  • Providing AWS credentials
  • Providing IAM Policy requirements

Amazon S3 Credentials

To list your AWS account's S3 buckets as a source, you must provide your AWS credentials in the form of your access and secret keys. You can find instructions for creating these keys in Amazon's Access Key documentation.

[info] Note: AWS credentials are not necessary if you are accessing only public S3 buckets.

Sample IAM Policy for Accessing S3

The following sample IAM policy shows the minimum permissions that allow Dremio to read and query S3.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1554422960000",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Sid": "Stmt1554423012000",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME"
            ]
        },
        {
            "Sid": "Stmt1554423050000",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        }
    ]
}
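As a minimal sketch, the read policy above can be generated for a specific bucket so that BUCKET-NAME does not have to be edited by hand. The function name read_only_policy and the bucket name my-example-bucket are hypothetical; the Sid fields are left out because IAM treats them as optional.

```python
import json


def read_only_policy(bucket_name: str) -> dict:
    """Build the sample read policy for a single bucket (illustrative helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Account-wide: list buckets and look up their locations.
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListAllMyBuckets"],
                "Resource": ["arn:aws:s3:::*"],
            },
            {
                # Bucket-level: list the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level: read the objects themselves.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-example-bucket" is a placeholder, not a real bucket.
    print(json.dumps(read_only_policy("my-example-bucket"), indent=4))
```

Note that s3:ListBucket applies to the bucket ARN, while s3:GetObject applies to the objects under it (the /* suffix).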

Sample IAM Policy for Writing to S3

The following sample IAM policy shows the minimum permissions that allow Dremio to write to S3, for example, to store reflections on S3.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME",
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:HeadBucket"
            ],
            "Resource": "*"
        }
    ]
}
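Before attaching a policy, it can help to check that it grants the four bucket and object actions listed in the write policy above. The following is a rough sketch, not an IAM evaluator: it only collects the action names in Allow statements and ignores wildcards, Deny statements, conditions, and resource scoping. The names WRITE_ACTIONS, allowed_actions, and missing_write_actions are hypothetical.

```python
# Actions taken from the first statement of the sample write policy.
WRITE_ACTIONS = {
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:DeleteObject",
}


def allowed_actions(policy: dict) -> set:
    """Collect the action names that appear in Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also valid JSON policy syntax
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        if isinstance(action, str):  # "Action" may be a string or a list
            action = [action]
        actions.update(action)
    return actions


def missing_write_actions(policy: dict) -> set:
    """Return the write-related actions that the policy does not list."""
    return WRITE_ACTIONS - allowed_actions(policy)
```

An empty result means every action from the sample is named explicitly. It does not prove that the policy grants them on the right bucket.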

Dremio Configuration

General

  • Authentication
    • AWS Access Key method -- Makes available all buckets (or only the whitelisted buckets, if a whitelist is specified) that are associated with this access key or with the IAM role to assume (if specified). See Advanced Options for information about whitelisting buckets.
      • AWS Access Key -- AWS access key.
      • AWS Access Secret -- AWS access secret.
      • IAM Role to Assume -- Used in conjunction with AWS Access Key method.
    • EC2 Metadata method -- Makes available all buckets (or only the whitelisted buckets, if a whitelist is specified) that are associated with the IAM role attached to the EC2 instance or with the IAM role to assume (if specified). See Advanced Options for information about whitelisting buckets.
      • IAM Role to Assume -- Used in conjunction with the EC2 Metadata method.
    • No Authentication -- Only the buckets provided in Public Buckets will be available.
    • Encrypt connection -- Enables secure connections.
  • Public Buckets -- A list of external buckets that are not included with the provided AWS account credentials.

Advanced Options

Advanced options include:

  • Enable asynchronous access when possible (default)
  • Enable exports into the source (CTAS and DROP)
  • Enable compatibility mode (experimental)
  • Root Path -- Root path for the source.
  • Connection Properties -- A list of additional connection properties.
  • Whitelisted buckets -- A list of buckets to whitelist.
  • Cache Options
    • Enable local caching when possible
    • Max percent of total available cache space to use when possible.

[warning] WARNING

If your S3 datasets include large Parquet files with 100 or more columns, you must raise the maximum number of S3 connections that each Dremio processing unit is allowed to open. To do this, add a connection property named fs.s3a.connection.maximum and set it to a value greater than the default of 100.

Connecting through a proxy server

Optionally, you can configure your S3 source to connect through a proxy. To do so, add the following connection properties in the settings for your S3 source:

Property Name            Description
fs.s3a.proxy.host        Proxy host.
fs.s3a.proxy.port        Proxy port number.
fs.s3a.proxy.username    Username for authenticated connections (optional).
fs.s3a.proxy.password    Password for authenticated connections (optional).
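If the proxy is known as a single URL, the property values can be derived from it. The sketch below uses only the Python standard library; the function name proxy_properties and the example host proxy.example.com are hypothetical, and the resulting name/value pairs still have to be entered as connection properties on the source.

```python
from urllib.parse import urlsplit


def proxy_properties(proxy_url: str) -> dict:
    """Map a proxy URL to the fs.s3a.proxy.* connection properties."""
    parts = urlsplit(proxy_url)
    if not parts.hostname or parts.port is None:
        raise ValueError("proxy URL must include a host and a port")
    props = {
        "fs.s3a.proxy.host": parts.hostname,
        "fs.s3a.proxy.port": str(parts.port),
    }
    # Username and password are optional; add them only when present.
    if parts.username:
        props["fs.s3a.proxy.username"] = parts.username
    if parts.password:
        props["fs.s3a.proxy.password"] = parts.password
    return props


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    for name, value in proxy_properties("http://proxy.example.com:3128").items():
        print(f"{name} = {value}")
```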

Connecting to a bucket in AWS GovCloud

To connect to a bucket in AWS GovCloud, set the correct GovCloud endpoint for your S3 source. To do so, add the following connection property in the settings for your S3 source:

Property Name      Description
fs.s3a.endpoint    GovCloud endpoint, for example s3-us-gov-west-1.amazonaws.com

Reflection Refresh

  • Never refresh -- Specifies how often to refresh based on hours, days, weeks, or never.
  • Never expire -- Specifies how often to expire based on hours, days, weeks, or never.

Metadata

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable (Default).
    If this box is not checked and the underlying files under a folder are removed or the folder/source is not accessible, Dremio does not remove the dataset definitions. This option is useful in cases when files are temporarily deleted and put back in place with new sets of files.
  • Automatically format files into physical datasets when users issue queries.
    If this box is checked and a query runs against an unpromoted PDS or folder, Dremio automatically promotes it using default options. If you have CSV files, especially ones that need non-default options, consider leaving this box unchecked.

Metadata Refresh

  • Dataset Discovery -- Refresh interval for top-level source object names such as names of DBs and tables.
    • Fetch every -- Specify fetch time based on minutes, hours, days, or weeks. Default: 1 hour
  • Dataset Details -- The metadata that Dremio needs for query planning such as information needed for fields, types, shards, statistics, and locality.
    • Fetch mode -- Specify either Only Queried Datasets, All Datasets, or As Needed. Default: Only Queried Datasets
      • Only Queried Datasets -- Dremio updates details for previously queried objects in a source.
        This mode increases query performance because less work is needed at query time for these datasets.
      • All Datasets -- Dremio updates details for all datasets in a source. This mode increases query performance because less work is needed at query time.
      • As Needed -- Dremio updates details for a dataset at query time. This mode minimizes metadata queries on a source when it is not in use, but might lead to longer planning times.
    • Fetch every -- Specify fetch time based on minutes, hours, days, or weeks. Default: 1 hour
    • Expire after -- Specify expiration time based on minutes, hours, days, or weeks. Default: 3 hours

Sharing

You can specify which users can edit. Options include:

  • All users can edit.
  • Specific users can edit.

Configuring S3 for Minio

As of Dremio 3.2.3, Minio is offered as an experimental S3-compatible plugin.

To configure your S3 source for Minio in the Dremio UI:

  1. Under Advanced Options, check Enable compatibility mode (experimental).
  2. Under Advanced Options > Connection Properties, add fs.s3a.path.style.access and set the value to true.
    Note: This setting ensures that the request path is created correctly when using IP addresses or hostnames as the endpoint.
  3. Under Advanced Options > Connection Properties, add the fs.s3a.endpoint property and its corresponding server endpoint value (IP address).
    Limitation: The endpoint value cannot contain the http(s):// prefix. For example, if the endpoint is http://123.1.2.3:9000, the value is 123.1.2.3:9000.

To configure your S3 source for Minio with an encrypted connection enabled:

  1. Use OpenSSL to generate a self-signed certificate (see Securing Access to Minio Servers), or use an existing self-signed certificate.
  2. Start up Minio server with ./minio server [data folder] --certs-dir [certs directory].
  3. Install Dremio.
  4. In your client environment where Dremio is located, install the certificate into /jre/lib/security with the following command:
    <JAVA_HOME>/keytool -import -v -trustcacerts -alias alias -file cert-file -keystore cacerts -keypass changeit -storepass changeit
    Note: Replace alias with the alias name you want and replace cert-file with the absolute path of the certificate file used to start up the Minio server.
  5. Start up Dremio.
  6. In the Dremio UI, add and configure an Amazon S3 data source with the Minio plug-in.
    1. Under the General tab, specify the AWS Access Key and AWS Access Secret provided by your Minio server.
    2. Under the General tab, check Encrypt Connection.
    3. Under Advanced Options, check Enable compatibility mode (experimental).
    4. Under Advanced Options > Connection Properties, add fs.s3a.path.style.access and set the value to true.
      Note: This setting ensures that the request path is created correctly when using IP addresses or hostnames as the endpoint.
    5. Under Advanced Options > Connection Properties, add the fs.s3a.endpoint property and its corresponding server endpoint value (IP address).
      Limitation: The endpoint value cannot contain the http(s):// prefix. For example, if the endpoint is http://123.1.2.3:9000, the value is 123.1.2.3:9000.
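The keytool command in step 4 can also be assembled programmatically, which avoids quoting mistakes when the certificate path contains spaces. The sketch below only builds and prints the argument list; it mirrors the flags shown in step 4, while the function name keytool_import_command, the alias minio, and the certificate path are hypothetical placeholders.

```python
import shlex


def keytool_import_command(keytool, alias, cert_file,
                           keystore="cacerts", password="changeit"):
    """Build the keytool argument list from step 4 (nothing is executed here)."""
    return [
        keytool, "-import", "-v", "-trustcacerts",
        "-alias", alias,
        "-file", cert_file,
        "-keystore", keystore,
        "-keypass", password,
        "-storepass", password,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own keytool path, alias, and certificate.
    cmd = keytool_import_command("keytool", "minio", "/path/to/public.crt")
    print(shlex.join(cmd))
```

To run the command rather than print it, the list can be passed to subprocess.run(cmd, check=True) from the directory that contains the cacerts keystore.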

Configuring Minio as a Distributed Store

Minio can be used as a distributed store. Note that Minio works as a distributed store for both SSL and unencrypted connections. See Configuring Distributed Storage for more information.

Configuring Cloud Cache

As of Dremio 4.0 Enterprise Edition, cloud caching is available. See Cloud Cache and Configuring Cloud Cache for more information.

Configuring KMS Encryption for Distributed Store

As of Dremio 4.0, AWS Key Management Service (KMS) is available for the S3 distributed store. See Configuring Distributed Storage for more information.

For More Information

See the following Minio documentation for more information:

