    AWS Glue Data Catalog

    Dremio supports Amazon S3 datasets cataloged in AWS Glue as a Dremio data source.

    Note:
    S3 files must be one of the following formats:

    • Parquet
    • ORC
    • Delimited text files (CSV/TSV)

    Amazon S3 and AWS Glue Credentials

    Dremio administrators need credentials to access files in Amazon S3 and list databases and tables in the Glue Catalog. Dremio recommends using the provided sample AWS managed policy when configuring a new Glue Catalog data source. See Dremio Configuration for more information about supported authentication mechanisms.

    AWS IAM Policy for Accessing Amazon S3 and AWS Glue

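    Dremio recommends using the following AWS managed policy. The lines that begin with # are annotations only; JSON does not support comments, so remove them before attaching the policy in AWS IAM.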

    {
        "Version": "2012-10-17",
        "Statement": [
            # Allow Dremio to run the listed AWS Glue API operations.
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "glue:GetTable",
                    "glue:GetTableVersions",
                    "glue:GetTables",
                    "glue:GetConnection",
                    "glue:GetConnections",
                    "glue:GetDevEndpoint",
                    "glue:GetDevEndpoints",
                    "glue:GetUserDefinedFunction",
                    "glue:GetUserDefinedFunctions",
                    "glue:BatchGetPartition"
                ],
                "Resource": [
                    "*"
                ]
            },
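            # Allow Dremio to read and write objects in buckets whose names start with 'aws-glue-', and under folders whose names contain 'aws-glue-' in any bucket.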
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-glue-*/*",
                    "arn:aws:s3:::*/*aws-glue-*/*"
                ]
            },
            # Allow Dremio to read objects in Amazon S3 buckets whose names start with 'crawler-public' or 'aws-glue-'.
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::crawler-public*",
                    "arn:aws:s3:::aws-glue-*"
                ]
            },
            # Allow Dremio to create or delete the 'aws-glue-service-resource' tag on EC2 network interfaces, security groups, and instances.
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateTags",
                    "ec2:DeleteTags"
                ],
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "aws:TagKeys": [
                            "aws-glue-service-resource"
                        ]
                    }
                },
                "Resource": [
                    "arn:aws:ec2:*:*:network-interface/*",
                    "arn:aws:ec2:*:*:security-group/*",
                    "arn:aws:ec2:*:*:instance/*"
                ]
            }
        ]
    }
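
    One way to turn the annotated policy into JSON that a parser accepts is to drop the comment lines first. The following is a minimal Python sketch, not part of Dremio or AWS tooling: the helper name strip_comment_lines is hypothetical, and the embedded policy is an abbreviated stand-in for the full policy above.

```python
import json


def strip_comment_lines(annotated: str) -> str:
    """Drop every line whose first non-blank character is '#'."""
    kept = [
        line
        for line in annotated.splitlines()
        if not line.lstrip().startswith("#")
    ]
    return "\n".join(kept)


# Abbreviated stand-in for the annotated policy (one statement only).
ANNOTATED_POLICY = """
{
    "Version": "2012-10-17",
    "Statement": [
        # Allow Dremio to read and write files in a bucket.
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::aws-glue-*/*"]
        }
    ]
}
"""

# Parsing succeeds only after the comment lines are removed.
policy = json.loads(strip_comment_lines(ANNOTATED_POLICY))

for statement in policy["Statement"]:
    print(statement["Effect"], ", ".join(statement["Action"]))
# prints: Allow s3:GetObject, s3:PutObject
```

    Only whole-line comments are removed, so a # character inside a JSON string value is left untouched.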
    

    Dremio Configuration

    Dremio administrators are responsible for the following Dremio configuration tasks:

    • Configure Dremio access to AWS Glue Catalog and Amazon S3 datasets
    • Verify default settings for asynchronous access and local caching
    • Verify or update refresh policies for Data Reflections and metadata
    • Specify which Dremio users have edit access to the AWS Glue data source

    Administrators can later access these settings and update the initial configuration by editing the data source.

    General

    Dremio administrators configuring access to AWS Glue Catalog and Amazon S3 datasets specify one of the following four authentication methods.

    Authentication

    • AWS Access Key method – Dremio can access all buckets, or only the whitelisted buckets if any are specified, that are associated with this access key or, if provided, with the IAM role to assume.
    • EC2 Metadata method – Dremio can access all buckets, or only the whitelisted buckets if any are specified, that are associated with the IAM role attached to the EC2 instance or, if provided, with the IAM role to assume.
    • AWS Profile – Dremio sources profile credentials from the specified AWS profile. For information on how to set up a configuration or credentials file for AWS, see AWS Custom Authentication.
      • Profile Name (Optional) – The AWS profile name. If left blank, Dremio uses the default profile. For more information about using profiles in a credentials or configuration file, see the AWS documentation on Configuration and credential file settings.
    • No Authentication – Only the buckets provided in Public Buckets will be available.
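
    The Profile Name fallback described for the AWS Profile method can be illustrated with a short Python sketch. This is not Dremio's implementation: the function resolve_profile, the profile name dremio-glue, and the key values are hypothetical placeholders. Only the [default] section name and the aws_access_key_id and aws_secret_access_key keys follow the standard AWS credentials file layout.

```python
import configparser

# Hypothetical credentials file contents; the key values are placeholders.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = EXAMPLE_DEFAULT_KEY_ID
aws_secret_access_key = EXAMPLE_DEFAULT_SECRET

[dremio-glue]
aws_access_key_id = EXAMPLE_GLUE_KEY_ID
aws_secret_access_key = EXAMPLE_GLUE_SECRET
"""


def resolve_profile(credentials_text: str, profile_name: str = "") -> dict:
    """Return the settings of the named profile, or of 'default' if blank."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    name = profile_name or "default"  # a blank name falls back to the default
    if not parser.has_section(name):
        raise KeyError(f"profile not found: {name}")
    return dict(parser[name])


print(resolve_profile(SAMPLE_CREDENTIALS)["aws_access_key_id"])
print(resolve_profile(SAMPLE_CREDENTIALS, "dremio-glue")["aws_access_key_id"])
```

    The first call leaves the profile name blank and so reads the [default] section; the second reads the named profile.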

    Note:

    Dremio encrypts connections to AWS Glue by default.

    Advanced Options

    By default, Dremio enables asynchronous access and local caching when possible. Both settings are available under Advanced Options.

    Lake Formation Integration

    Lake Formation provides access controls and allows administrators to define security policies. For instructions on enabling this functionality and for details on the configuration options below, see the Integrating with Lake Formation page.

    • Enforce AWS Lake Formation access permissions on datasets. Dremio checks every dataset in this source for the permissions required to run queries.
    • Prefix to map Dremio users to AWS ARNs. Leave blank to default to the end user’s username, or enter a regular expression.
    • Prefix to map Dremio groups to AWS ARNs. Leave blank to default to the end user’s group, or enter a regular expression.

    Reflection Refresh

    Specify how frequently Dremio refreshes Data Reflections based on the Glue data source in the Reflection Refresh tab. By default, Dremio refreshes Reflections every hour and expires them after three hours.
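
    As a worked example of the default intervals, the Python sketch below computes when a Reflection would next be refreshed and when it would expire. It assumes both intervals are measured from the time of the last refresh, which the text above does not state explicitly; the function name reflection_schedule is hypothetical.

```python
from datetime import datetime, timedelta

# Default intervals stated in the documentation.
REFRESH_INTERVAL = timedelta(hours=1)
EXPIRE_AFTER = timedelta(hours=3)


def reflection_schedule(last_refresh: datetime) -> tuple[datetime, datetime]:
    """Return (next refresh time, expiry time) for a given last refresh.

    Assumption: both intervals start at the last refresh time.
    """
    return last_refresh + REFRESH_INTERVAL, last_refresh + EXPIRE_AFTER


last_refresh = datetime(2024, 1, 1, 12, 0)
next_refresh, expires_at = reflection_schedule(last_refresh)
print(next_refresh)  # 2024-01-01 13:00:00
print(expires_at)    # 2024-01-01 15:00:00
```

    Under this assumption, a Reflection stays usable through two missed hourly refreshes before it expires.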

    Metadata

    Specify how and how frequently Dremio refreshes metadata on the Metadata tab. By default, Dremio fetches top-level objects and dataset details every hour. Dremio retrieves details only for queried datasets by default to improve query performance.

    Sharing

    Specify which users have edit access to the data source in the Sharing tab. Dremio allows all users to edit the data source by default.