AWS Glue

Dremio supports S3 datasets cataloged in AWS Glue as a Dremio data source.

Note: S3 files must be one of the following formats:

  • Parquet
  • ORC
  • Delimited text files (CSV/TSV)

AWS S3 and Glue Credentials

Dremio administrators need credentials that can access files in AWS S3 and list databases and tables in the AWS Glue Catalog. Dremio recommends using the sample AWS managed policy provided below when configuring a new Glue Catalog data source. See Dremio Configuration for more information about supported authentication mechanisms.

Sample AWS IAM Policy for Accessing AWS S3 and Glue

Dremio recommends using the following AWS managed policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:GetTableVersions",
                "glue:GetTables",
                "glue:GetConnection",
                "glue:GetConnections",
                "glue:GetDevEndpoint",
                "glue:GetDevEndpoints",
                "glue:GetUserDefinedFunction",
                "glue:GetUserDefinedFunctions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket"
            ],
            "Resource": [
                "arn:aws:s3:::aws-glue-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::aws-glue-*/*",
                "arn:aws:s3:::*/*aws-glue-*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::crawler-public*",
                "arn:aws:s3:::aws-glue-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:/aws-glue/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Condition": {
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": [
                        "aws-glue-service-resource"
                    ]
                }
            },
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:security-group/*",
                "arn:aws:ec2:*:*:instance/*"
            ]
        }
    ]
}
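Before attaching a policy to the credentials Dremio will use, it can help to confirm that the policy document actually allows the Glue read actions listed above. The following is a minimal sketch that uses only the Python standard library. The names `ACTIONS_TO_CHECK`, `allowed_action_patterns`, and `missing_actions` are hypothetical, and the subset of actions is chosen for illustration rather than taken from a Dremio requirement. The check is deliberately simplified: it looks only at `Allow` statements and ignores `Deny` statements, `Resource` scoping, and `Condition` blocks.

```python
import fnmatch
import json

# Illustrative subset of the Glue read actions from the sample policy above.
# This list is an assumption for the example, not an official Dremio requirement.
ACTIONS_TO_CHECK = [
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:BatchGetPartition",
]


def _as_list(value):
    """IAM accepts either a single value or a list in most policy fields."""
    return list(value) if isinstance(value, list) else [value]


def allowed_action_patterns(policy):
    """Collect the action patterns of every Allow statement in the policy."""
    patterns = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") == "Allow":
            patterns.extend(_as_list(statement.get("Action", [])))
    return patterns


def missing_actions(policy, actions=ACTIONS_TO_CHECK):
    """Return the actions that no Allow pattern matches (case-insensitive)."""
    patterns = [p.lower() for p in allowed_action_patterns(policy)]
    return [
        action
        for action in actions
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    ]


if __name__ == "__main__":
    # A deliberately incomplete policy, used only to exercise the check.
    example = json.loads(
        '{"Version": "2012-10-17", "Statement": ['
        '{"Effect": "Allow", "Action": ["glue:Get*"], "Resource": ["*"]}]}'
    )
    print("Missing actions:", missing_actions(example))
```

An empty result means every checked action is matched by at least one `Allow` pattern. It does not prove that AWS will authorize the call, because the sketch does not evaluate resources, conditions, or explicit denies.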

Dremio Configuration

Dremio administrators are responsible for the following Dremio configuration tasks:

  • Configure Dremio access to AWS Glue Catalog and AWS S3 datasets
  • Verify default settings for asynchronous access and local caching
  • Verify or update refresh policies for Data Reflections and metadata
  • Specify which Dremio users have edit access to the AWS Glue data source

Administrators can later access these settings and update the initial configuration by editing the data source.

General

Dremio administrators configuring access to AWS Glue Catalog and AWS S3 datasets specify one of three authentication methods:

  • AWS Access Key method -- All buckets associated with the access key, or with the IAM role to assume if one is provided, will be available. If whitelisted buckets are specified, only those buckets will be available.
  • EC2 Metadata method -- All buckets associated with the IAM role attached to the EC2 instance, or with the IAM role to assume if one is provided, will be available. If whitelisted buckets are specified, only those buckets will be available.
  • No Authentication -- Only the buckets provided in Public Buckets will be available.
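The bucket-visibility rules for the three methods can be summarized in a few lines of Python. This is an illustrative model only, not Dremio's API: `AuthMethod`, `GlueSourceSettings`, and `visible_buckets` are hypothetical names, and treating the whitelist as a filter over the buckets the credentials can reach is one interpretation of the descriptions above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AuthMethod(Enum):
    ACCESS_KEY = "AWS Access Key"
    EC2_METADATA = "EC2 Metadata"
    NONE = "No Authentication"


@dataclass
class GlueSourceSettings:
    """Hypothetical container for the General settings described above."""
    auth_method: AuthMethod
    whitelisted_buckets: List[str] = field(default_factory=list)
    public_buckets: List[str] = field(default_factory=list)


def visible_buckets(settings, credential_buckets):
    """Return the buckets available to the source, sorted by name.

    credential_buckets is the list of buckets associated with the access key,
    the EC2 instance role, or the assumed IAM role (assumed to be known here).
    """
    if settings.auth_method is AuthMethod.NONE:
        # Without authentication, only the listed public buckets are available.
        return sorted(set(settings.public_buckets))
    if settings.whitelisted_buckets:
        # Assumption: a whitelist narrows the credential's buckets.
        return sorted(set(credential_buckets) & set(settings.whitelisted_buckets))
    return sorted(set(credential_buckets))


if __name__ == "__main__":
    settings = GlueSourceSettings(AuthMethod.ACCESS_KEY, whitelisted_buckets=["sales"])
    print(visible_buckets(settings, ["logs", "sales", "staging"]))
```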

Note: Dremio encrypts connections to AWS Glue by default.

Advanced Options

By default, Dremio enables asynchronous access and local caching when possible. These settings appear in the Advanced Options modal.

Reflection Refresh

Specify how frequently Dremio refreshes Data Reflections based on the Glue data source in the Reflection Refresh tab. By default, Dremio refreshes Data Reflections every hour and expires them after three hours.
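To make the default timing concrete, the sketch below computes when a Data Reflection would next be refreshed and when it would expire. It assumes that both intervals are measured from the time of the last successful refresh, which is an interpretation of the defaults and not a statement about Dremio's internal scheduler. The names `REFRESH_INTERVAL`, `EXPIRE_AFTER`, and `reflection_schedule` are hypothetical.

```python
from datetime import datetime, timedelta

# Default values taken from the Reflection Refresh description.
REFRESH_INTERVAL = timedelta(hours=1)
EXPIRE_AFTER = timedelta(hours=3)


def reflection_schedule(last_refresh):
    """Return (next_refresh, expires_at) for a given last refresh time.

    Assumption: both intervals are counted from the last refresh.
    """
    return last_refresh + REFRESH_INTERVAL, last_refresh + EXPIRE_AFTER


if __name__ == "__main__":
    next_refresh, expires_at = reflection_schedule(datetime(2024, 1, 1, 12, 0))
    print("Next refresh:", next_refresh.isoformat())
    print("Expires at:  ", expires_at.isoformat())
```

Under this reading, a Data Reflection that fails to refresh stays usable for up to two further refresh intervals before it expires.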

Metadata

Specify how, and how frequently, Dremio refreshes metadata on the Metadata tab. By default, Dremio fetches top-level objects and dataset details every hour. To improve query performance, Dremio retrieves details only for queried datasets by default.

Sharing

Specify which users have edit access to the data source in the Sharing tab. Dremio allows all users to edit the data source by default.

