AWS Glue
Dremio supports S3 datasets cataloged in AWS Glue as a Dremio data source.
Note: S3 files must be in one of the following formats:
- Parquet
- ORC
- Delimited text files (CSV/TSV)
AWS S3 and Glue Credentials
Dremio administrators need credentials to access files in AWS S3 and to list databases and tables in the AWS Glue Catalog. Dremio recommends using the provided sample AWS managed policy when configuring a new Glue Catalog data source. See Dremio Configuration for more information about supported authentication mechanisms.
Sample AWS IAM Policy for Accessing AWS S3 and Glue
Dremio recommends using the following AWS managed policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTableVersions",
        "glue:GetTables",
        "glue:GetConnection",
        "glue:GetConnections",
        "glue:GetDevEndpoint",
        "glue:GetDevEndpoints",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket"
      ],
      "Resource": [
        "arn:aws:s3:::aws-glue-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::aws-glue-*/*",
        "arn:aws:s3:::*/*aws-glue-*/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::crawler-public*",
        "arn:aws:s3:::aws-glue-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags",
        "ec2:DeleteTags"
      ],
      "Condition": {
        "ForAllValues:StringEquals": {
          "aws:TagKeys": [
            "aws-glue-service-resource"
          ]
        }
      },
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*",
        "arn:aws:ec2:*:*:security-group/*",
        "arn:aws:ec2:*:*:instance/*"
      ]
    }
  ]
}
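Before attaching a customized policy, it can help to check that it still allows the read-only Glue actions listed in the sample above. The following is a minimal sketch, not a Dremio or AWS tool: the function names are hypothetical, and treating this subset of actions as "required" is an assumption made for illustration. It only inspects Allow statements and their Action patterns; it ignores Deny statements, NotAction, Resource, and Condition.

```python
from fnmatch import fnmatchcase

# Read-only Glue actions copied from the sample policy above. Treating this
# subset as "required" is an assumption made for illustration.
REQUIRED_GLUE_ACTIONS = (
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:BatchGetPartition",
)


def _as_list(value):
    """IAM accepts either a single value or a list for Statement and Action."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allowed_patterns(policy):
    """Collect the Action patterns from every Allow statement."""
    patterns = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") == "Allow":
            patterns.extend(_as_list(statement.get("Action")))
    return patterns


def missing_actions(policy, required=REQUIRED_GLUE_ACTIONS):
    """Return the required actions that no Allow statement matches.

    Action names are compared case-insensitively, and "*" / "?" in a
    policy pattern are treated as wildcards.
    """
    patterns = [p.lower() for p in allowed_patterns(policy)]
    return sorted(
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    )
```

Loading the policy with `json.load` and passing the resulting dictionary to `missing_actions` returns an empty list when every listed action is covered.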
Dremio Configuration
Dremio administrators are responsible for the following Dremio configuration tasks:
- Configure Dremio access to AWS Glue Catalog and AWS S3 datasets
- Verify default settings for asynchronous access and local caching
- Verify or update refresh policies for Data Reflections and metadata
- Specify which Dremio users have edit access to the AWS Glue data source
Administrators can later access these settings and update the initial configuration by editing the data source.
General
Dremio administrators configuring access to AWS Glue Catalog and AWS S3 datasets specify one of three authentication methods:
- AWS Access Key method -- Makes available all buckets (or only the whitelisted buckets, if specified) associated with this access key or, if provided, with the IAM role to assume. This method uses the following settings:
  - AWS Access Key ID
  - AWS Access Secret
  - IAM Role (optional) -- Dremio assumes this role in conjunction with the AWS Access Key method.
- EC2 Metadata method -- Makes available all buckets (or only the whitelisted buckets, if specified) associated with the IAM role attached to the EC2 instance or, if provided, with the IAM role to assume.
- No Authentication -- Makes available only the buckets listed in Public Buckets.
[info] Dremio encrypts connections to AWS Glue by default.
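The three authentication methods above differ mainly in which settings must be supplied. The sketch below models that as a small pre-flight check. It is illustrative only: the class, the field names, and the method labels ("ACCESS_KEY", "EC2_METADATA", "NONE") are hypothetical and are not Dremio's configuration schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlueAuthSettings:
    """Hypothetical container for the values entered on the General tab."""
    method: str  # "ACCESS_KEY", "EC2_METADATA", or "NONE" (labels are assumptions)
    access_key_id: Optional[str] = None
    access_secret: Optional[str] = None
    iam_role: Optional[str] = None  # optional role to assume
    public_buckets: List[str] = field(default_factory=list)


def validate(settings: GlueAuthSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    if settings.method == "ACCESS_KEY":
        # The access key method needs both halves of the key pair.
        if not settings.access_key_id:
            problems.append("ACCESS_KEY requires an access key ID")
        if not settings.access_secret:
            problems.append("ACCESS_KEY requires an access secret")
    elif settings.method == "EC2_METADATA":
        # Credentials come from the role attached to the EC2 instance.
        if settings.access_key_id or settings.access_secret:
            problems.append("EC2_METADATA does not use an access key pair")
    elif settings.method == "NONE":
        # Without authentication, only the listed public buckets are reachable.
        if not settings.public_buckets:
            problems.append("NONE exposes only Public Buckets; list at least one")
    else:
        problems.append(f"unknown authentication method: {settings.method}")
    return problems
```

A check like this only catches missing or contradictory inputs; whether the credentials actually grant access is determined by the IAM policy attached to them.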
Advanced Options
By default, Dremio enables asynchronous access and local caching when possible. These settings appear on the Advanced Options modal.
Reflection Refresh
On the Reflection Refresh tab, specify how frequently Dremio refreshes Data Reflections that are based on the Glue data source. By default, Dremio refreshes Reflections every hour and expires them after three hours.
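To make the default timing concrete, the sketch below computes when a Reflection would next refresh and when it would expire. It assumes both intervals are measured from the last successful refresh, which the text above does not state; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Defaults described above: refresh every hour, expire after three hours.
REFRESH_INTERVAL = timedelta(hours=1)
EXPIRE_AFTER = timedelta(hours=3)


def reflection_schedule(last_refresh, refresh_interval=REFRESH_INTERVAL,
                        expire_after=EXPIRE_AFTER):
    """Return (next_refresh, expires_at), both measured from last_refresh.

    Measuring from the last successful refresh is an assumption.
    """
    return last_refresh + refresh_interval, last_refresh + expire_after


def is_expired(last_refresh, now, expire_after=EXPIRE_AFTER):
    """True once the expiry window since last_refresh has fully elapsed."""
    return now - last_refresh >= expire_after


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0)
    next_refresh, expires_at = reflection_schedule(last)
    print("next refresh:", next_refresh.isoformat())
    print("expires at:  ", expires_at.isoformat())
```

With these defaults, a Reflection that misses two consecutive hourly refreshes is still within its expiry window; it expires only if no refresh succeeds for three hours.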
Metadata
On the Metadata tab, specify how and how frequently Dremio refreshes metadata. By default, Dremio fetches top-level objects and dataset details every hour. To improve query performance, Dremio retrieves details only for queried datasets by default.
Sharing
On the Sharing tab, specify which users have edit access to the data source. By default, Dremio allows all users to edit the data source.