Dremio supports S3 datasets cataloged in AWS Glue as a Dremio data source.
Note: S3 files must be in one of the following formats:
- Parquet
- ORC
- Delimited text files (CSV/TSV)
Dremio administrators need credentials to access files in AWS S3 and to list databases and tables in the Glue Catalog. See Dremio Configuration for more information about supported authentication mechanisms.
When configuring a new Glue Catalog data source, Dremio recommends using the following AWS managed policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTableVersions",
        "glue:GetTables",
        "glue:GetConnection",
        "glue:GetConnections",
        "glue:GetDevEndpoint",
        "glue:GetDevEndpoints",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket"
      ],
      "Resource": [
        "arn:aws:s3:::aws-glue-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::aws-glue-*/*",
        "arn:aws:s3:::*/*aws-glue-*/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::crawler-public*",
        "arn:aws:s3:::aws-glue-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags",
        "ec2:DeleteTags"
      ],
      "Condition": {
        "ForAllValues:StringEquals": {
          "aws:TagKeys": [
            "aws-glue-service-resource"
          ]
        }
      },
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*",
        "arn:aws:ec2:*:*:security-group/*",
        "arn:aws:ec2:*:*:instance/*"
      ]
    }
  ]
}
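Before attaching a policy to the credentials Dremio uses, it can help to confirm that the policy document actually allows the Glue read actions needed to list databases and tables. The following is a minimal Python sketch, not part of Dremio or the AWS SDK. The function names and the `REQUIRED_GLUE_ACTIONS` checklist are hypothetical: the checklist is a small subset of the actions in the policy above, chosen for illustration. The check looks only at the `Action` entries of `Allow` statements and ignores `Resource`, `Condition`, `NotAction`, and `Deny`, so it is a quick sanity check rather than a full IAM evaluation.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical checklist: a few read-only Glue actions from the policy above.
REQUIRED_GLUE_ACTIONS = [
    "glue:GetDatabases",
    "glue:GetTables",
    "glue:GetPartitions",
]


def allowed_action_patterns(policy):
    """Collect the Action patterns from every Allow statement in a policy dict."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement object is also valid
        statements = [statements]
    patterns = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        patterns.extend(actions)
    return patterns


def missing_actions(policy, required):
    """Return the required actions that no Allow statement's Action matches."""
    # IAM action names are case-insensitive and may contain * and ? wildcards.
    patterns = [p.lower() for p in allowed_action_patterns(policy)]
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]


# A deliberately incomplete example policy (it lacks the partition actions).
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetDatabases",
                 "glue:GetTable", "glue:GetTables"],
      "Resource": ["*"]
    }
  ]
}
""")

print(missing_actions(policy, REQUIRED_GLUE_ACTIONS))  # ['glue:GetPartitions']
```

Running the same check against the full policy shown above should return an empty list, since that policy lists every action in the checklist.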
Dremio administrators are responsible for the following Dremio configuration tasks:
Administrators can later access these settings and update the initial configuration by editing the data source.
Dremio administrators configuring access to AWS Glue Catalog and AWS S3 datasets specify one of three authentication methods:
Dremio encrypts connections to AWS Glue by default.
By default, the Advanced Options modal enables asynchronous access and local caching when possible.
Specify how frequently Dremio refreshes Data Reflections based on the Glue data source in the Reflection Refresh tab. By default, Dremio refreshes reflections every hour and expires them after three hours.
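To see how the two default intervals interact, the arithmetic can be sketched in a few lines of Python. This is an illustration only: the constant and function names are hypothetical, and the sketch assumes that the expiration is measured from the time of the last successful refresh.

```python
from datetime import datetime, timedelta, timezone

# Defaults described above; the names are illustrative, not Dremio settings keys.
REFRESH_EVERY = timedelta(hours=1)
EXPIRE_AFTER = timedelta(hours=3)


def reflection_window(last_refresh):
    """Return (next_refresh, expires_at) for a reflection refreshed at last_refresh."""
    return last_refresh + REFRESH_EVERY, last_refresh + EXPIRE_AFTER


last_refresh = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
next_refresh, expires_at = reflection_window(last_refresh)
print(next_refresh.isoformat())  # 2024-01-01T10:00:00+00:00
print(expires_at.isoformat())    # 2024-01-01T12:00:00+00:00
```

Under that assumption, the gap between the two intervals leaves room for a delayed or failed refresh before the existing reflection data expires.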
Specify how and how frequently Dremio refreshes metadata on the Metadata tab. By default, Dremio fetches top-level objects and dataset details every hour. To improve query performance, Dremio retrieves details only for queried datasets by default.
Specify which users have edit access to the data source in the Sharing tab. By default, Dremio allows all users to edit the data source.