AWS Glue Data Catalog
Dremio supports Amazon S3 datasets cataloged in AWS Glue as a Dremio data source. Files in S3 must be one of the following file formats or table formats:
- Apache Iceberg
- Delimited text files (CSV/TSV)
- Delta Lake (Dremio supports reading native Delta Lake tables in AWS Glue. Delta Lake symlink tables must be crawled so that native Delta Lake tables are created from them. See Introducing native Delta Lake table support with AWS Glue crawlers in the AWS Big Data blog.)
- ORC
- Parquet
AWS Glue data sources added to projects default to using the Apache Iceberg table format. When upgrading, AWS Glue data sources added to projects before Dremio 22 are modified to use the Apache Iceberg table format as the default format.
AWS Glue Credentials
Dremio administrators need credentials to access files in Amazon S3 and list databases and tables in the AWS Glue Catalog. Dremio recommends using the provided sample AWS managed policy when configuring a new AWS Glue Catalog data source. See Dremio Configuration for more information about supported authentication mechanisms.
Dremio reads the table metadata from AWS Glue and directly scans the data on S3 using its high-performance, massively parallel processing (MPP) engine. For this reason, you need to give permissions to connect to Glue as well as the permissions to read the data on S3 for those tables.
AWS IAM Policy for Accessing Amazon S3 and AWS Glue
Dremio recommends using the following AWS managed policy:
IAM policy for accessing Amazon S3 and AWS Glue
{
    "Version": "2012-10-17",
    "Statement": [
        # Allow Dremio to run the listed AWS Glue API operations.
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:GetTableVersions",
                "glue:GetTables",
                "glue:GetConnection",
                "glue:GetConnections",
                "glue:GetDevEndpoint",
                "glue:GetDevEndpoints",
                "glue:GetUserDefinedFunction",
                "glue:GetUserDefinedFunctions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        },
        # Allow Dremio to read and write objects in S3 buckets or folders whose names include the 'aws-glue-' prefix.
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::aws-glue-*/*",
                "arn:aws:s3:::*/*aws-glue-*/*"
            ]
        },
        # Allow Dremio to read objects in S3 buckets whose names begin with the 'crawler-public' or 'aws-glue-' prefixes.
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::crawler-public*",
                "arn:aws:s3:::aws-glue-*"
            ]
        },
        # Allow creating or deleting the 'aws-glue-service-resource' tag on the listed Amazon EC2 resources used by AWS Glue.
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:DeleteTags"
            ],
            "Condition": {
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": [
                        "aws-glue-service-resource"
                    ]
                }
            },
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:security-group/*",
                "arn:aws:ec2:*:*:instance/*"
            ]
        }
    ]
}
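To sanity-check the credentials before adding the source, a quick client-side test can exercise the same AWS Glue and Amazon S3 permissions the policy grants. The following is a minimal sketch using Python and boto3; the region, bucket, and object key are placeholders, and reading your table data also requires s3:GetObject on the buckets that actually hold it (not only the aws-glue-* buckets above).

# Minimal permission check with boto3; the region, bucket, and key below are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
s3 = boto3.client("s3")

# Exercises glue:GetDatabases and glue:GetTables from the policy above.
for db in glue.get_databases()["DatabaseList"]:
    tables = glue.get_tables(DatabaseName=db["Name"])["TableList"]
    print(db["Name"], [t["Name"] for t in tables])

# Exercises s3:GetObject on a data file referenced by one of the tables.
obj = s3.get_object(Bucket="my-data-bucket", Key="warehouse/sales/part-00000.parquet")
print(obj["ContentLength"], "bytes readable")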
Configuring AWS Glue Data Catalog as a Source
- On the Datasets page, to the right of Sources in the left panel, click the Add Source icon.
- In the Add Data Source dialog, under Lakehouse Catalogs, select AWS Glue Data Catalog.
General
Users with proper privileges can configure access to the AWS Glue Catalog with one of the authentication methods described below.
Name
Specify a name for the data source. You cannot change the name after the source is created. The name cannot include the following special characters: /, :, [, or ].
AWS Region Selection
Specify the AWS region whose AWS Glue tables you want to see. Only tables from this region are shown after the connection is made.
Authentication
Choose one of the following authentication methods:
- AWS Access Key: All buckets associated with this access key, or with the IAM role to assume if one is provided, will be available (or only the allowed buckets, if any are specified). A sketch of the equivalent STS flow appears after this list.
- Under AWS Access Key, enter the AWS access key ID.
- Under AWS Access Secret, provide the AWS access secret using one of the following methods:
- Dremio: Provide the access secret in plain text. Dremio stores the access secret.
- Azure Key Vault: Provide the URI for your stored secret using the format https://<vault_name>.vault.azure.net/secrets/<secret_name>.
- AWS Secrets Manager: Provide the Amazon Resource Name (ARN) for the AWS Secrets Manager secret that holds the access secret, which is available in the AWS web console or using command line tools.
- HashiCorp Vault: Select your HashiCorp secrets engine from the dropdown and enter the access secret reference in the correct format.
- Under IAM Role to Assume, enter the IAM role that Dremio should assume in conjunction with the AWS Access Key authentication method.
- EC2 Metadata: All buckets associated with the IAM role attached to the EC2 instance, or with the IAM role to assume if one is provided, will be available (or only the allowed buckets, if any are specified).
- Under IAM Role to Assume, enter the IAM role that Dremio should assume in conjunction with the EC2 Metadata authentication method.
- EKS Pod Identity: Dremio can access all S3 buckets linked to the IAM role associated with the Kubernetes service account or the assumed IAM role. If you specify certain buckets, only those will be available.
- Under IAM Role to Assume, enter the IAM role that Dremio should assume when using the Pod Identity authentication method.
- AWS Profile: Dremio sources profile credentials from the specified AWS profile. For information on how to set up a configuration or credentials file for AWS, see AWS Custom Authentication.
- AWS profile (optional): The AWS profile name. If this is left blank, then the default profile will be used. For more information about using profiles in a credentials or configuration file, see AWS's documentation on Configuration and credential file settings.
The Encrypt connection option is enabled by default to encrypt the connection to AWS Glue. Clear the checkbox to disable encryption.
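When the AWS Access Key method is combined with an IAM Role to Assume, the connection is conceptually an STS AssumeRole call made with the key's credentials, followed by AWS Glue calls under the temporary credentials. The sketch below (hypothetical key, role ARN, and session name; boto3 assumed) can be used to confirm that the key is allowed to assume the role before configuring the source.

# Sketch of the AWS Access Key + IAM Role to Assume flow; all identifiers are placeholders.
import boto3

sts = boto3.client(
    "sts",
    aws_access_key_id="AKIAEXAMPLE",
    aws_secret_access_key="example-secret",
)

# Confirm the access key may assume the role that the source will be configured with.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/dremio-glue-access",
    RoleSessionName="dremio-glue-check",
)["Credentials"]

# Use the temporary credentials to list Glue databases in the chosen region.
glue = boto3.client(
    "glue",
    region_name="us-east-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([db["Name"] for db in glue.get_databases()["DatabaseList"]])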
Allowed Databases
The allowed databases configuration is a post-connection filter on the databases visible from AWS Glue. When selective access to the databases within AWS Glue is required, the allowed databases filter limits access within Dremio to only the needed databases for each source connection, which improves data security and source metadata refresh performance.
When the allowed database filter is empty, all databases from the AWS Glue source are visible in Dremio. When a database is added or removed from the filter, Dremio performs an asynchronous update to expose new databases and remove databases not included in the filter. Each entry in the allowed database filter must be a valid database name; misspelled or nonexistent databases are ignored.
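As an illustration of those semantics (not Dremio internals), the following sketch applies an allow-list to the database names returned by AWS Glue: an empty list exposes everything, and misspelled or nonexistent entries simply drop out. The region and database names are placeholders.

# Illustration of the allowed-databases filter semantics; not how Dremio implements it.
import boto3

def visible_databases(allowed):
    glue = boto3.client("glue", region_name="us-east-1")
    all_dbs = [db["Name"] for db in glue.get_databases()["DatabaseList"]]
    if not allowed:
        # An empty filter leaves every database in the source visible.
        return all_dbs
    # Entries that do not match an existing database are ignored.
    return [name for name in all_dbs if name in set(allowed)]

print(visible_databases([]))                       # all databases
print(visible_databases(["sales_db", "typo_db"]))  # only sales_db, if it exists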
Advanced Options
All configurations are optional.
Connection Properties
A list of additional properties to apply to the connection.
Locations in which Iceberg Tables are Created
Where the CREATE TABLE command creates an Iceberg table depends on the type of data source being used. For AWS Glue Data Sources, the root directory is assumed by default to be /user/hive/warehouse. If you want to create tables in a different location, you must specify the S3 address of an Amazon S3 bucket in which to create them:
- On the Advanced Options page of the Edit Source dialog, add this connection property: hive.metastore.warehouse.dir.
- Set the value to the S3 address of an S3 bucket.
The schema path and table name are appended to the root location to determine the default physical location for a new Iceberg table.
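For example (hypothetical bucket, schema, and table names), the default location is derived by joining the warehouse directory, the schema path, and the table name:

# Hypothetical illustration of how the default Iceberg table location is composed.
warehouse_dir = "s3://my-bucket/warehouse"      # value of hive.metastore.warehouse.dir
schema_path, table_name = "sales_db", "orders"  # e.g., CREATE TABLE <source>.sales_db.orders ...
location = f"{warehouse_dir}/{schema_path}/{table_name}"
print(location)  # s3://my-bucket/warehouse/sales_db/orders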
Lake Formation Integration
Lake Formation provides access controls and allows administrators to define security policies. For details about enabling this functionality and about the configuration options below, see the Integrating with Lake Formation page.
- Enforce AWS Lake Formation access permissions on datasets. Dremio checks any datasets included in the AWS Glue source for the required permissions to perform queries.
- Prefix to map Dremio users to AWS ARNs. Leave blank to default to the end user's username, or enter a REGEX expression.
- Prefix to map Dremio groups to AWS ARNs. Leave blank to default to the end user's group, or enter a REGEX expression.
Reflection Refresh
On the Reflection Refresh tab, specify how frequently Dremio refreshes data reflections that are based on the AWS Glue data source. By default, Dremio refreshes reflections every hour and expires them after three hours.
- Never refresh -- Specifies how often to refresh based on hours, days, weeks, or never.
- Never expire -- Specifies how often to expire based on hours, days, weeks, or never.
Metadata
Specify how and how frequently Dremio refreshes metadata on the Metadata tab. By default, Dremio fetches top-level objects and dataset details every hour. Dremio retrieves details only for queried datasets by default to improve query performance.
Privileges
On the Privileges tab, you can grant privileges to specific users or roles. See Access Controls for additional information about privileges. All privileges are optional.
- For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
- For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
- Click Save after setting the configuration.
Updating an AWS Glue Data Catalog Source
To update an AWS Glue Data Catalog source:
- On the Datasets page, under Lakehouse Catalogs in the panel on the left, find the name of the source you want to edit.
- Right-click the source name and select Settings from the list of actions. Alternatively, click the source name and then click the Settings icon at the top right corner of the page.
- In the Source Settings dialog, edit the settings you wish to update. Dremio does not support updating the source name. For information about the settings options, see Configuring AWS Glue Data Catalog as a Source.
- Click Save.
Deleting an AWS Glue Data Catalog Source
If the source is in a bad state (for example, Dremio cannot authenticate to the source or the source is otherwise unavailable), only users who belong to the ADMIN role can delete the source.
To delete an AWS Glue Data Catalog source, perform these steps:
- On the Datasets page, click Sources > Lakehouse Catalogs in the panel on the left.
- In the list of data sources, hover over the name of the source you want to remove and right-click.
- From the list of actions, click Delete.
- In the Delete Source dialog, click Delete to confirm that you want to remove the source.
Deleting a source causes all downstream views that depend on objects in the source to break.