AWS Glue Data Catalog
The AWS Glue Data Catalog is a metadata store that lets you store and share metadata in the AWS Cloud.
Supported Formats
Dremio can query data stored in S3 in file formats (including delimited, Excel (XLSX), and Parquet) and Apache Iceberg or Delta Lake table formats.
Add an AWS Glue Data Catalog
To add an AWS Glue Data Catalog to your project:
- From the Datasets page, to the right of Sources in the left panel, click the icon for adding a data source.
- In the Add Data Source dialog, under Lakehouse Catalogs, select AWS Glue Data Catalog.
General
To configure an AWS Glue Data Catalog source:
- Name – Specify a name for the data source. You cannot change the name after the source is created. The name cannot include the following special characters: /, :, [, or ].
- AWS Region Selection – Specify the region hosting the AWS Glue catalog.
- Authentication – Provide the role that Dremio will assume to gain access to the source:
  - Create an AWS IAM role in your AWS account that trusts Dremio.
  - Add an AWS Glue Access Policy to your custom role that provides access to your AWS Glue Data Catalog source.
  - Add the Role ARN to the source configuration.
- Allowed Databases – (Optional) A post-connection filter on the databases visible from AWS Glue. When you need selective access to the databases within AWS Glue, the filter limits access within Dremio to only the needed databases per source connection, which improves data security and source metadata refresh performance.
  When the allowed databases filter is empty, all databases from the AWS Glue source are visible in Dremio. When a database is added to or removed from the filter, Dremio performs an asynchronous update to expose new databases and remove databases not included in the filter. Each entry in the filter must be a valid database name; misspelled or nonexistent databases are ignored.
- Encrypted Connection – (Optional) To secure the connections between AWS Glue and Dremio, select the Encrypt connection checkbox.
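The naming rule and the allowed databases filter described above can be illustrated with a minimal Python sketch. It is not Dremio's implementation: the function names are hypothetical, the non-empty name check is an assumption, and exact, case-sensitive matching of database names is also an assumption.

```python
# Characters that a source name may not contain, per the Name setting above.
FORBIDDEN_NAME_CHARS = set("/:[]")


def is_valid_source_name(name: str) -> bool:
    """Return True if the name avoids the forbidden characters.

    Rejecting an empty name is an assumption, not a documented rule.
    """
    return bool(name) and not (set(name) & FORBIDDEN_NAME_CHARS)


def visible_databases(glue_databases: list[str], allowed: list[str]) -> list[str]:
    """Apply the allowed databases filter to the databases reported by AWS Glue.

    An empty filter exposes every database. Filter entries that match no
    existing database (for example, misspellings) simply have no effect.
    Exact string matching is an assumption of this sketch.
    """
    if not allowed:
        return list(glue_databases)
    allowed_set = set(allowed)
    return [db for db in glue_databases if db in allowed_set]
```

For example, `visible_databases(["sales", "hr"], ["sales", "slaes"])` keeps only `sales`, because the misspelled entry matches nothing.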
Advanced Options
Click Advanced Options in the left menu sidebar.
- Connection Properties – You can add key-value pairs to provide custom connection properties relevant to the source.
  - Click Add Property.
  - For Name, enter a connection property.
  - For Value, enter the corresponding connection property value.
- Lake Formation – Lake Formation provides access controls and allows administrators to define security policies. For how to enable this functionality and for details on the configuration options below, see AWS Lake Formation.
  - Enforce AWS Lake Formation access permissions on datasets – Dremio checks any datasets included in the AWS Glue source for the required permissions to perform queries.
  - Prefix to map Dremio users to AWS ARNs – Leave blank to default to the end user's username, or enter a regular expression.
  - Prefix to map Dremio groups to AWS ARNs – Leave blank to default to the end user's group, or enter a regular expression.
Under Cache Options:
- Enable local caching when possible – Selected by default, along with asynchronous access for cloud caching. Clear the checkbox to disable this option.
- Max percent of total available cache space to use when possible – Specifies the disk quota, as a percentage, that a source can use on any single executor node. This setting applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either enter a percentage in the value field or use the arrows to the far right to adjust the percentage.
Reflection Refresh
Click Reflection Refresh in the source settings sidebar. This section lets you manage how often Reflections are refreshed and how long data can be served before expiration. To learn more about Reflections, see Manual Reflections. All Reflection parameters are optional.
You can set the following refresh policies for Reflections:
- Refresh period – Manage the refresh period by either enabling the option to never refresh or setting a refresh frequency in hours, days, or weeks. The default frequency to refresh Reflections is every hour.
- Expiration period – Set the expiration period for the length of time that data can be served by either enabling the option to never expire or setting an expiration time in hours, days, or weeks. The default expiration time is three hours.
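As an illustration of how the two periods interact, here is a small Python sketch using the defaults stated above (refresh every hour, expire after three hours). The class and method names are hypothetical, and treating the boundary as inclusive is an assumption; Dremio's scheduler may behave differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RefreshPolicy:
    # None models the "never refresh" / "never expire" options.
    refresh_period: Optional[timedelta] = timedelta(hours=1)
    expiration_period: Optional[timedelta] = timedelta(hours=3)

    def needs_refresh(self, last_refresh: datetime, now: datetime) -> bool:
        """True once the refresh period has elapsed since the last refresh."""
        if self.refresh_period is None:
            return False
        return now - last_refresh >= self.refresh_period

    def is_expired(self, last_refresh: datetime, now: datetime) -> bool:
        """True once the data may no longer be served."""
        if self.expiration_period is None:
            return False
        return now - last_refresh >= self.expiration_period
```

With the defaults, a Reflection refreshed at midnight is due for refresh at 01:00 and can still be served until 03:00 if refreshes keep failing.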
Metadata
Click Metadata in the left menu sidebar. This section lets you configure settings to refresh metadata and enable other dataset options. All metadata parameters are optional.
You can configure Dataset Handling and Metadata Refresh parameters.
Dataset Handling
- Remove dataset definitions if underlying data is unavailable – By default, Dremio removes dataset definitions if underlying data is unavailable. Clearing this option is useful when files are temporarily deleted and then added back in the same location with new sets of files.
Metadata Refresh
- Dataset Discovery – The refresh interval for retrieving top-level source object names such as databases and tables. You can set the interval in minutes, hours, days, or weeks. The default is one hour.
- Dataset Details – The metadata that Dremio needs for query planning, such as information required for fields, types, shards, statistics, and locality.
  - Fetch mode – You can choose to fetch only from queried datasets. Dremio updates details for previously queried objects in a source. By default, this is set to Only Queried Datasets.
  - Fetch every – You can set the frequency to fetch dataset details in minutes, hours, days, or weeks. The default is one hour.
  - Expire after – You can set the expiry time of dataset details in minutes, hours, days, or weeks. The default is three hours.
Privileges
Click Privileges in the left menu sidebar. This section lets you grant privileges to specific users or roles. To learn more about how Dremio allows for the implementation of granular-level privileges, see Privileges.
To add a privilege for a user or role:
- In the Add User/Role field, enter the user or role name to which you want to apply privileges.
- Click Add to Privileges. The user or role is added to the Users table.
To set privileges for a user or role:
- In the Users table, find the user or role to set privileges for and click under the appropriate column (Select, Alter, Create Table, etc.) to enable or disable that privilege. A green checkmark indicates that the privilege is enabled.
- Click Save.
After you have connected Dremio to the AWS Glue Data Catalog, you will be able to edit the Data Catalog and remove it when it is no longer needed.
Update an AWS Glue Data Catalog Source
To update an AWS Glue Data Catalog source:
- From the Datasets page, in the Lakehouse Catalogs section, right-click on the source and select Settings.
- For information about these settings and guidance on the changes you can make, see Add an AWS Glue Data Catalog.
- Click Save.
Delete an AWS Glue Data Catalog Source
To remove a Data Catalog source:
- From the Datasets page, in the Lakehouse Catalogs section, right-click on the source and select Delete.
- Click Delete again to confirm.
Add an AWS Glue Access Policy to a Custom Role
To add the required AWS Glue access policy to your custom role:
- On the Roles page, click the role name. Use the Search field to locate the role if needed.
- In the Permissions section of the role, click Add permissions > Create inline policy.
- On the Create policy page, click the JSON tab.
- Delete the current JSON policy and paste in the IAM Policy Template for AWS Glue Catalog.
IAM Policy Template for AWS Glue Catalog:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AccessGlueCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTableVersions",
        "glue:GetTables",
        "glue:GetConnection",
        "glue:GetConnections",
        "glue:GetDevEndpoint",
        "glue:GetDevEndpoints",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "ReadWriteGlueS3Buckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::aws-glue-*/*",
        "arn:aws:s3:::*/*aws-glue-*/*"
      ]
    },
    {
      "Sid": "ReadPublicGlueBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::crawler-public*",
        "arn:aws:s3:::aws-glue-*"
      ]
    },
    {
      "Sid": "ManageGlueServiceTags",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags",
        "ec2:DeleteTags"
      ],
      "Condition": {
        "ForAllValues:StringEquals": {
          "aws:TagKeys": [
            "aws-glue-service-resource"
          ]
        }
      },
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*",
        "arn:aws:ec2:*:*:security-group/*",
        "arn:aws:ec2:*:*:instance/*"
      ]
    }
  ]
}
- Click Next.
- On the Review policy page, in the Name field, enter a name for the policy.
- Click Create policy. The policy is created and you are returned to the Roles page.
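If you prefer to script this step instead of using the console, the following Python sketch shows one possible approach. It assumes the third-party boto3 library and valid AWS credentials; the helper names are hypothetical, and only the first statement of the template is built here for brevity, so a real policy document should include all four statements.

```python
import json

# Glue actions copied from the AccessGlueCatalog statement of the template.
GLUE_READ_ACTIONS = [
    "glue:GetDatabase", "glue:GetDatabases",
    "glue:GetPartition", "glue:GetPartitions",
    "glue:GetTable", "glue:GetTableVersions", "glue:GetTables",
    "glue:GetConnection", "glue:GetConnections",
    "glue:GetDevEndpoint", "glue:GetDevEndpoints",
    "glue:GetUserDefinedFunction", "glue:GetUserDefinedFunctions",
    "glue:BatchGetPartition",
]


def build_glue_catalog_statement() -> dict:
    """Build the read-only Glue statement from the template."""
    return {
        "Sid": "AccessGlueCatalog",
        "Effect": "Allow",
        "Action": list(GLUE_READ_ACTIONS),
        "Resource": ["*"],
    }


def attach_inline_policy(role_name: str, policy_name: str, policy: dict) -> None:
    """Attach the policy document to the role as an inline policy.

    boto3 is imported here so the rest of the sketch runs without it.
    """
    import boto3  # third-party dependency, assumed to be installed

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )
```

A call such as `attach_inline_policy("my-dremio-role", "glue-access", policy)` uses placeholder role and policy names; substitute the custom role you created for Dremio.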
AWS Lake Formation
AWS Lake Formation provides access controls for datasets in the AWS Glue Data Catalog and is used to define security policies from a centralized location that may be shared across multiple tools. Dremio may be configured to refer to this service to verify user access to contained datasets.
Requirements
- Identity provider service set up
- (Recommended) SAML connection with AWS
- Permissions set up in Lake Formation
- AWS Glue Data Catalog connected to Dremio
- User and Group ARN prefixes specified and enabled
Lake Formation Workflow
When Lake Formation is properly configured, Dremio adheres to the following workflow each time an end user attempts to access, edit, or query datasets with managed privileges:
- Dremio enforces access control. See Configure Sources for Lake Formation for access control recommendations.
- Dremio checks each table stored in the AWS Glue source to determine whether it is configured to use Lake Formation for security.
  - If one or more datasets leverage Lake Formation, Dremio determines the user ARNs to use when checking against Lake Formation.
- Dremio queries Lake Formation to determine a user's access level to the datasets using the user/group ARNs.
- If the user has access to the datasets specified within the query's scope, the query proceeds.
- If the user lacks access, the query fails with a permission error.
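The workflow above can be sketched as a simple decision function in Python. This is an illustration under stated assumptions, not Dremio's code: the `Table` class, the `authorize_query` function, and the `grants` dictionary (a stand-in for the Lake Formation lookup) are all hypothetical, and Dremio's own privilege check in the first step is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    uses_lake_formation: bool  # whether the table is managed by Lake Formation


def authorize_query(tables, principal_arns, grants) -> None:
    """Raise PermissionError if any managed table is not accessible.

    tables:         tables referenced by the query
    principal_arns: user and group ARNs resolved for the end user
    grants:         table name -> set of ARNs with access (stand-in for
                    querying Lake Formation)
    """
    managed = [t for t in tables if t.uses_lake_formation]
    if not managed:
        return  # nothing is governed by Lake Formation, so no check is needed

    principals = set(principal_arns)
    for table in managed:
        if not (principals & grants.get(table.name, set())):
            raise PermissionError(f"no Lake Formation access to {table.name}")
```

The function returns normally when the query may proceed and raises an error otherwise, mirroring the last two steps of the workflow.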
Configure Sources for Lake Formation
Lake Formation integration depends on the mapping of user/group names in Dremio to the IAM user/group ARNs used by AWS.
To configure an existing or new AWS Glue Data Catalog source, you must set the following options:
- In the settings of your existing source, or while creating a new AWS Glue Data Catalog source, open the Advanced Options tab.
- Enable Enforce AWS Lake Formation access permissions on datasets.
- Fill in the user and group prefix settings as instructed in the Lake Formation Permissions Reference. For example, if you are using a SAML provider in AWS:
  - User prefix with SAML: arn:aws:iam::<AWS_ACCOUNT_ID>:saml-provider/<PROVIDER_NAME_IN_AWS>:user/
  - Group prefix with SAML: arn:aws:iam::<AWS_ACCOUNT_ID>:saml-provider/<PROVIDER_NAME_IN_AWS>:group/

Note: As a best practice, on the Privileges tab, we recommend enabling the Select privilege for All Users to allow non-admin users to access the AWS Glue source from Dremio.
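To make the prefix settings concrete, the following Python sketch builds the SAML prefixes shown above and maps a Dremio user and its groups to ARNs. It assumes that the ARN is formed by appending the Dremio name to the prefix; that concatenation rule and the function names are assumptions for illustration, and regular-expression prefixes are not modeled.

```python
def saml_prefixes(account_id: str, provider_name: str) -> tuple[str, str]:
    """Return the (user_prefix, group_prefix) pair for a SAML provider."""
    base = f"arn:aws:iam::{account_id}:saml-provider/{provider_name}"
    return f"{base}:user/", f"{base}:group/"


def map_to_arns(user: str, groups: list[str],
                user_prefix: str, group_prefix: str) -> tuple[str, list[str]]:
    """Map a Dremio user and groups to ARNs by appending names to prefixes.

    With blank prefixes the plain user and group names are returned,
    matching the documented default for empty prefix fields.
    """
    return user_prefix + user, [group_prefix + g for g in groups]
```

For a placeholder account `123456789012` and provider `my-idp`, the user `alice` maps to `arn:aws:iam::123456789012:saml-provider/my-idp:user/alice`.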
Lake Formation Cell-Level Security
Dremio supports AWS Lake Formation cell-level security with row-level access permissions based on AWS Lake Formation PartiQL expressions. If the user does not have read permissions on a column or cell, Dremio masks the data in that column or cell with a NULL value.
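The masking behavior can be pictured with a short Python sketch, where `None` plays the role of SQL NULL. The `can_read` predicate is a hypothetical stand-in for the permissions that Lake Formation returns; how Dremio evaluates PartiQL expressions internally is not shown here.

```python
from typing import Any, Callable


def mask_rows(rows: list[dict[str, Any]],
              can_read: Callable[[dict[str, Any], str], bool]) -> list[dict[str, Any]]:
    """Replace every cell the user may not read with None (NULL).

    can_read(row, column) decides access per cell, so it covers both
    whole-column restrictions and row-dependent ones.
    """
    return [
        {col: (val if can_read(row, col) else None) for col, val in row.items()}
        for row in rows
    ]
```

For instance, a predicate that allows the `ssn` column only for rows where `region` is `US` leaves those cells intact and returns NULL for `ssn` in every other row.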
To speed up query planning, Dremio uses the AWS Lake Formation permissions cache for each table. By default, the cache is enabled and reuses previously loaded permissions for up to 3600 seconds (1 hour).
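A time-based cache of this kind can be sketched in Python as follows. The class name, the `loader` callback, and the injectable clock are hypothetical; only the 3600-second default comes from the text above, and treating an entry as stale at exactly 3600 seconds is an assumption.

```python
import time


class PermissionsCache:
    """Per-table cache that reuses loaded permissions until they are too old."""

    def __init__(self, loader, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._loader = loader      # fetches permissions for one table
        self._ttl = ttl_seconds    # maximum age of a cached entry
        self._clock = clock        # injectable for testing
        self._entries = {}         # table name -> (loaded_at, permissions)

    def get(self, table: str):
        now = self._clock()
        entry = self._entries.get(table)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh: reuse cached permissions
        permissions = self._loader(table)
        self._entries[table] = (now, permissions)
        return permissions
```

Repeated lookups for the same table within the hour hit the cache, so only the first one pays the cost of contacting the permissions service.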
Limitations
- VPC-restricted S3 buckets are not supported.