AWS Glue Data Catalog

The AWS Glue Data Catalog is a metadata store that lets you store and share metadata in the AWS Cloud.

Supported Formats

Dremio can query data stored in S3 in various file formats (including delimited, Excel (XLSX), and Parquet) and Apache Iceberg or Delta Lake table formats.

Connect to an AWS Glue Data Catalog

  1. In the Dremio console, click Add Data on the Home page.
  2. In the Add Data dialog, select AWS Glue Data Catalog.
  3. Configure the connection using the sections below, then click Save.

General

  • Name – Specify a name for the connection. You cannot change the name after the connection is created. The name cannot include the following special characters: /, :, [, or ].

  • AWS Region Selection – Specify the region hosting the AWS Glue catalog.

  • Authentication – Provide the IAM Role ARN that Dremio will assume to access the catalog.

  • Allowed Databases – (Optional) The allowed databases configuration is a post-connection filter on the databases visible from AWS Glue. When selective access to the databases within AWS Glue is required, the allowed databases filter limits access within Dremio to only the needed databases, improving data security and metadata refresh performance.

    When the allowed databases filter is empty, all databases from the AWS Glue Data Catalog are visible in Dremio. When a database is added to or removed from the filter, Dremio performs an asynchronous update to expose new databases and remove databases not included in the filter. Each entry in the allowed databases filter must be a valid database name; misspelled or nonexistent databases are ignored.

  • Encrypted Connection – (Optional) To secure the connections between the AWS Glue Data Catalog and Dremio, select the Encrypt connection checkbox.
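The behavior of the allowed databases filter described above can be sketched in a few lines of Python. This is only an illustration of the documented semantics (an empty filter exposes everything, and unknown entries are ignored), not Dremio's implementation; the function name `visible_databases` and the exact-match comparison are assumptions.

```python
def visible_databases(glue_databases, allowed):
    """Return the Glue databases that would be visible for a given allow-list.

    An empty allow-list exposes every database. Entries that do not match an
    existing database (misspelled or nonexistent) are ignored, not rejected.
    Exact string matching is an assumption made for this sketch.
    """
    if not allowed:
        return list(glue_databases)
    allowed_set = set(allowed)
    return [db for db in glue_databases if db in allowed_set]


catalog = ["sales", "marketing", "finance"]
print(visible_databases(catalog, []))                    # all three databases
print(visible_databases(catalog, ["sales", "finanse"]))  # only "sales"; the misspelled entry is ignored
```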

Advanced Options

  • Connection Properties – You can add key-value pairs to provide custom connection properties.
    1. Click Add Property.
    2. For Name, enter a connection property.
    3. For Value, enter the corresponding connection property value.
  • Lake Formation – Lake Formation provides access controls and allows administrators to define security policies. For how to enable this functionality and for details on the configuration options below, see AWS Lake Formation.
    • Enforce AWS Lake Formation access permissions on datasets – Dremio checks any datasets included in the AWS Glue Data Catalog for the required permissions to perform queries.
    • Prefix to map Dremio users to AWS ARNs – Leave blank to default to the end user's username, or enter a regular expression.
    • Prefix to map Dremio groups to AWS ARNs – Leave blank to default to the end user's group, or enter a regular expression.

Under Cache Options:

  • Enable local caching when possible – Selected by default, along with asynchronous access for cloud caching. Uncheck the checkbox to disable this option.
  • Max percent of total available cache space to use when possible – Specifies the disk quota, as a percentage, available on any single executor node when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. You can either manually enter a percentage in the value field or use the arrows to the far right to adjust the percentage.

Reflection Refresh

  • Never refresh: Select to prevent automatic Reflection refresh; otherwise, the default is to refresh automatically.
  • Refresh every: Define how often to refresh Reflections, specified in hours, days, or weeks. This option is ignored if Never refresh is selected.
  • Set refresh schedule: Specify the daily or weekly schedule.
  • Never expire: Select to prevent Reflections from expiring; otherwise, the default is to expire automatically after the time limit specified in Expire after.
  • Expire after: The time limit after which Reflections expire and are removed from Dremio, specified in hours, days, or weeks. This option is ignored if Never expire is selected.

Metadata

Dataset Handling

  • Remove dataset definitions if underlying data is unavailable – By default, Dremio removes dataset definitions if underlying data is unavailable. This option is for scenarios when files are temporarily deleted and added back in the same location with new sets of files.

Metadata Refresh

  • Dataset Discovery – The refresh interval for retrieving top-level object names such as databases and tables. You can set the interval in minutes, hours, days, or weeks. The default is one hour.
  • Dataset Details – The metadata that Dremio needs for query planning, such as information required for fields, types, shards, statistics, and locality.
    • Fetch mode – You can choose to fetch only from queried datasets. Dremio updates details for previously queried objects. By default, this is set to Only Queried Datasets.
    • Fetch every – You can choose to set the frequency to fetch dataset details in minutes, hours, days, or weeks. The default frequency to fetch dataset details is one hour.
    • Expire after – You can choose to set the expiry time of dataset details in minutes, hours, days, or weeks. The default expiry time of dataset details is three hours.
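To make the interaction between the default Fetch every (one hour) and Expire after (three hours) values concrete, here is a rough sketch that classifies cached dataset details by age. The state names and the boundary handling are assumptions made for illustration; the page does not describe how Dremio schedules refreshes internally.

```python
from datetime import datetime, timedelta

FETCH_EVERY = timedelta(hours=1)   # default "Fetch every" from this page
EXPIRE_AFTER = timedelta(hours=3)  # default "Expire after" from this page


def details_state(last_fetched, now):
    """Classify dataset details by age (state names are hypothetical)."""
    age = now - last_fetched
    if age >= EXPIRE_AFTER:
        return "expired"
    if age >= FETCH_EVERY:
        return "due for refresh"
    return "fresh"


now = datetime(2024, 1, 1, 12, 0)
for minutes in (30, 90, 240):
    print(minutes, details_state(now - timedelta(minutes=minutes), now))
```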

Privileges

This connection inherits privileges from Project settings. To grant specific users or roles additional privileges in this connection:

  1. Enter the username or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the USERS/ROLES table.
  2. For the users or roles in the USERS/ROLES table, toggle the checkmark for each privilege you want to grant on the Dremio source that is being created.
  3. Click Save after setting the configuration.

See Privileges for additional information about privileges.

Edit an AWS Glue Data Catalog Connection

  1. On the Open Catalog page, under Connections, right-click the connection and select Settings.
  2. Update the connection configuration as needed.
  3. Click Save.

Delete an AWS Glue Data Catalog Connection

  1. On the Open Catalog page, under Connections, right-click the connection and select Delete.
  2. Click Delete to confirm.

Add an AWS Glue Access Policy to a Custom Role

To add the required AWS Glue access policy to your custom role:

  1. On the Roles page, click the role name. Use the Search field to locate the role if needed.

  2. From the Roles page, in the Permissions section, click Add permissions > Create inline policy.

  3. On the Create policy page, click the JSON tab.

  4. Delete the current JSON policy and replace it with the IAM Policy Template for AWS Glue Catalog.

    IAM Policy Template for AWS Glue Catalog
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AccessGlueCatalog",
          "Effect": "Allow",
          "Action": [
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:GetTable",
            "glue:GetTableVersions",
            "glue:GetTables",
            "glue:GetConnection",
            "glue:GetConnections",
            "glue:GetDevEndpoint",
            "glue:GetDevEndpoints",
            "glue:GetUserDefinedFunction",
            "glue:GetUserDefinedFunctions",
            "glue:BatchGetPartition"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Sid": "ReadWriteGlueS3Buckets",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject"
          ],
          "Resource": [
            "arn:aws:s3:::aws-glue-*/*",
            "arn:aws:s3:::*/*aws-glue-*/*"
          ]
        },
        {
          "Sid": "ReadPublicGlueBuckets",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:s3:::crawler-public*",
            "arn:aws:s3:::aws-glue-*"
          ]
        },
        {
          "Sid": "ManageGlueServiceTags",
          "Effect": "Allow",
          "Action": [
            "ec2:CreateTags",
            "ec2:DeleteTags"
          ],
          "Condition": {
            "ForAllValues:StringEquals": {
              "aws:TagKeys": [
                "aws-glue-service-resource"
              ]
            }
          },
          "Resource": [
            "arn:aws:ec2:*:*:network-interface/*",
            "arn:aws:ec2:*:*:security-group/*",
            "arn:aws:ec2:*:*:instance/*"
          ]
        }
      ]
    }
  5. Click Next.

  6. On the Review policy page, in the Name field, enter a name for the policy.

  7. Click Create policy. The policy is created and you are returned to the Roles page.
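If you prefer to script the console steps above, the following sketch may help. It parses a policy document, runs a few basic shape checks, and (optionally) attaches it as an inline policy with the boto3 `put_role_policy` call. The policy shown is a trimmed excerpt of the template for illustration only, and the helper names, role name, and policy name are all hypothetical; boto3 is a third-party package and is imported lazily so the parsing helper runs without it.

```python
import json

# Trimmed excerpt of the template above, for illustration only.
# In practice, load the full IAM Policy Template for AWS Glue Catalog.
POLICY_EXCERPT = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AccessGlueCatalog",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetDatabases"],
      "Resource": ["*"]
    }
  ]
}
"""


def load_policy(policy_json):
    """Parse a policy document and run basic sanity checks before upload."""
    policy = json.loads(policy_json)
    if policy.get("Version") != "2012-10-17":
        raise ValueError("unexpected policy Version")
    statements = policy.get("Statement")
    if not isinstance(statements, list) or not statements:
        raise ValueError("policy must contain at least one statement")
    for statement in statements:
        if statement.get("Effect") not in ("Allow", "Deny"):
            raise ValueError(f"bad Effect in statement {statement.get('Sid')!r}")
    return policy


def attach_inline_policy(role_name, policy_name, policy):
    """Attach the policy inline to an IAM role (requires boto3 and AWS credentials)."""
    import boto3  # third-party AWS SDK, imported lazily

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


policy = load_policy(POLICY_EXCERPT)
print([s["Sid"] for s in policy["Statement"]])
# Hypothetical role and policy names; uncomment to call AWS:
# attach_inline_policy("my-dremio-role", "dremio-glue-access", policy)
```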

AWS Lake Formation

AWS Lake Formation provides access controls for datasets in the AWS Glue Data Catalog and is used to define security policies from a centralized location that may be shared across multiple tools. You can configure Dremio to consult this service to verify user access to the datasets in the catalog.

Requirements

Lake Formation Workflow

When Lake Formation is properly configured, Dremio adheres to the following workflow each time an end user attempts to access, edit, or query datasets with managed privileges:

  1. Dremio enforces access control. See Configure Sources for Lake Formation for access control recommendations.
  2. Dremio checks each table stored in the AWS Glue Data Catalog to determine whether it is configured to use Lake Formation for security.
    • If one or more datasets leverage Lake Formation, Dremio determines the user ARNs to use when checking against Lake Formation.
  3. Dremio queries Lake Formation to determine a user's access level to the datasets using the user/group ARNs.
    • If the user has access to the datasets specified within the query's scope, the query proceeds.
    • If the user lacks access, the query fails with a permission error.
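The workflow above can be modeled roughly as follows. This is a simplified in-memory simulation, not Dremio's code: `authorize_query`, `PermissionDenied`, and the dictionaries standing in for Lake Formation are hypothetical, and the rule that a grant to any of the user's principals (user or group ARN) is sufficient is an assumption.

```python
class PermissionDenied(Exception):
    """Raised when a principal lacks access to a Lake Formation-managed table."""


def authorize_query(principal_arns, tables, lf_managed, lf_grants):
    """Simulate the check performed before a query runs.

    principal_arns: user and group ARNs resolved for the end user
    tables:         tables referenced by the query
    lf_managed:     set of tables configured to use Lake Formation
    lf_grants:      dict mapping a principal ARN to the set of tables it may read
    """
    managed = [t for t in tables if t in lf_managed]
    if not managed:
        return True  # no table in scope is governed by Lake Formation

    granted = set()
    for arn in principal_arns:
        granted |= lf_grants.get(arn, set())

    denied = [t for t in managed if t not in granted]
    if denied:
        raise PermissionDenied("no access to: " + ", ".join(denied))
    return True


grants = {"arn:example:user/alice": {"sales.orders"}}
print(authorize_query(["arn:example:user/alice"], ["sales.orders"], {"sales.orders"}, grants))
```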

Configure Sources for Lake Formation

Lake Formation integration depends on the mapping of user/group names in Dremio to the IAM user/group ARNs used by AWS.

To configure an existing or new AWS Glue Data Catalog, you must set the following options:

  1. Navigate to the Advanced Options tab.

  2. Enable Enforce AWS Lake Formation access permissions on datasets.

  3. Fill in the user and group prefix settings as instructed in the Lake Formation Permissions Reference. For example, if you are using a SAML provider in AWS:

    • User prefix with SAML: arn:aws:iam::<AWS_ACCOUNT_ID>:saml-provider/<PROVIDER_NAME_IN_AWS>:user/
    • Group prefix with SAML: arn:aws:iam::<AWS_ACCOUNT_ID>:saml-provider/<PROVIDER_NAME_IN_AWS>:group/
    Note: On the Privileges tab, we recommend enabling the Select privilege for All Users so that non-admin users can access the AWS Glue Data Catalog.
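As a rough illustration of how the prefix settings above produce ARNs, the sketch below simply prepends the prefix to the Dremio user or group name. Plain concatenation is an assumption (regular-expression prefixes are not modeled), and the account ID `123456789012`, the provider name `ExampleProvider`, and the function name are placeholders.

```python
def to_principal_arn(prefix, name):
    """Build the ARN checked against Lake Formation.

    With an empty prefix the result is just the Dremio name, matching the
    documented default of leaving the field blank.
    """
    return f"{prefix}{name}"


USER_PREFIX = "arn:aws:iam::123456789012:saml-provider/ExampleProvider:user/"
GROUP_PREFIX = "arn:aws:iam::123456789012:saml-provider/ExampleProvider:group/"

print(to_principal_arn(USER_PREFIX, "alice"))
print(to_principal_arn(GROUP_PREFIX, "analysts"))
print(to_principal_arn("", "alice"))  # blank prefix: the username is used as-is
```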

Lake Formation Cell-Level Security

Dremio supports AWS Lake Formation cell-level security with row-level access permissions based on AWS Lake Formation PartiQL expressions. If the user does not have read permissions on a column or cell, Dremio masks the data in that column or cell with a NULL value.
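The effect of cell-level security on a result set can be pictured with a small sketch. Here a Python predicate stands in for the PartiQL row filter and a set of column names stands in for the column permissions; both, along with the function name, are illustrative assumptions rather than how Dremio evaluates permissions.

```python
def apply_cell_level_security(rows, readable_columns, row_filter):
    """Keep only rows the user may see and mask unreadable columns with None (NULL)."""
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level permission: the row is not visible at all
        result.append(
            {col: (val if col in readable_columns else None) for col, val in row.items()}
        )
    return result


rows = [
    {"region": "EU", "customer": "Acme", "revenue": 120},
    {"region": "US", "customer": "Globex", "revenue": 300},
]
visible = apply_cell_level_security(
    rows,
    readable_columns={"region", "customer"},
    row_filter=lambda r: r["region"] == "EU",
)
print(visible)
```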

To speed up query planning, Dremio uses the AWS Lake Formation permissions cache for each table. By default, the cache is enabled and reuses previously loaded permissions for up to 3600 seconds (1 hour).
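A per-table permissions cache with a 3600-second lifetime behaves roughly like the time-to-live cache sketched below. The class and its `loader` callback are hypothetical; only the default of 3600 seconds comes from this page.

```python
import time


class PermissionsCache:
    """Per-table cache that reuses loaded permissions until they are older than the TTL."""

    def __init__(self, loader, ttl_seconds=3600, clock=time.monotonic):
        self._loader = loader    # hypothetical function: table name -> permissions
        self._ttl = ttl_seconds  # 3600 seconds (1 hour) by default
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries = {}       # table name -> (loaded_at, permissions)

    def get(self, table):
        now = self._clock()
        entry = self._entries.get(table)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: reuse previously loaded permissions
        permissions = self._loader(table)
        self._entries[table] = (now, permissions)
        return permissions


calls = []
cache = PermissionsCache(lambda table: calls.append(table) or {"select"})
cache.get("sales.orders")
cache.get("sales.orders")  # served from the cache; the loader is not called again
print(len(calls))
```

A monotonic clock is used so that wall-clock adjustments do not extend or shorten the lifetime of an entry.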

Limitations

  • VPC-restricted S3 buckets are not supported.