    AWS Glue Catalog

    AWS Glue Catalog is a metadata store that lets you store and share metadata in the AWS cloud.

    Configuring AWS Glue Catalog as a Dremio Cloud source requires changes in your AWS account and your Dremio Cloud account.

    This topic describes how to configure an AWS Glue Catalog as a Dremio Cloud source.

    Supported Formats

    Dremio Cloud supports the following file and table formats for querying data.

    File Formats

    • CSV
    • Delimited files
    • XLSX
    • JSON
    • Parquet

    Table Formats

    • Apache Iceberg
    • Delta Lake

    AWS Configuration

    Add new permissions to the existing IAM users/roles that were created for Dremio Cloud, or create a new IAM role/user for the Glue Catalog source configuration.

    Authentication using Project Data Credentials

    Use Project Data Credentials to access Glue Catalog using the Access Key or IAM role associated with your project. The key/role was created during signup and is the default credential used to access all sources in your project. To use Project Data Credentials to access the source, modify the permissions associated with the project IAM role/user using policies that allow Dremio Cloud to access the Glue Catalog.

    For steps on how to attach new permission policies to an existing IAM role/user, see setting up AWS permissions.

    Authentication using Data Source Credentials

    Use Data Source Credentials to access Glue Catalog using a source-specific Access Key or IAM role. The project key/role assumes the source-specific role that you create to access Glue Catalog. To use source-specific credentials:

    1. Create an IAM role/user and add policies to provide Dremio Cloud access to Glue Catalog.
    2. Modify the project key/role to grant it permissions to assume the source-specific role you created.
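
    As a rough illustration of step 2, a permissions policy attached to the project role might look like the following sketch. SOURCE_ROLE_NAME is a hypothetical placeholder for the source-specific role you created, and the policy shape is an assumption based on standard IAM role assumption, not a Dremio-provided policy.

    ```json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": "arn:aws:iam::ACCOUNT_ID:role/SOURCE_ROLE_NAME"
            }
        ]
    }
    ```

    The trust policy of the source-specific role generally also needs to name the project role as a trusted principal before the role can be assumed.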

    For steps on how to add IAM policies, see:

    IAM Policy for Accessing Glue

    The following IAM policy contains the minimum policy requirements to allow Dremio Cloud to read and query Glue Catalog. Replace the uppercase placeholders (ACCOUNT_ID, DATABASE_NAME, TABLE_NAME, and BUCKET_NAME) with actual values. ACCOUNT_ID is the ID of your AWS account.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "glue:GetTable",
                    "glue:GetTableVersions",
                    "glue:GetTables",
                    "glue:GetConnection",
                    "glue:GetConnections",
                    "glue:GetDevEndpoint",
                    "glue:GetDevEndpoints",
                    "glue:GetUserDefinedFunction",
                    "glue:GetUserDefinedFunctions",
                    "glue:BatchGetPartition"
                ],
                "Resource": [
                    "arn:aws:glue:*:ACCOUNT_ID:catalog",
                    "arn:aws:glue:*:ACCOUNT_ID:table/DATABASE_NAME/TABLE_NAME",
                    "arn:aws:glue:*:ACCOUNT_ID:database/DATABASE_NAME"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListAllMyBuckets"
                ],
                "Resource": [
                    "arn:aws:s3:::*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::BUCKET_NAME"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::BUCKET_NAME/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:CreateBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-glue-*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-glue-*/*",
                    "arn:aws:s3:::*/*aws-glue-*/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::crawler-public*",
                    "arn:aws:s3:::aws-glue-*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": [
                    "arn:aws:logs:*:*:/aws-glue/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateTags",
                    "ec2:DeleteTags"
                ],
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "aws:TagKeys": [
                            "aws-glue-service-resource"
                        ]
                    }
                },
                "Resource": [
                    "arn:aws:ec2:*:*:network-interface/*",
                    "arn:aws:ec2:*:*:security-group/*",
                    "arn:aws:ec2:*:*:instance/*"
                ]
            }
        ]
    }
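
    For illustration, the first statement of the policy might look as follows once the placeholders are filled in. The account ID 123456789012 and the database name sales_db are hypothetical values, and the trailing /* in the table ARN is an assumption that you want to cover every table in that database rather than a single TABLE_NAME.

    ```json
    {
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:GetTable",
            "glue:GetTableVersions",
            "glue:GetTables",
            "glue:GetConnection",
            "glue:GetConnections",
            "glue:GetDevEndpoint",
            "glue:GetDevEndpoints",
            "glue:GetUserDefinedFunction",
            "glue:GetUserDefinedFunctions",
            "glue:BatchGetPartition"
        ],
        "Resource": [
            "arn:aws:glue:*:123456789012:catalog",
            "arn:aws:glue:*:123456789012:table/sales_db/*",
            "arn:aws:glue:*:123456789012:database/sales_db"
        ]
    }
    ```

    Listing individual tables instead of using the wildcard restricts Dremio Cloud to those tables only, at the cost of updating the policy whenever a table is added.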
    

    Dremio Cloud Glue Catalog Configuration

    Perform the following steps to configure the Glue Catalog:

    1. In the Datasets UI, click the plus (+) icon to add a source in the Data Lakes section.

      Alternatively, click Data Lakes to display all data lake sources. Click the Add Data Lake button at the top-right of that page.

    2. In the Add Data Lake dialog, click Amazon Glue Catalog under Table Stores. The following section describes the source configuration tabs.

    General

    The fields in the General tab are required to configure a Glue Catalog source. Perform the following steps in the General tab:

    1. For Name, enter a name.

    2. For Authentication, select one of the following options.

      | Authentication Option | Description | Configuration Steps |
      | --- | --- | --- |
      | Project Data Credentials | Default credentials used to access all sources in your project. Added as part of your signup and project creation. | Project Data Credentials with Access Key/IAM Role |
      | Data Source Credentials | Credentials to access a specific source. Created as part of the source configuration. | Data Source Credentials with Access Key |
      | Data Source Credentials | Credentials to access a specific source. Created as part of the source configuration. Project role/key assumes this role to access the source. | Data Source Credentials with IAM Role |

    3. (Optional) For Encrypt connection, check the box to secure connections between the Glue Catalog source and Dremio Cloud.

    Advanced Options

    Click Advanced Options in the sidebar.

    note:

    All advanced options are optional.

    | Advanced Option | Description |
    | --- | --- |
    | Enable asynchronous access when possible | Enables cloud caching for the Glue Catalog while adding a new source or editing it. This option is enabled by default. |
    | Connection Properties | Custom key-value pairs for the connection relevant to the source. |

    To add a connection property:

    1. Click Add Property.
    2. For Name, enter a connection property.
    3. For Value, enter the corresponding connection property value.

    The following are the cache options.

    | Advanced Option | Description |
    | --- | --- |
    | Enable local caching when possible | Selected by default, along with asynchronous access, for cloud caching. |
    | Max percent of total available cache space to use when possible | Specifies the disk quota that a source can use on any single executor node. Applies only when local caching is enabled. The default is 100 percent of the total disk space available on the mount point provided for caching. Click the up/down arrow of the value text field to change the percentage. |

    Metadata

    Click Metadata in the sidebar to configure the Dataset Handling and Metadata Refresh parameters, which control how Dremio Cloud handles datasets and refreshes metadata.

    Dataset Handling

    These are the Dataset Handling parameters.

    note:

    All Dataset Handling parameters are optional.

    | Parameter | Description |
    | --- | --- |
    | Remove dataset definitions if underlying data is unavailable | By default, Dremio Cloud removes dataset definitions if underlying data is unavailable. Useful when files are temporarily deleted and added back in the same location with new sets of files. |
    | Automatically format files into physical datasets when users issue queries | Choose this option if you want Dremio Cloud to automatically format files into datasets when you run queries. Useful when the data contains CSV files with non-default options. |

    Metadata Refresh

    These are the Metadata Refresh parameters:

    • Dataset Discovery: The refresh interval for fetching top-level source object names such as databases and tables. Set the time interval using this parameter.

      | Parameter | Description |
      | --- | --- |
      | (Optional) Fetch every | The frequency at which to fetch object names, in minutes, hours, days, or weeks. The default is 1 hour. |

    • Dataset Details: The metadata that Dremio Cloud needs for query planning such as information required for fields, types, shards, statistics, and locality. These are the parameters to fetch the dataset information.

    note:

    All Dataset Details parameters are optional.

    | Parameter | Description |
    | --- | --- |
    | Fetch mode | Fetch only from queried datasets, which is the default. Dremio Cloud updates details for previously queried objects in a source. Fetching from all datasets is deprecated. |
    | Fetch every | The frequency at which to fetch dataset details, in minutes, hours, days, or weeks. The default is 1 hour. |
    | Expire after | The expiry time of dataset details, in minutes, hours, days, or weeks. The default is 3 hours. |

    Privileges

    You can grant privileges to specific users or roles.

    1. (Optional) For Privileges, enter the user name or role name that you want to grant access to and click the Add to Privileges button. The added user or role is displayed in the Users table.
    2. (Optional) For the users or roles in the Users table, toggle the green checkmark for each privilege you want to grant on the Glue Catalog source that is being created.

    Click Save after setting the configuration.

    Editing a Glue Catalog Source

    To edit a Glue Catalog:

    1. In the Datasets UI, click Data Lakes at the bottom-left of the page. A list of data lakes is displayed.
    2. Under the Action column, click the Settings (gear) icon for the data lake source you want to edit. From the list of actions, click Edit Details. Alternatively, you can click the data lake and click the Settings (gear) icon of the source dialog.
    3. In the Edit Source dialog, you cannot edit the name and AWS region. Editing all other parameters is optional. For parameters and advanced options, see Dremio Cloud Glue Catalog Configuration.
    4. Click Save.

    Removing a Glue Catalog Source

    To remove a Glue Catalog source, perform these steps:

    1. In the Datasets UI, click Data Lakes at the bottom-left of the page. A list of data lakes is displayed.
    2. Under the Action column, click the Settings (gear) icon for the data lake source that you want to delete.
    3. From the list of actions, click Remove Source. Confirm that you want to remove the source.

    Limitations

    • VPC-restricted buckets are not supported.