    Tables

    The SQL commands for tables enable you to create and work with tables in the scratch directory and in data lake sources (such as Amazon S3). Apache Iceberg tables are created in AWS Glue data sources, Amazon S3 data sources, or external Nessie data sources.

    Tables created in these sources can be queried, altered, analyzed, and dropped by users and groups with the appropriate permissions. For information about querying tables (including Apache Iceberg tables), see SELECT Statements.

    Apache Iceberg Tables

    Prerequisites for Apache Iceberg Tables

    Before you attempt to create Iceberg tables, ensure that you are using an AWS Glue, Amazon S3, or external Nessie data source. Also ensure that any existing Apache Iceberg tables that you want to use in your data source are at Apache Iceberg version 2.

    Default Table Formats Used for New Tables

    AWS Glue data sources added to projects default to using the Apache Iceberg table format.

    Amazon S3 data sources added to projects default to using the Apache Parquet table format. Follow these steps to ensure that the default table format for new tables is Apache Iceberg:

    1. In Dremio, click the Amazon S3 data source.
    2. Click the gear icon in the top-right corner above the list of the data source’s contents.
    3. On the Advanced Options page of the Edit Source dialog, select ICEBERG under Default CTAS Format.
    4. Click Save.
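
    Once Apache Iceberg is the default CTAS format, a plain CTAS statement against the source creates an Iceberg table. A minimal sketch, reusing the placeholder path style from the examples below (the table name demo_iceberg_table is hypothetical):

    CREATE TABLE <Amazon_S3_data_source>.folder1.demo_iceberg_table AS
      SELECT * FROM Samples."samples.dremio.com"."zips.json"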

    Locations in which Iceberg Tables are Created

    Where the CREATE TABLE command creates a table depends on the type of data source being used.

    Location in AWS Glue Data Sources

    The root directory is assumed by default to be /user/hive/warehouse.

    If you want to create tables in a different location, you must specify the S3 address of an Amazon S3 bucket in which to create them:

    1. In Dremio Cloud, click the AWS Glue data source.
    2. Click the gear icon in the top-right corner above the list of the data source’s contents.
    3. On the Advanced Options page of the Edit Source dialog, add this connection property: hive.metastore.warehouse.dir
    4. Set the value to the S3 address of an S3 bucket.
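
    For example, a hypothetical value might be s3://my-company-bucket/warehouse; new tables would then be created under that path.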

    The schema path and table name are appended to the root location to determine the default physical location for a new table. For example, this CREATE TABLE command creates the table table_A in the directory <rootdir>/database/schema/table_A:

    CREATE TABLE database.schema.table_A
    

    Location in Amazon S3 Data Sources

    The root physical location is the root directory of the filesystem. The folder path and table name are appended to this location to determine the physical location for a new table.

    For example, this CREATE TABLE command creates the table table_A in the directory rootdir/folder1/folder2/table_A:

    CREATE TABLE <Amazon_S3_data_source>.folder1.folder2.table_A
    

    Location in Nessie Data Sources

    Top-level Nessie schemas have configurable physical storage, which is used as the default root physical location.

    In the project store, each top-level Nessie schema has its own directory path. For example, in the project’s Nessie catalog, the top-level schema “marketing” would be located in “project_store/marketing”, and this directory would be used by default as the root physical location. From there, the same schema and table name resolution described for AWS Glue data sources above applies.
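
    For example, following the same placeholder style as the Amazon S3 example above (the table name campaigns is hypothetical), this command would create the table campaigns in the directory project_store/marketing/campaigns:

    CREATE TABLE <Nessie_source>.marketing.campaigns AS
      SELECT * FROM Samples."samples.dremio.com"."zips.json"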

    Sources and Privileges

    In order to use these SQL commands, the appropriate privileges must be set for each data source that you want to work with, including the spaces of other users in your project and the connected data lake sources. For information about setting privileges, see Privileges.

    Creating a Table (CTAS)

    Tables created with the CREATE TABLE AS (CTAS) command are stored in the Parquet file format, or in the Apache Iceberg table format if that is the data source’s default CTAS format (see Default Table Formats Used for New Tables above).

    Syntax
    CREATE TABLE [ IF NOT EXISTS ] <table_path>
      [ PARTITION BY ( <column_name1>, <column_name2>, ... ) ]
      -- For Apache Iceberg tables only:
      [ PARTITION BY ( { <column_name> | <partition_transform> } [ , ... ] ) ]
      [ LOCALSORT BY ( <column_name> ) ]
      AS <query>
    

    Parameters

    <table_path>

    String

    The path of the table that you want to create. The name of the table must be unique. The table can be in the scratch directory or a data lake source.


    AS <query>

    String

    Use the SELECT statement to populate the new table using data from an existing table or view.

    Parameters for Iceberg Tables Only

    PARTITION BY ( { <column_name> | <partition_transform> } [ , ... ] )

    String

    Optional

    Partitions the table data on the values in a single column or by using one of these partition-transformation functions:

    • identity( <col> ): Explicitly specified identity transform.
    • year( <ts_col> ): Partition by year. The column must use the TIMESTAMP data type.
    • month( <ts_col> ): Partition by month. The column must use the TIMESTAMP data type.
    • day( <ts_col> ): Partition by day. The column must use the TIMESTAMP data type.
    • hour( <ts_col> ): Partition by hour. The column must use the TIMESTAMP data type.
    • bucket( <count>, <col> ): Partition by hashed value into <count> buckets.
    • truncate( <length>, <col> ): Partition by truncated value. Strings are truncated to the specified length; integer and bigint values are truncated to bins (for example, truncate(10, i) produces 0, 10, 20, and so on).
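
    To illustrate the transform syntax, here is a minimal sketch that assumes a hypothetical Iceberg-capable source iceberg_source with an orders table containing a TIMESTAMP column order_ts and an integer column customer_id:

    CREATE TABLE iceberg_source.sales.orders_by_month
      PARTITION BY (month(order_ts), bucket(8, customer_id))
      AS SELECT * FROM iceberg_source.sales.orders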

    Optional Parameters

    IF NOT EXISTS

    String

    Optional

    Causes Dremio Cloud to run the statement only if a table with the same name in the same path does not already exist.
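
    For example, rerunning the first CTAS example below with IF NOT EXISTS makes the statement a no-op, rather than an error, if $scratch.demo_table already exists:

    CREATE TABLE IF NOT EXISTS $scratch.demo_table AS
      SELECT * FROM Samples."samples.dremio.com"."zips.json"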


    PARTITION BY ( <column_name1>, <column_name2>, ... )

    String

    Optional

    The names of the columns that you want to partition the table by. When you designate columns using this clause, Dremio stores the rows containing the same hash partition values in their own Parquet files. This speeds up queries whose WHERE clause filters on the PARTITION BY columns (see the example query after the notes below).

    Notes:

    • If you partition a table by a column that has more distinct values than Dremio can partition by, you will receive an error message; if so, try partitioning the table by a different column.

    • If the table is not partitioned, you will receive an error when you attempt to refresh its metadata using the FOR PARTITIONS clause.
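
    For instance, a query against the demo_table2 table created in the Examples section below can skip the Parquet files whose partition value does not match the filter:

    SELECT * FROM $scratch.demo_table2 WHERE state = 'TX'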


    LOCALSORT BY ( <column_name> )

    String

    Optional

    The name of the column that you want to sort the table by. Sorts each fragment (that is, each Parquet file) by the designated column. If your ORDER BY column matches your LOCALSORT column, or you apply a range filter on your LOCALSORT column, the query runs faster because the data is already sorted.
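
    For example, because the demo_table2 table created in the Examples section below is locally sorted by city, a range filter on that column (the city values shown here are hypothetical) scans data that is already in order:

    SELECT * FROM $scratch.demo_table2 WHERE city BETWEEN 'ABBEVILLE' AND 'DENVER'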

    Examples

    Note:

    The following examples use Dremio’s Samples Data Lake, which includes a number of sample datasets that you can practice with. If you have not added this data lake to your project, see Add Dremio’s Samples Data Lake to Your Project.

    Create a table in your project store
    CREATE TABLE $scratch.demo_table AS
      SELECT * 
      FROM Samples."samples.dremio.com"."zips.json"
    
    Create a table in your project store using PARTITION BY and LOCALSORT
    CREATE TABLE $scratch.demo_table2
      PARTITION BY (state)
      LOCALSORT BY (city)
      AS SELECT * 
      FROM Samples."samples.dremio.com"."zips.json"
    

    Altering a Table

    The ALTER TABLE command updates a table’s schema in a new snapshot.

    Syntax
    ALTER TABLE <table_path>
    [ ADD COLUMNS ( <column_name1> <data_type1>( <size1> ), <column_name2> <data_type2>( <size2> ), ...) ]
    | [ DROP COLUMN <column_name> ]
    | [ { ALTER | MODIFY } COLUMN <source_column_name> <source_column_name> <new_data_type>( <size> ) ]
    | [ REFRESH METADATA ] [ FOR PARTITIONS ( <partition_name> = '<value>' ) ]
        [ { AVOID | AUTO } PROMOTION ]
        [ { FORCE | LAZY } UPDATE ]
        [ { MAINTAIN | DELETE } WHEN MISSING ]
    | [ FORGET METADATA ]
    

    Parameters

    <table_path>

    String

    The path of the table that you want to alter. The table can be in the scratch directory or a data lake source.

    Optional Parameters

    ADD COLUMNS ( <column_name1> <data_type1>( <size1> ), <column_name2> <data_type2>( <size2> ), ...)

    String

    Creates one or more columns that have the specified names, data types, and character limits.


    DROP COLUMN <column_name>

    String

    Drops the specified column. This action cannot be undone.


    { ALTER | MODIFY } COLUMN <source_column_name> <source_column_name> <new_data_type>( <size> )

    String

    Changes the data type for a specified column.

    Note:

    Currently, for non-Iceberg tables, you can only modify a column’s data type; columns cannot be renamed.


    REFRESH METADATA

    Refreshes the metadata associated with the specified table. You can perform either a full or a partial metadata refresh: a full refresh updates the metadata for the entire table, while a partial refresh updates only the partitions that you specify. Optional clauses are available for refreshing a table's metadata. If you use two or more of these clauses, you must enter them in this priority order:

    1. Promotion option: either AVOID PROMOTION or AUTO PROMOTION

    2. Update option: either FORCE UPDATE or LAZY UPDATE

    3. Missing option: either MAINTAIN WHEN MISSING or DELETE WHEN MISSING

    Note:

    If the table is not partitioned, you will receive an error when you attempt to refresh the metadata using the FOR PARTITIONS clause.


    FORGET METADATA

    Deletes the metadata information stored in Dremio for the specified table until the next metadata refresh. The dataset can still be queried using SQL.

    REFRESH METADATA FOR PARTITIONS ( <partition_name> = '<value>' )

    String

    Optional

    Use this clause to perform a partial refresh of the table's metadata.

    • <partition_name> identifies the name of the partition to be refreshed.

    • <value> identifies the specific partition that should be refreshed. It must be enclosed in single quotes.


    { AVOID | AUTO } PROMOTION

    Optional

    Optional clauses that determine whether files and folders are promoted to datasets when you run a query. If you use the PROMOTION clause with the other optional clauses, you must enter them in a priority order: (1) PROMOTION clause, (2) UPDATE clause, and (3) MISSING clause.

    • AVOID PROMOTION: Prevents queries from promoting files/folders to datasets.
    • AUTO PROMOTION: Allows queries to promote files/folders to datasets. This is the default option when you do not include a promotion clause.


    { FORCE | LAZY } UPDATE

    Optional

    Optional clauses that determine whether metadata is updated when you run a query. If you use the UPDATE clause with the other optional clauses, you must enter them in a priority order: (1) PROMOTION clause, (2) UPDATE clause, and (3) MISSING clause.

    • FORCE UPDATE: Forces a full update of metadata.
    • LAZY UPDATE: Does not perform a full update of metadata. This is the default option when you do not include an update clause.


    { MAINTAIN | DELETE } WHEN MISSING

    Optional

    Optional clauses that determine how missing metadata is handled when you run a query. If you use the MISSING clause with the other optional clauses, you must enter them in a priority order: (1) PROMOTION clause, (2) UPDATE clause, and (3) MISSING clause.

    • MAINTAIN WHEN MISSING: Prevents missing metadata from being deleted during refresh.
    • DELETE WHEN MISSING: Deletes missing metadata during refresh. This is the default option when you do not include a missing clause.

    Examples

    Note:

    The following examples use Dremio’s Samples Data Lake, which includes a number of sample datasets that you can practice with. If you have not added this data lake to your project, see Add Dremio’s Samples Data Lake to Your Project.

    Add a column
    ALTER TABLE $scratch.demo_table ADD COLUMNS (county varchar)
    
    Modify a column
    ALTER TABLE $scratch.demo_table MODIFY COLUMN _id _id bigint
    
    Refresh all the metadata for a table
    ALTER TABLE $scratch.demo_table REFRESH METADATA
    
    Refresh all the metadata for a table using optional clauses
    ALTER TABLE $scratch.demo_table REFRESH METADATA
      AUTO PROMOTION 
      LAZY UPDATE 
      MAINTAIN WHEN MISSING
    
    Refresh the metadata for a single partition
    ALTER TABLE Samples."samples.dremio.com"."zips.json"  
      REFRESH METADATA FOR PARTITIONS (state = 'TX')
    
    Refresh the metadata for a single partition using optional clauses
    ALTER TABLE Samples."samples.dremio.com"."zips.json"  
      REFRESH METADATA FOR PARTITIONS (state = 'TX')
      AUTO PROMOTION 
      LAZY UPDATE 
      MAINTAIN WHEN MISSING
    
    Forget the metadata for a table
    ALTER TABLE $scratch.demo_table FORGET METADATA
    

    Analyzing a Table

    Computes or deletes statistics for tables, including:

    • Estimated number of distinct values
    • Number of rows
    • Number of null values
    Syntax
    ANALYZE TABLE <table_path>
      FOR { ALL COLUMNS | COLUMNS ( <column_name1>, <column_name2>, ... ) }
      { COMPUTE | DELETE } STATISTICS 
    

    Parameters

    <table_path>

    String

    The path of the table that you want to generate statistics for. The table can be in the scratch directory or a data lake source.


    { ALL COLUMNS | COLUMNS ( <column_name1>, <column_name2>, ... ) }

    String

    The columns that you want to generate statistics for. You can choose to either include all columns or specify one or more columns for the statistics.

    Note:

    If you do not specify columns, Dremio computes statistics for all the columns contained in the table. To compute statistics for specific columns, list them separated by commas.


    { COMPUTE | DELETE } STATISTICS

    String

    Choose to either compute or delete statistics for the specified columns in the table.

    Examples

    Analyze a table
    ANALYZE TABLE $scratch.demo_table
      FOR ALL COLUMNS 
      COMPUTE STATISTICS
    
    Analyze the specified columns for a table
    ANALYZE TABLE $scratch.demo_table
      FOR COLUMNS (state, city) 
      COMPUTE STATISTICS
    

    Dropping a Table

    Note:

    When dropping an Apache Iceberg table:

    • For external Nessie data sources, the DROP TABLE command logically removes a table from the source. Even though the table is removed from the catalog, it still physically exists in storage until garbage collection removes it.

    • For Amazon S3 data sources, the DROP TABLE command logically removes a table from the source and physically removes all files associated with the table. After the table is dropped, it is permanently deleted and cannot be restored.

    • For AWS Glue data sources, the DROP TABLE command removes a table from the catalog. The datafiles are not deleted from the warehouse.

    Syntax
    DROP TABLE <table_path>
    

    Parameters

    <table_path>

    String

    The path of the table that you want to drop. The table can be in the scratch directory or a data lake source.

    Examples

    Drop a table
    DROP TABLE $scratch.demo_table