Transforming Data on Load
Dremio allows data transformations during loading with the COPY INTO
SQL command, simplifying the ETL process by enabling simple transformations as data is ingested.
Common use cases for this functionality include:
- Selecting a subset of columns from source files to write into the target table.
- Reordering columns in source files to match the order in the target table.
- Converting data types from source files into target table type, which is most common with CSVs.
- Applying simple elementwise functions, which would require additional processing later.
Supported File Formats
Dremio's COPY INTO
SQL command with transformations supports Parquet and CSV file formats.
Select a Subset of Columns
Imagine we receive a file that contains the available inventory in each store every day; however, the file also includes item details that we don't need. We need to load only a subset of the data and exclude information that does not help build our data product. To achieve this, we run the following:
COPY INTO inventory( store_id, stock_date, item_id, amount,)
FROM (SELECT t.$1, t.$5, t.$2, t.$6 FROM '@mystage/customer/.csv' t)
FILE_FORMAT 'csv';
Reorder Columns
Sometimes, the data in the underlying file does not match how you want to materialize the data in Dremio. In the example below, we have a CSV snapshot of our customer data, and the cust_id
column is the fifth column in the file. As a best practice, we want to move the table's primary key to the first column of the table. To achieve this, we load the data with the following:
COPY INTO dim_customer(cust_id, name, address, age)
FROM (SELECT t.$5, t.$1, t.$2, t.$3 FROM '@mystage/customer/.csv' t)
FILE_FORMAT 'csv';
Convert Data Types
If you want to convert the source data types of your files into target types, you can add that to your SELECT
statement as well.
COPY INTO dim_customer(cust_id, name, address, age)
FROM (SELECT CAST(t.$5 AS INTEGER), .$1, t.$2, t.$3 FROM '@mystage/customer/.csv' t)
FILE_FORMAT 'csv';
Supported Functions
- ARRAY_APPEND
- ARRAY_CAT
- ARRAY_COMPACT
- ARRAY_DISTINCT
- ARRAY_INSERT
- ARRAY_PREPEND
- ARRAY_REMOVE
- ARRAY_REMOVE_AT
- ARRAY_SIZE
- ARRAY_SLICE
- ARRAY_TO_STRING
- ASCII
- BASE64
- BIN
- BINARY_STRING
- BROUND
- BTRIM
- CASE
- CAST
- CEILING
- CHARACTER_LENGTH
- CHAR_LENGTH
- CHR
- COALESCE
- CONCAT
- CONCAT_WS
- CONVERT_FROM
- CONVERT_REPLACEUTF8
- CONVERT_TIMEZONE
- CONVERT_TO
- CURRENT_DATE
- CURRENT_DATE_UTC
- CURRENT_SCHEMA
- CURRENT_TIME
- CURRENT_TIMESTAMP
- DATEDIFF
- DATETYPE
- DATE_ADD
- DATE_DIFF
- DATE_PART
- DATE_SUB
- DATE_TRUNC
- DAY
- DAYOFMONTH
- DAYOFWEEK
- DAYOFYEAR
- ENCODE
- EXTRACT
- FLOOR
- FROM_HEX
- HASH
- HASH64
- HEX
- HOUR
- ILIKE
- INITCAP
- ISDATE
- IS [NOT] FALSE
- IS [NOT] NULL
- ISNUMERIC
- IS [NOT] TRUE
- IS_BIGINT
- IS_INT
- IS_MEMBER
- IS_SUBSTR
- IS_UTF8
- IS_VARCHAR
- LAST_DAY
- LCASE
- LEFT
- LENGTH
- LIKE
- LOCALTIME
- LOCALTIMESTAMP
- LOWER
- LPAD
- LTRIM
- MASK
- MASK_FIRST_N
- MASK_HASH
- MASK_LAST_N
- MASK_SHOW_FIRST_N
- MASK_SHOW_LAST_N
- MD5
- MINUTE
- MONTH
- MONTHS_BETWEEN
- NEXT_DAY
- NORMALIZE_STRING
- NOW
- NULLIF
- OCTET_LENGTH
- PARSE_URL
- QUARTER
- REGEXP_COL_LIKE
- REGEXP_EXTRACT
- REGEXP_REPLACE
- REPLACE
- REVERSE
- RIGHT
- ROUND
- RPAD
- RTRIM
- SECOND
- SET_UNION
- SHA, SHA1
- SHA256
- SHA512
- SPLIT_PART
- STARTS_WITH
- STRING_BINARY
- SUBLIST
- SUBSTR
- SUBSTRING
- SUBSTRING_INDEX
- TIMESTAMPADD
- TIMESTAMPDIFF
- TIMESTAMPTYPE
- TOASCII
- TO_CHAR
- TO_DATE
- TO_HEX
- TO_NUMBER
- TO_TIME
- TO_TIMESTAMP
- TRIM
- TRUNCATE
- TYPEOF
- UCASE
- UNBASE64
- UNHEX
- UNIX_TIMESTAMP
- UPPER
- WEEK
- WEEKOFYEAR
- YEAR
Unsupported Capabilities
The SELECT
statement used for transforming data in the COPY INTO
SQL command has the following restrictions:
WHERE
clauseORDER BY
statementGROUP BY
statementJOIN
- aggregate and window functions
Unsupported file types
The COPY INTO
with transformations capability does not support JSON, AVRO, or ORC file types.