Transforming Data on Load
Dremio allows data transformations during loading with the COPY INTO SQL command, simplifying the ETL process by enabling simple transformations as data is ingested.
Common use cases for this functionality include:
- Selecting a subset of columns from source files to write into the target table.
- Reordering columns in source files to match the order in the target table.
- Converting data types from source files into target table type, which is most common with CSVs.
- Applying simple elementwise functions, which would require additional processing later.
Supported File Formats
Dremio's COPY INTO SQL command with transformations supports Parquet and CSV file formats.
Select a Subset of Columns
Imagine we receive a file that contains the available inventory in each store every day; however, the file also includes item details that we don't need. We need to load only a subset of the data and exclude information that does not help build our data product. Since the provided CSV file does not include headers, we must set EXTRACT_HEADER to FALSE (see file_column_number for more information). This then allows us to reference the columns by their order in the file using the following query:
COPY INTO inventory(store_id, stock_date, item_id, amount)
FROM (SELECT $1, $5, $2, $6 FROM '@mystage/storeinfo.csv')
FILE_FORMAT 'csv'
(EXTRACT_HEADER FALSE);
Reorder Columns
Sometimes, the data in the underlying file does not match how you want to materialize the data in Dremio. In the example below, we have a CSV snapshot of our customer data, and the cust_id column is the fifth column in the file. As a best practice, we want to move the table's primary key to the first column of the table. Because the CSV file includes headers, we must reference the columns by their header names as defined in the file. To achieve this, we load the data with the following:
COPY INTO dim_customer(cust_id, name, address, age)
FROM (SELECT cust_id, name, address, age FROM '@mystage/customer.csv')
FILE_FORMAT 'csv';
Convert Data Types
If you want to convert the source data types of your files into target types, you can add that to your SELECT statement as well.
COPY INTO dim_customer(cust_id, name, address, age)
FROM (SELECT CAST(cust_id AS INTEGER), name, address, age FROM '@mystage/customer.csv')
FILE_FORMAT 'csv';
Supported Functions
- ARRAY_APPEND
- ARRAY_CAT
- ARRAY_COMPACT
- ARRAY_DISTINCT
- ARRAY_INSERT
- ARRAY_PREPEND
- ARRAY_REMOVE
- ARRAY_REMOVE_AT
- ARRAY_SIZE
- ARRAY_SLICE
- ARRAY_TO_STRING
- ASCII
- BASE64
- BIN
- BINARY_STRING
- BROUND
- BTRIM
- CASE
- CAST
- CEILING
- CHARACTER_LENGTH
- CHAR_LENGTH
- CHR
- COALESCE
- CONCAT
- CONCAT_WS
- CONVERT_FROM
- CONVERT_REPLACEUTF8
- CONVERT_TIMEZONE
- CONVERT_TO
- CURRENT_DATE
- CURRENT_DATE_UTC
- CURRENT_SCHEMA
- CURRENT_TIME
- CURRENT_TIMESTAMP
- DATEDIFF
- DATETYPE
- DATE_ADD
- DATE_DIFF
- DATE_PART
- DATE_SUB
- DATE_TRUNC
- DAY
- DAYOFMONTH
- DAYOFWEEK
- DAYOFYEAR
- ENCODE
- EXTRACT
- FLOOR
- FROM_HEX
- HASH
- HASH64
- HEX
- HOUR
- ILIKE
- INITCAP
- ISDATE
- IS [NOT] FALSE
- IS [NOT] NULL
- ISNUMERIC
- IS [NOT] TRUE
- IS_BIGINT
- IS_INT
- IS_MEMBER
- IS_SUBSTR
- IS_UTF8
- IS_VARCHAR
- LAST_DAY
- LCASE
- LEFT
- LENGTH
- LIKE
- LOCALTIME
- LOCALTIMESTAMP
- LOWER
- LPAD
- LTRIM
- MASK
- MASK_FIRST_N
- MASK_HASH
- MASK_LAST_N
- MASK_SHOW_FIRST_N
- MASK_SHOW_LAST_N
- MD5
- MINUTE
- MONTH
- MONTHS_BETWEEN
- NEXT_DAY
- NORMALIZE_STRING
- NOW
- NULLIF
- OCTET_LENGTH
- PARSE_URL
- QUARTER
- REGEXP_COL_LIKE
- REGEXP_EXTRACT
- REGEXP_REPLACE
- REPLACE
- REVERSE
- RIGHT
- ROUND
- RPAD
- RTRIM
- SECOND
- SET_UNION
- SHA, SHA1
- SHA256
- SHA512
- SPLIT_PART
- STARTS_WITH
- STRING_BINARY
- SUBLIST
- SUBSTR
- SUBSTRING
- SUBSTRING_INDEX
- TIMESTAMPADD
- TIMESTAMPDIFF
- TIMESTAMPTYPE
- TOASCII
- TO_CHAR
- TO_DATE
- TO_HEX
- TO_NUMBER
- TO_TIME
- TO_TIMESTAMP
- TRIM
- TRUNCATE
- TYPEOF
- UCASE
- UNBASE64
- UNHEX
- UNIX_TIMESTAMP
- UPPER
- WEEK
- WEEKOFYEAR
- YEAR
Unsupported Capabilities
The SELECT statement used for transforming data in the COPY INTO SQL command has the following restrictions:
WHEREclauseORDER BYstatementGROUP BYstatementJOIN- aggregate and window functions
Unsupported file types
The COPY INTO with transformations capability does not support JSON, AVRO, or ORC file types.