Skip to main content

COPY INTO

You can load data from CSV, JSON, or Parquet files that are in a source into an existing Apache Iceberg table.

caution

For information about how the COPY INTO <table> operation works, as well as the supported storage locations and requirements, see Copying Data Into Apache Iceberg Tables.

Syntax

Syntax
COPY INTO <table_name>
[ AT { REF[ERENCE] | BRANCH } <reference_name> ]
FROM '@<storage_location_name>[/<path>[/<file_name>] ]'
[ FILES ( '<file_name>' [ , ... ] ) |
REGEX '<regex_pattern>' ]
[ FILE_FORMAT 'csv' | 'json' | 'parquet' ]
[ ( [csv_format_options] | [json_format_options] ] | [parquet_format_options] ) ]
csv_format_options
[ DATE_FORMAT '<string>' ]
[ EMPTY_AS_NULL [ '<boolean>' ] [, ...] ]
[ ESCAPE_CHAR '<escape_character>' ]
[ EXTRACT_HEADER '<boolean>' ]
[ FIELD_DELIMITER '<character>' ]
[ NULL_IF ( '<string>' [, ...] ) ]
[ ON_ERROR { 'abort' | 'continue' } | 'skip_file' } ]
[ QUOTE_CHAR '<character>' ]
[ RECORD_DELIMITER '<character>' ]
[ SKIP_LINES <n> ]
[ TIME_FORMAT '<string>' ]
[ TIMESTAMP_FORMAT '<string>' ]
[ TRIM_SPACE [ '<boolean>' ] ]
json_format_options
[ DATE_FORMAT '<string>' ]
[ EMPTY_AS_NULL [ '<boolean>' ] [, ...] ]
[ NULL_IF ( '<string>' [, ...] ) [, ...] ]
[ ON_ERROR { 'abort' | 'continue' } | 'skip_file' } ]
[ TIME_FORMAT '<string>' ]
[ TIMESTAMP_FORMAT '<string>' ]
[ TRIM_SPACE [ '<boolean>' ] ]
parquet_format_options
[ ON_ERROR { 'abort' | 'skip_file' } ]

Parameters

<table_name> String

The name of the table, which can also include the necessary qualifier if the table is not in the current context. Example: Samples.'samples.dremio.com'.'NYC-taxi-trips'


AT { REF[ERENCE] | BRANCH } <reference_name> String   Optional

(Applicable only to Arctic catalogs) Specifies the reference at which the table exists. When this parameter is omitted, the current reference is used.

  • REF or REFERENCE: Identifies the specific reference.
  • BRANCH: Identifies the specific branch.

'@<storage_location_name>[/<path>[/<filename>] ]' String

Specifies the file system location to load files from. The location provided must be a preconfigured Dremio Source such that only the storage location name and file path need to be provided.

  • The location path can be a directory or a specific file name.
  • If a specific file is provided then the COPY INTO operation works only on that file.
  • If instead a directory path is provided without a file name then the operation operates on all files in the directory including subdirectories recursively.

[ FILES ( '<file_name>' [ , ... ] ) | REGEX '<regex_pattern>' ] String   Optional

Specifies a list of files to be loaded in the FILES clause. The files must already have been staged in the storage location specified in the FROM clause. The statement will error when one or more file specified is not found or not accessible. <file_name> is of format [<directory_name>]/<file_name>.

Alternatively, specifies a regular expression pattern string in the REGEX clause, enclosed in single quotes, specifying the file names, paths, or both to match. Path is matched relative to the storage location. The expression must be in a Perl-compatible format.

note

For best performance, avoid applying patterns that filter on a large number of files.

  • You can use either FILES or REGEX, but not both.
  • The maximum number of file names that can be specified is 1000.
  • This clause is mutually exclusive with the [/<file_name>] option in the previous clause.

[ FILE_FORMAT 'csv' | 'json' | 'parquet' ] String   Optional

The format of the file or files to copy data from.

  • If not specified, the file format is determined based upon the file extension, which means that the file extension is required if FILE_FORMAT is not specified.
  • To load files that do not have an extension, FILE_FORMAT must be specified.
  • All files loaded in a COPY INTO operation must be of the same file format.

You can use uncompressed or compressed CSV and JSON files. Compressed files must be in the gzip format, using the .gz extension, or in the bzip2 format, using the .bz2 extension.


csv_format_options String

Options that describe the formats and other characteristics of the source CSV file or files.


json_format_options String

Options that describe the formats and other characteristics of the source JSON file or files.


parquet_options String

Options that describe the formats and other characteristics of the source PARQUET file or files.

note

Only the ON ERROR option is supported for Parquet source files.

CSV Format Options

DATE_FORMAT '<string>' String   Optional

String that defines the format of date values in the data files to be loaded. If a value is not specified, YYYY-MM-DD is used. See the section 'Date/Time Formatting' in Date/Time for more format elements.


EMPTY_AS_NULL [ '<boolean>' ] String   Optional

Boolean that specifies whether an empty string is considered a NULL field or an empty string. If a value is not specified, TRUE is used.


ESCAPE_CHAR '<escape_character>' String   Optional

A single character used as the escape character for the character specified by QUOTE_CHAR. The escape character provides a way to include the QUOTE_CHAR character inside a string literal by modifying the meaning of that character in the string. If a value is not specified, " is used.


EXTRACT_HEADER '<boolean>' String   Optional

Boolean that specifies if the first line in the CSV is a header. If a value is not specified, TRUE is used. If SKIP_LINES <n> is also specified and EXTRACT_HEADER is set to TRUE, the n+1 line in the file is considered to be the header.


FIELD_DELIMITER '<character>' String   Optional

The single character used to separate fields in the file. If a value is not specified, , is used.


NULL_IF ( '<string>' [, ...] ) String   Optional

Replace strings in the data load source with NULL.


ON_ERROR { 'abort' | 'continue' | 'skip_file' } String   Optional

Specifies what to do if an error is encountered during the loading process.

  • 'abort'
    Specifies that the COPY INTO operation should immediately abort if it encounters an error. This is the default option.

  • 'continue'
    Specifies that the COPY INTO operation should continue if it encounters an error.

    Every record that cannot be parsed in a CSV file is rejected; however, all records that can be parsed in the file are processed.

    When the copy operation is complete, the output displays the number of records loaded into the Iceberg table and the number of rejected records.

    If there is at least one rejected record, you can query the system table SYS.PROJECT.COPY_ERRORS_HISTORY to find out information about the job that ran the COPY INTO operation.

    To obtain information about rejected records in particular files, query the SYS.PROJECT.COPY_ERRORS_HISTORY system table to obtain the ID of the job that ran the COPY INTO operation. Then, use the copy_errors() function in a SELECT command, specifying the job ID and the name of the target table.

  • 'skip_file'
    Specifies that the COPY INTO operation should stop processing the input file at the first error it encounters.

note

The 'skip_file' option requires extra processing on the input files, regardless of the number of errors the input files contain. As a result, the 'skip_file' option is slower than the 'abort' and 'continue' options. Skipping large files due to a small number of errors can delay the COPY INTO operation.

The first potential error is registered in the SYS.PROJECT.COPY_ERRORS_HISTORY system table, which you can query to get information about the job that ran the COPY INTO operation.

To get information about rejected records in particular files, query the SYS.PROJECT.COPY_ERRORS_HISTORY system table to obtain the ID of the job that ran the COPY INTO operation. Then, use the copy_errors() function in a SELECT command, specifying the job ID and the name of the target table.

The 'skip_file' option differs from the 'continue' option as follows:

  • The 'skip_file' option does not insert any rows from an input file that contains an error, whereas the 'continue' option may insert correct rows.

  • The 'skip_file' option only registers the first error in an input file and stops processing, whereas the 'continue' option may register several errors in the same file.


QUOTE_CHAR '<character>' String   Optional

The single character used to quote field data within a record. The default is a double-quotation mark.


RECORD_DELIMITER '<character>' String   Optional

One or more characters that separate records in an input file. Accepts common escape sequences. If a value is not specified, \r\n is used.


SKIP_LINES <n> Integer   Optional

Number of lines to ignore at the beginning of each input file. If no value is given, no lines are skipped. Must be a non-negative integer. If SKIP_LINES <n> is specified and EXTRACT_HEADER is also set to TRUE, the n+1 line in the file is considered to be the header.


TIME_FORMAT '<string>' String   Optional

String that defines the format of time values in the data files to be loaded. If a value is not specified, HH24:MI:SS.FFF is used. See the section 'Date/Time Formatting' in Date/Time for more format elements.


TIMESTAMP_FORMAT '<string>' String   Optional

String that defines the format of timestamp values in the data files to be loaded. If a value is not specified, YYYY-MM-DD HH24:MI:SS.FFF is used. See the section 'Date/Time Formatting' in Date/Time for more format elements.


TRIM_SPACE [ '<boolean>' ] String   Optional

Boolean that specifies whether or not to remove leading and trailing white space from strings. The default is FALSE.

JSON Format Options

DATE_FORMAT '<string>' String   Optional

String that defines the format of date values in the data files to be loaded. If a value is not specified, YYYY-MM-DD is used. See the section 'Date/Time Formatting' in Date/Time for more format elements.


EMPTY_AS_NULL [ '<boolean>' ] String   Optional

Boolean that specifies whether an empty string is considered a NULL field or an empty string. If a value is not specified, TRUE is used.


NULL_IF ( '<string>' [, ...] ) String   Optional

Replace strings in the data load source with NULL.


ON_ERROR { 'abort' | 'continue' | 'skip_file' } String   Optional

Specifies what to do if an error is encountered during the loading process.

  • 'abort'
    Specifies that the COPY INTO operation should immediately abort if it encounters an error. This is the default option.

  • 'continue'
    Specifies that the COPY INTO operation should continue if it encounters an error.

    Every record that cannot be parsed in a CSV file is rejected; however, all records that can be parsed in the file are processed.

    When the copy operation is complete, the output displays the number of records loaded into the Iceberg table and the number of rejected records.

    If there is at least one rejected record, you can query the system table SYS.PROJECT.COPY_ERRORS_HISTORY to find out information about the job that ran the COPY INTO operation.

    To obtain information about rejected records in particular files, query the SYS.PROJECT.COPY_ERRORS_HISTORY system table to obtain the ID of the job that ran the COPY INTO operation. Then, use the copy_errors() function in a SELECT command, specifying the job ID and the name of the target table.

  • 'skip_file'
    Specifies that the COPY INTO operation should stop processing the input file at the first error it encounters.

    note

    The 'skip_file' option requires extra processing on the input files, regardless of the number of errors the input files contain. As a result, the 'skip_file' option is slower than the 'abort' and 'continue' options. Skipping large files due to a small number of errors can delay the COPY INTO operation.

    The first potential error is registered in the SYS.PROJECT.COPY_ERRORS_HISTORY system table, which you can query to get information about the job that ran the COPY INTO operation.

    To get information about rejected records in particular files, query the SYS.PROJECT.COPY_ERRORS_HISTORY system table to obtain the ID of the job that ran the COPY INTO operation. Then, use the copy_errors() function in a SELECT command, specifying the job ID and the name of the target table.

    The 'skip_file' option differs from the 'continue' option as follows:

    • The 'skip_file' option does not insert any rows from an input file that contains an error, whereas the 'continue' option may insert correct rows.

    • The 'skip_file' option only registers the first error in an input file and stops processing, whereas the 'continue' option may register several errors in the same file.


TIME_FORMAT '<string>' String   Optional

String that defines the format of time values in the data files to be loaded. If a value is not specified, HH24:MI:SS.FFF is used. See the section 'Date/Time Formatting' in Date/Time for more format elements.


TIMESTAMP_FORMAT '<string>' String   Optional

String that defines the format of timestamp values in the data files to be loaded. If a value is not specified, YYYY-MM-DD HH24:MI:SS.FFF is used. See the section 'Date/Time Formatting' in Date/Time for more format elements.


TRIM_SPACE [ '<boolean>' ] String   Optional

Boolean that specifies whether or not to remove leading and trailing white space from strings. The default is FALSE.

Parquet Format Options

ON_ERROR { 'abort' | 'skip_file' } String   Optional

Specifies what to do if an error is encountered during the loading process.

  • 'abort'
    Specifies that the COPY INTO operation should immediately abort if it encounters an error. This is the default option.

  • 'skip_file'
    Specifies that the COPY INTO operation should stop processing the input file at the first error it encounters.

    note

    The 'skip_file' option requires extra processing on the input files, regardless of the number of errors the input files contain. As a result, the 'skip_file' option is slower than the 'abort' and 'continue' options. Skipping large files due to a small number of errors can delay the COPY INTO operation.

    The first potential error is registered in the SYS.PROJECT.COPY_ERRORS_HISTORY system table, which you can query to get information about the job that ran the COPY INTO operation.

    To get information about rejected records in particular files, query the SYS.PROJECT.COPY_ERRORS_HISTORY system table to obtain the ID of the job that ran the COPY INTO operation. Then, use the copy_errors() function in a SELECT command, specifying the job ID and the name of the target table.

Output

The command returns this output:

Column nameDescription
Rows InsertedThe number of rows loaded from the source data files.

Examples

Copying a CSV file into a table and using an ON_ERROR clause
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
FILES ('fileName.csv')
(ON_ERROR 'continue')
Copy all files present in a folder
COPY INTO context.MyTable
FROM '@SOURCE/bucket/path/folder/'
FILE_FORMAT 'json'
Copy a specific file in a directory
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder/file1.json'
Copying files using a regular expression pattern string matching filename
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
REGEX '.*.csv'
Copying files using a regular expression pattern string matching filename and path
COPY INTO 'context.myTable'
FROM '@SOURCE/bucket/path/folder'
REGEX '^2020-11-1[2-3]/.*/'
FILE_FORMAT 'json'
Copying files using a regular expression pattern string in the specified file format
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
REGEX 'part.*'
FILE_FORMAT 'json'
Copying a list of files present on a Dremio source
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
FILES ('foo.csv', 'dir/bar.csv')
Copying files containing DATE and TIME types
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
FILES ('date_time_data_file1.csv', 'date_time_data_file2.csv')
(DATE_FORMAT 'DD-MM-YYYY', TIME_FORMAT 'HH24:MI:SS')
Copying CSV files and replacing certain entries with NULL, if encountered
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
FILES ('file1.csv', 'file2.csv', 'file3.csv')
(NULL_IF ('None', 'NA'))
Copying with the following transformations: Trimming any leading and trailing whitespaces for each entry, and treating empty values as NULL
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
(TRIM_SPACE 'true', EMPTY_AS_NULL 'true')
Copying CSV files into a table with RECORD_DELIMITER and FIELD DELIMITER clauses
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
FILE_FORMAT 'csv'
(RECORD_DELIMITER '\n', FIELD_DELIMITER '\t')
Copying a CSV file into a table with an ESCAPE_CHAR clause
COPY INTO context.myTable
FROM '@SOURCE/bucket/path/folder'
FILES ('fileName.csv')
(ESCAPE_CHAR '|')