21.0.0 Release Notes (March 2022)
Breaking Changes
1. Partition Value Format Difference from Dremio 20.x
Partition format resulting from CTAS operations is different in Dremio 21.0.0 than in Dremio 20.x. CTAS creates partition folders named `prefix_partitionValue` and writes the column `dir0` with value = `partitionValue` in the parquet files. A parquet file will have the same value for the partition column in all row groups.
Partitioning is done the same way with or without unlimited splits, which are enabled by default in Dremio 21.0.0, but these tables are interpreted differently depending on whether unlimited splits is enabled. When disabled, the table is treated as partitioned because:
- There are intermediate directories between the root folder and the parquet files.
- All values of one or more columns in individual parquet files are the same.
- The partition value is equal to `value_in_parquet_file`.
Such columns are called implicit partition columns.
With unlimited splits enabled, Dremio doesn’t recognize implicit partition columns. A table is partitioned if there are intermediate directories and the partition values are equal to the directory names.
As an example, note the results of the same CTAS operation in Dremio 20.x versus Dremio 21.0.0: `create table $scratch.date_partition1 partition by (dir0) as select * from <data_source>`
| SQL | 20.x Result | 21.0.0 Result |
|---|---|---|
| `select dir0 from $scratch.date_partition1` | `date_col=2022-03-29` | `0_date_col_2022_03_29` |
In 20.x, `dir0` values are `date_col=yyyy-mm-dd` from parquet files. In Dremio 21.0.0, `dir0` values are directory names.
Under what conditions will this issue occur?
This issue will occur with CTAS on FileSystem datasets that have `dir0` columns, where the CTAS uses `partition by` on these `dir0` columns. Tables created this way will have different `dir0` column data from the source dataset because Dremio 21.0.0 uses directory names for values instead of values from parquet files.
Are there any workarounds?
- Since the new value is a variation of `number_old-value`, you can create a view to parse the new value and extract the old value (see the sketch after the note below).
- Recreate the CTAS statement and rename the partition column to avoid a conflict with `dir0` from both the data file and the directory name.
- Recreate existing datasets using Iceberg format, and note that CTAS also needs to use Iceberg format.
note:
Performance can be negatively impacted by the first two workarounds.
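As an illustration of the first workaround, here is a minimal sketch based on the example table above. The view name, space (`myspace`), and the `0_date_col_` prefix are assumptions for this example; adjust them for your data.

```sql
-- Minimal sketch: strip the assumed "0_date_col_" prefix from the new
-- dir0 value and restore the dashes, rebuilding the 20.x-style value
-- date_col=2022-03-29.
CREATE VDS myspace.date_partition1_compat AS
SELECT
  CONCAT('date_col=',
         REPLACE(SUBSTR(t.dir0, LENGTH('0_date_col_') + 1), '_', '-')) AS dir0_old,
  t.*
FROM $scratch.date_partition1 AS t
```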
2. PDFS is not supported for distributed storage in versions 21.0.0 and above.
Additionally, with this change, the Helm chart no longer supports using the “local” distributed storage option.
What’s New
Support for Apache Arrow Flight SQL
You can use Apache Arrow Flight SQL client-server protocol to develop client applications that access your data through Dremio. For more information, see Developing Arrow Flight SQL Client Applications for Dremio.
Common Sub-Expression Elimination
Dremio’s query engine has been enhanced to better handle repeating sub-expressions in a query. With this change, common sub-expressions are computed once per query, and the results are made available to each reference within the query. Previously, Dremio computed these sub-expressions each time a query referenced them, causing additional resource consumption.
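As a purely illustrative example (the table and column names are hypothetical), the sub-expression `UPPER(TRIM(name))` below appears three times but is now computed only once, with the result reused at each reference:

```sql
-- UPPER(TRIM(name)) is a common sub-expression: previously evaluated at
-- each of the three references, now evaluated once per query.
SELECT UPPER(TRIM(name)) AS clean_name
FROM customers
WHERE UPPER(TRIM(name)) LIKE 'A%'
ORDER BY UPPER(TRIM(name))
```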
Native Vectorized Copiers
Native copiers are now available with Dremio Enterprise and Community Edition for 2-byte, 4-byte, 6-byte, and conditional 6-byte selection vectors. This replaces the original Java-based field buffer copiers with a more efficient copy mechanism for primitive data types such as bigInt, bool, and string. Faster vectorized copiers allow for measurable overall performance improvements, such as more efficient system throughput, reduced CPU usage, and shorter query times.
This functionality is enabled by default for operators that copy memory.
Other Enhancements
- This release includes a number of UI changes and enhancements, as well as query performance improvements.
- Apache Arrow has been upgraded to 7.0.0. The upgrade fixes a number of general query and query performance issues.
- To improve performance, the default fraction of cores considered by Dremio executors during query execution planning has been increased from 0.7 to 0.75. This change may cause a slight increase in memory usage for some queries due to the increased parallelism.
- Unlimited splits for FileSystem and Hive sources are now enabled by default.
- The peak memory usage shown in the operator profile has been updated to show the maximum of memory reserved and memory used.
- The query engine has been enhanced to identify and eliminate identical sub-expressions within a query.
- Iceberg now supports metadata functions for inspecting a table’s history, snapshots, and manifests (see the example after this list).
- Logging has been improved, and a more meaningful error message is now provided when invalid characters are encountered in a password or PAT.
- The Amazon Elasticsearch Service source has been rebranded to Amazon OpenSearch Service.
- This release includes two new system tables, `sys."tables"` and `sys.views`, which contain metadata for tables and views in Dremio. To see table or view information, run `select * from sys."tables"` or `select * from sys.views`.
note:
The name of the table (`sys."tables"`) must be enclosed in quotes so that it is parsed as the table name instead of the reserved keyword `table`.
- PageHeaderWithOffset objects will be excluded from the heap when reading Dremio Parquet files. Instead, column indexes will be used to optimize performance and reduce heap usage when generating page headers and stats.
- Changes to roles (create, update, delete) are now captured in the audit log.
- Ownership can now be granted on all catalog items to a user or role using `GRANT OWNERSHIP ON <object> TO <user or role>`.
note:
If a role owns a view that queries data in a Hadoop source, and if the source has impersonation enabled, the query will fail because only users can be used to connect to impersonation-enabled sources.
- Improved type coercion by performing an implicit cast, where possible, when data types differ, allowing for better interoperation between different data types. Some examples include a union of types `numeric` and `varchar`, casting `varchar` to `date`, and a join of types `numeric` and `varchar`.
- Dremio now pushes down computation for extra hash join conditions.
- SQL Server and other ARP sources can now enable a flag to have boolean expressions expanded to numeric expressions when they do not support true boolean values.
- The query plan cache is now enabled by default.
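As an example of the Iceberg metadata functions mentioned in the list above, the following sketch inspects a table’s history, snapshots, and manifests. The table path is hypothetical, and the function names and signatures should be confirmed against the SQL reference for your version:

```sql
-- Hypothetical table path; one query per metadata view.
SELECT * FROM TABLE(table_history('mysource.mytable'));
SELECT * FROM TABLE(table_snapshot('mysource.mytable'));
SELECT * FROM TABLE(table_manifests('mysource.mytable'));
```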
Issues Fixed
- A query with `not in` was returning incorrect results if more than two values were in the predicate for certain Hadoop and Hive datasets.
- In environments with high memory usage, if an expression contained a large number of splits, it could eventually lead to a heap outage/out of memory exception.
- At times, the Jobs page became unresponsive when selecting the User filter. The list of users will now be limited to 50 names, and users can filter with the embedded search box to find a maximum of 1000 users.
- Previous Dremio versions allowed ACLs that used the username as the userid, which would result in invalid ACLs. In this release, such ACLs will be pruned and not displayed to users.
- Fixed an issue that was causing sockets to remain in a `CLOSE_WAIT` state while running metadata refresh on an ORC dataset. This resulted in `Too Many Open Files` errors, and the cluster had to be restarted to resolve the condition.
- Some complex join filters were getting dropped, resulting in incorrect query results.
- Fixed an issue with metadata refresh that could result in incorrect results or query exceptions due to an expected row count mismatch.
- Queries with `except` (LogicalMinus) were failing/not being handled correctly in the plan serializer.
- In previous versions of Dremio, for some relational sources that did not support the `boolean` type, using the `CAST` function to expand a boolean value to a boolean expression was resulting in an `Incorrect syntax near the keyword 'AS'` error.
- The links to details were not working for some of the jobs under the history » link for Reflections.
- In some cases, if a Parquet file in a Delta Lake table had many row groups, `count(*)` queries were failing due to a divide-by-zero exception.
- Fixed a column index issue in RelMetadata that was resulting in some queries on views failing with `VALIDATION ERROR: Using CONVERT_FROM(*, 'JSON')`.
- In the Query Visualizer Execution tab, Max Batches and Max Memory have been changed to Records Processed and Memory Processed.
- The same `SELECT` query, using the `IS_MEMBER()` function, was returning different results in different versions of Dremio.
- In cases involving multiple tables in joins along with filters, RDBMS query pushdown could result in queries that ambiguously reference columns, resulting in `invalid identifier` errors.
- In some cases, the idle timeout was being interpreted as milliseconds instead of seconds, leading to excessive cleanup of connections.
- In some queries, window expressions were not getting normalized after substitution, resulting in a `Cannot convert RexNode to equivalent Dremio expression` error.
- If every value in one column of a MongoDB physical dataset was an empty array, queries were failing with a `Schema change detected` error. To address this issue, Dremio properly eliminates columns that would result in a `NULL` data type when doing schema inference from the Mongo records.
- Running `select *` on some system tables was failing with the following error: `UNAVAILABLE: Channel shutdown invoked`
- When Parquet files contained too many row groups, Parquet metadata was using too much memory and causing outages on the Executor. To avoid this issue, Dremio limits reuse of the Parquet footer when Parquet files contain too many row groups.
- The setting for Nessie retries (`nessie.kvversionstore.max_retries`) has been removed. There is a new setting for the amount of time to allow for retries (`nessie.kvversionstore.commit_timeout_ms`). The new setting is in milliseconds.
- Queries that worked in previous versions of Dremio were failing with the following error: `Job was cancelled because the query went beyond system capacity during query planning. Please simplify the query`
- The `IS_MEMBER()` function was not working with internal roles, returning `false` when it should have been returning `true`.
- The `split_part` function was causing a memory allocation error when the first result was empty.
- Added support for pushing down DATE_ADD and DATE_SUB scalar functions to RDBMS sources.
21.1.0 Release Notes (April 2022)
What’s New
- You can now specify Iceberg as the format when creating a table with CTAS. For example: `create table hello store as (type=>'iceberg') as select * from "file.parquet"`
- In this release, many UI messages have been updated to provide information that is more accurate and more helpful.
- Logging has been improved, and a more meaningful error message is now provided when invalid characters are encountered in a password or PAT.
Issues Fixed
- If you were running 20.x and had unlimited splits/Iceberg support keys enabled, after the upgrade to 21.0.0 you may have seen the error “Failed to get iceberg metadata” when querying datasets. This issue occurred because of how metadata was stored in Iceberg prior to the upgrade.
- The `is_member` SQL function was failing with `UnsupportedOperationException` when concatenating with a table column.
- When viewing the Execution profile for a job that had multiple query attempts, a `Profile Fragment is Empty` error was being displayed.
- Max Peak Memory and Phase 00 memory were displaying different memory usage for the same job profile.
- When viewing job details from the Jobs page, the status of some jobs was incorrect in the case of multiple query attempts.
- If unlimited splits were enabled in 20.x and reflections had been created on existing datasets, users may have seen reflection jobs failing with the following error after the upgrade to 21.0.0: `Bad Request (HTTP/400): Unknown type ICEBERG_METADATA_POINTER`
21.1.1 Release Notes (April 2022)
Issues Fixed
- For some deployments, the upgrade to 21.1.0 was taking longer than expected.
- An unknown error was being generated when attempting to remove a reflection from the Acceleration dialog and saving the change, and the error would continue to be displayed.
- In this release, json-smart was upgraded to version 2.4.8 to address CVE-2021-27568.
21.1.2 Release Notes (Enterprise Edition Only, May 2022)
Issues Fixed
- For some deployments, the upgrade to 21.1.1 was taking longer than expected.
21.2.0 Release Notes (Enterprise Edition Only, May 2022)
Enhancements
- This release includes a new argument for the `dremio-admin clean` CLI command to purge dataset version entries that are not linked to existing jobs. See Clean Metadata for more information.
- The `-j` argument of the `dremio-admin clean` CLI command has been extended to purge temporary dataset versions associated with deleted jobs. See Clean Metadata for more information.
- New commands are available for the `ALTER` keyword. By using the `ALTER FOLDER` or `ALTER SPACE` command, you can now set reflection refresh routing at the folder or space level (see the sketch after this list).
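The following is a minimal sketch of the new routing commands. It assumes the `ROUTE REFLECTIONS TO QUEUE` clause used for table-level routing applies here as well, and the folder, space, and queue names are hypothetical:

```sql
-- Route reflection refresh jobs for a folder or space to a named queue.
-- Object and queue names are assumptions for this example.
ALTER FOLDER mysource.myfolder ROUTE REFLECTIONS TO QUEUE reflection_queue;
ALTER SPACE myspace ROUTE REFLECTIONS TO QUEUE reflection_queue;
```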
Issues Fixed
- Updated the Postgres JDBC driver from version 42.2.18 to version 42.3.4 to address CVE-2022-21724.
- Updated WildFly OpenSSL to 1.1.3.Final to address CVE-2020-25644.
- In this release, json-smart was upgraded to version 2.4.8 to address CVE-2021-27568.
- Partition expressions were not pushed down when there was a type mismatch in a comparison, resulting in slow queries compared to prior Dremio versions.
- Fixed an issue with external LDAP group name case sensitivity, which was preventing users from accessing Dremio resources to which they had been given access via their group/role membership.
- Some IdPs were missing the `expires_in` field in the /token endpoint response. Dremio will fall back to the `exp` claim in the JWT. If this claim is missing from the JWT, the default expiration timeout will be set to 3600 seconds.
- When a `CASE` was used in a `WHERE` filter with an `AND` or an `OR`, it would be incorrectly wrapped in a `CAST`, resulting in the following error: `DATA_READ ERROR: Source 'sqlGrip' returned error 'Incorrect syntax near the keyword 'AS'.'`
- For some deployments, the upgrade to 21.1.0 or 21.1.1 was taking longer than expected.
- An unknown error was being generated when attempting to remove a reflection from the Acceleration dialog and saving the change, and the error would continue to be displayed.
- Dremio was generating a NullPointerException when performing a metadata refresh on a Delta Lake source if there was no checkpoint file.
- A `NULL` constant in a reflection definition was causing a type mismatch while expanding the materialization.
- When using Postgres as the data source, expressions written to perform subtraction between doubles and integers, or subtraction between floats and integers, would incorrectly perform an addition instead of the subtraction.
- Fixed an issue that was causing the following error when trying to open a view in the Dataset page: `Some virtual datasets are out of date and need to be manually updated.`
- When viewing job details, from the Jobs page or the Run link in the SQL Runner, the status of some jobs was incorrect in the case of multiple query attempts.
- After enabling Iceberg, files with `:` in the path or name were failing with a `Relative path in absolute URI` error.
- Some queries were taking longer than expected because Dremio was reading a `STRUCT` column when only a single nested field needed to be read.
- Running `ALTER PDS` to refresh metadata on a Hive source was resulting in the following error: `PLAN ERROR: NullPointerException`
Known Issues
note:
The following known issues apply to all 21.x releases.
- Following the upgrade to 21.x, values for `grantee` and `object` in `sys.privileges` may initially be set to `null`. This issue will resolve itself after metadata is refreshed automatically. To resolve it immediately, run the following: `alter table sys.privileges refresh metadata`
- If unlimited splits are enabled, performance can be negatively impacted if datasets contain parquet files with many row groups that are small in size. If this is the case for most parquet datasets, you can set the `exec.parquet.split-size` support key to 128 MB or smaller (see the sketch after this list).
- CTAS and reflections that use interval data types are not supported.
- If multiple users are trying to promote the same dataset concurrently, a `CONCURRENT_MODIFICATION_ERROR: Metadata refresh failed` error is displayed, even though the promotion is successful. Additionally, on the Jobs page, concurrent metadata queries may show up as failed, even though the metadata is in place.
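For the row-group issue above, here is a minimal sketch of lowering the split size, assuming the support key can be set through SQL with `ALTER SYSTEM` and takes a value in bytes (128 MB = 134217728):

```sql
-- Assumption: the key accepts a byte value and is settable via SQL.
ALTER SYSTEM SET "exec.parquet.split-size" = 134217728
```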