20.0.0 Release Notes (December 2021)

Breaking Changes

A new logback.xml file is included as part of Dremio 20.0’s new structured logging functionality. A logback.xml file ships with every Dremio installation and upgrade package and is typically skipped during installation. However, with Dremio 20.0 you must overwrite your original logback.xml file with the file provided in the installer. If you do not use the new file provided with the upgrade, Audit Logging will not work and queries.json will remain empty.

What’s New

Audit Logging

For organizations subject to compliance and regulation where auditing is regularly required, Dremio now offers full audit logging. All user activities performed within Dremio are tracked and traceable via the audit.*.json log files. Each time a user performs an altering action within Dremio, such as creating an object or running a query, the audit log captures the user’s ID and username, object(s) affected, action performed and event type, SQL statements used, and more.

By default, audit logging is enabled, and the audit logs are stored in the same location as all other log files.

Aggregation Spilling in All Cases (Preview)

Previously, Dremio spilled to disk when performing aggregation operations, with two exceptions: 1) when calculating the approximate count distinct of a column and 2) when a minimum or maximum was applied to a string column. In those two cases, a query that processed more data than the system’s available memory could handle would fail for lack of memory.

These calculations, min/max on string column (generally available) and NDV() (preview), have been moved to the vectorized hash aggregation spill operator. Now, in the event of a query requiring more memory than is presently available in the system, the operator containing these calculations will spill data to disk as needed, thus allowing the query to continue processing and ultimately complete.

To use the NDV() function with the vectorized hash aggregation spill operator, enable the support key: exec.operator.vectorized_spill.ndv.
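
The spilling behavior described above can be illustrated with a minimal sketch. This is not Dremio's vectorized hash aggregation operator; it is a toy group-by minimum over string values that writes its partial results to disk whenever a memory budget is exceeded. The function names and the budget parameter (counted in groups rather than bytes) are assumptions made for the example.

```python
import json
import os
import tempfile


def _spill(partials):
    """Write the in-memory partial aggregates to a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "w") as f:
        for key, value in partials.items():
            f.write(json.dumps([key, value]) + "\n")
    return path


def spilling_min_by_key(rows, max_groups_in_memory):
    """Compute min(value) per key, spilling partial results to disk.

    rows: iterable of (key, value) string pairs.
    max_groups_in_memory: hypothetical memory budget, counted in groups.
    Returns (result dict, number of spills).
    """
    partials = {}     # key -> smallest value seen since the last spill
    spill_files = []  # paths of spilled partial aggregates

    for key, value in rows:
        if key in partials:
            if value < partials[key]:
                partials[key] = value
        else:
            if len(partials) >= max_groups_in_memory:
                # Budget exhausted: spill instead of failing the query.
                spill_files.append(_spill(partials))
                partials = {}
            partials[key] = value

    # Taking a minimum of partial minimums yields the overall minimum.
    # For brevity the merge happens in memory; a real operator would
    # merge one partition at a time to stay within its budget.
    result = dict(partials)
    for path in spill_files:
        with open(path) as f:
            for line in f:
                key, value = json.loads(line)
                if key not in result or value < result[key]:
                    result[key] = value
        os.remove(path)
    return result, len(spill_files)
```

With a budget of two groups, an input containing three distinct keys triggers one spill and still produces the correct minimum for every key.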

Power BI SSO Support

Support now exists for single sign-on (SSO) using an organization’s Power BI credentials with Azure Active Directory (AAD) as the identity provider (IdP). As part of this functionality, AAD gives the Dremio service a JSON Web Token (JWT) at the end of the Azure AD OAuth flow, after which Dremio verifies the token and authorizes a user session until the token’s expiration.
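
To illustrate how a session can be tied to a token's expiration, the sketch below reads the standard exp claim from a JWT payload and compares it to the current time. It is not Dremio's implementation, and it deliberately skips signature verification, which a real service must perform before trusting any claim. The function names are assumptions made for the example.

```python
import base64
import json
import time


def decode_jwt_payload(token):
    """Return the claims of a JWT WITHOUT verifying its signature.

    Illustration only: never trust unverified claims in production.
    """
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    # base64url segments omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def session_is_valid(token, now=None):
    """True while the current time is before the token's exp claim."""
    claims = decode_jwt_payload(token)
    now = time.time() if now is None else now
    return now < claims["exp"]
```

The exp claim is a number of seconds since the Unix epoch, so the comparison is a plain numeric check.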

Ranger Row Filtering & Column Masking

For Hive sources with Apache Ranger authorization configured, Dremio now offers full support for external column masking and row filtering via Ranger security policies. This functionality offers finer-grained control than the whole-table/view access controls, local row permissions, and column masking in queries offered historically. Using the Ranger external security service, Dremio now enforces external policies at query runtime.

The following filtering/masking options are supported:

  • Row Filtering
    • Valid WHERE clauses on the table
  • Column Masking
    • Redact - Replaces all alphabetic characters with x and all numeric characters with n.
    • Partial mask: show last 4 - Displays only the last four characters of the column's full value.
    • Partial mask: show first 4 - Displays only the first four characters of the column's full value.
    • Hash - Replaces all characters with a hash of the entire cell's value.
    • Nullify - Replaces all characters in the cell with a NULL value.
    • Unmasked (retain original value) - No masking is applied to the cell.
    • Date: show only year - Displays the year portion of a date string, defaulting the month and day to 01/01.
    • Custom - Specifies a custom column masked value or valid Dremio expression. Custom masking may not use Hive UDFs.
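
The masking options listed above can be sketched as small functions. These are illustrations of the described behavior, not the Ranger or Dremio implementations. Two details are assumptions: the partial masks hide the remaining characters using the same rule as Redact, and Hash uses SHA-256.

```python
import hashlib
import re
from datetime import date


def redact(value):
    # Letters become "x" first, then digits become "n"; other characters stay.
    return re.sub(r"[0-9]", "n", re.sub(r"[A-Za-z]", "x", value))


def show_last_4(value):
    # Assumption: the hidden prefix is redacted rather than dropped.
    return redact(value[:-4]) + value[-4:]


def show_first_4(value):
    # Assumption: the hidden suffix is redacted rather than dropped.
    return value[:4] + redact(value[4:])


def hash_value(value):
    # Assumption: SHA-256; the source does not name the hash algorithm.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def nullify(value):
    return None


def date_show_only_year(iso_date):
    # Keep the year and default the month and day to 01/01.
    year = date.fromisoformat(iso_date).year
    return date(year, 1, 1).isoformat()
```

For example, show_last_4("1234-5678") returns "nnnn-5678" and date_show_only_year("2021-12-17") returns "2021-01-01".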

Microsoft Azure Synapse Analytics Support

An ARP connector is now available on Dremio that allows for integration with Azure Synapse Analytics dedicated SQL pools. This option is available for immediate use by adding a new External Source from the Dremio interface.

Logback Updated

Logback was updated to v1.2.9 to mitigate CVE-2021-44228. The new version of the library disables certain JNDI features known to cause issues with log4j 2.x. Dremio is not vulnerable, because logback configurations are inaccessible externally and Dremio does not use JNDI/JDBC features, so this update was made as a general security best practice.

Other Enhancements

  • As of v20.0, Dremio supports JDK 11 for on-premises installations. YARN and AWSE deployments are not supported with JDK 11. Docker images will be available for both JDK 8 and 11.
  • When deleting a user from Dremio, the username or email address associated with the record will display in the confirmation message.
  • When reading data from MongoDB, users may now set the batch size for reading data via the Sample Size source setting. Enter a custom value to indicate the number of documents Dremio must sample to determine the schema. Users may also specify whether the sample should be taken from the beginning or the end of the collection.
  • Dremio users may now create nested roles, or child roles assigned to a parent role. These nested roles inherit the privileges set at the parent level in addition to those granted specifically to the nested role. This allows for even more fine-grained access management for users based on role type. Currently, this may only be done via the SQL editor using the GRANT ROLE TO ROLE command.
  • For organizations using ADLS v2 sources, Dremio now supports adding whitelisted containers using AAD credentials without the need for Azure role-based access control (IAM role). Only permissions to access the container (read and write) must be set. From the source’s Settings dialog, under the Advanced Options tab, users may specify a directory inside a container using AAD credentials; subdirectories of that path may then be accessed with read-only or read/write permissions (read/write must be granted at the container level at minimum, or also on the end directory, to add sources). The path must follow the format container_name/dir0/.../dir_name.
  • Dremio now offers an expression splitting cache, which avoids repeating splitting work for the same expression. This separates the actual data from the instructions for handling these splits, with the main benefit being a significant reduction in bandwidth. This cache may be enabled or disabled using the exec.expression.splits_cache.enabled support key. By default, this functionality is enabled for all organizations that upgrade to 20.0.
  • A new column is available on the Job Profile page under the Phase section, which now allows you to see peak memory consumed by incoming buffers.
  • Added a new environment variable to the dremio-env file (DREMIO_GC_LOG_TO_CONSOLE="no") to configure whether garbage collection (GC) messages are sent to the console or to a log file. If set to "yes", the DREMIO_LOG_DIR environment variable is ignored and GC logs are sent only to the console. If set to "no", GC logs are instead sent to the log file.
  • Updated Dremio’s supported version of the Azure.Storage.Common library to v12.14.1, at the recommendation of Microsoft. Organizations using older versions of Azure storage libraries occasionally encountered data corruption issues, which are addressed with the newer SDK version.
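
The nested-role inheritance described in the list above can be modeled as a walk up the role hierarchy. The sketch below is a conceptual model only, not how Dremio stores or resolves roles; the role names, privilege names, and data structures are hypothetical.

```python
def effective_privileges(role, parents, direct_privileges):
    """Collect a role's own privileges plus those inherited from its ancestors.

    parents: maps a nested role to the list of roles it was granted to.
    direct_privileges: maps a role to the set of privileges granted to it.
    """
    seen = set()
    privileges = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # skip roles already visited (diamonds or cycles)
        seen.add(current)
        privileges |= direct_privileges.get(current, set())
        stack.extend(parents.get(current, ()))
    return privileges


# Hypothetical hierarchy: analyst_emea is nested under analyst.
parents = {"analyst_emea": ["analyst"]}
direct_privileges = {"analyst": {"SELECT"}, "analyst_emea": {"ALTER"}}
```

Here effective_privileges("analyst_emea", parents, direct_privileges) yields both SELECT and ALTER, while the parent role analyst keeps only SELECT.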

Deprecations

Mixed Types Support Key Disabled

In v18.0, support for mixed data types was deprecated. However, the support key to continue using mixed types was left active so that users could prepare more fully for this transition. As of Dremio 20.0, the support key for mixed data types is disabled and may no longer be used from the Support Keys page.

Fixed Issues

Users attempting to obtain Oracle row counts noticed a significant delay.
This issue has been addressed so that the Oracle RDBMS source will now use table statistics to determine the row count of a table, provided this information is present and not stale. If this fails, then Dremio will revert to the slower COUNT(*) query.
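
The fallback order described above can be sketched as follows. This is a simplified model, not Dremio's code: the two callables and the shape of the statistics record (a num_rows value and a stale flag) are assumptions made for the example.

```python
def table_row_count(fetch_stats, run_count_query):
    """Prefer table statistics; fall back to a COUNT(*) query.

    fetch_stats: hypothetical callable returning a dict such as
        {"num_rows": 1000, "stale": False}, or None if no statistics exist.
    run_count_query: hypothetical callable that runs the slower COUNT(*).
    Returns (row_count, source) where source names the path taken.
    """
    try:
        stats = fetch_stats()
    except Exception:
        stats = None  # treat a failed lookup like missing statistics

    usable = (
        stats is not None
        and stats.get("num_rows") is not None
        and not stats.get("stale", True)  # unknown freshness counts as stale
    )
    if usable:
        return stats["num_rows"], "statistics"
    return run_count_query(), "count"
```

The count query runs only when statistics are missing, stale, or cannot be fetched, so the common case avoids a full scan.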

Users encountered error messages with MongoDB and Elasticsearch plugins due to nodes being unable to copy.
This issue has been addressed so that users may now copy nodes without triggering error messages.

Users attempted to run queries with a join clause on an Oracle datasource, but JDBC read them as individual queries for each table despite the clause.
This issue has been addressed by pushing down TO_DATE(timestamp) and TO_CHAR(numeric, formatStr) for RDBMS sources.

When attempting to query the sys.privileges table with large catalogs, users encountered an error about Dremio being unable to get the profile for the job.
This issue has been addressed so that users may now successfully query the sys.privileges table.

For customers using PostgreSQL, users encountered the error ERROR: collations are not supported by type "char" when selecting columns of the CHAR data type.
This issue has been addressed so that when selecting columns of the CHAR data type with PostgreSQL, users will no longer receive an error about unsupported collations.

When querying Oracle, customers received an error stating Invalid row type due to an inability to detect the data type.
This issue has been addressed so that retrieving Oracle ROWID columns no longer triggers an error and instead properly returns them as VARCHAR.

Dremio would return the error DATA_READ ERROR: Failure while attempting to read from database when a query was submitted with unsigned integer types.
This issue has been addressed so that MySQL unsigned integer types are now mapped as bigint to allow for the full range of possible values.
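
A short range check shows why a wider signed type is needed. The sketch below uses a 32-bit unsigned integer as the example; the helper name is an assumption, and the code illustrates the arithmetic only, not Dremio's type mapping logic.

```python
UINT32_MAX = 2**32 - 1  # largest value of a 32-bit unsigned integer


def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement integer."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


# The largest 32-bit unsigned value overflows a signed 32-bit integer
# but fits comfortably in a signed 64-bit (bigint) value.
overflows_int32 = not fits_signed(UINT32_MAX, 32)
fits_int64 = fits_signed(UINT32_MAX, 64)
```

Both overflows_int32 and fits_int64 evaluate to True, which is the situation the bigint mapping resolves for such columns.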

Customers couldn’t query min/max variable length fields on datasets due to query failure. These failures were caused by insufficient memory due to the group by clauses being unable to spill.
This issue has been addressed by adding variable length fields to the vectorized hash aggregator operator, which allows spills.

Users encountered issues with splits in aggregates when encountering an expression blocked by agg-join pushdowns.
This issue has been addressed by adding a normalizer rule that better matches aggregate reflections against queries grouped by expressions.

User queries encountered Gandiva exceptions indicating that Dremio “could not allocate memory for output string.”
This issue has been addressed by fixing an unexpected behavior within the SPLIT_PART function.

Users encountered issues when OR and IN clauses that used expressions were converted.
This issue has been addressed by adding support to handle cases of converting OR/IN clauses with expressions.

Organizations using JSON sources encountered errors when a field was NULL.
This has been addressed by not projecting any fields in Dremio with a NULL value.

Organizations with 1000+ users encountered noticeable load delays when attempting to use the user filter drop-down from the Jobs screen.
This has been addressed by optimizing the drop-down so that users load rapidly without performance issues.