3.0 Release Notes
High Performance Parallel Exports (CTAS)
High performance parallel exports allow users to create, reorganize, download, and export large (>1 million rows) or small datasets from any data source into any of the CTAS-supporting data sources within Dremio. When users employ the CTAS statement, Dremio stores the results of the query in one or more Parquet files (depending on the size of the source), and users have full control over the naming, destination path, and security rules. Like any dataset created within Dremio, the results of high performance parallel exports are cataloged and made searchable so users can easily find, share, and collaborate on them. See Tables for more information.
CTAS supports all filesystem source types (S3, ADLS, NAS, HDFS, MapR-FS, etc.). Filesystem permissions on the written table are applied through impersonation.
CTAS is enabled on a per-source basis during source creation in the Dremio UI. Enabling it allows WRITE access to the source via SQL-based commands, with permissions determined through impersonation.
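As a rough illustration of the statement shape, the sketch below assembles a CTAS statement as a string. The helper names (`quote_path`, `build_ctas`), the source name `s3_exports`, and the table paths are hypothetical placeholders; only the general `CREATE TABLE ... AS SELECT` form reflects the feature described above.

```python
def quote_path(parts):
    """Quote each path component with double quotes, doubling embedded quotes."""
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)


def build_ctas(target_path, select_sql):
    """Return a CREATE TABLE ... AS statement for a target path.

    target_path is a list such as ["source", "folder", "table"]. All names
    used in this example are illustrative, not real Dremio objects.
    """
    if len(target_path) < 2:
        raise ValueError("target must include a source and a table name")
    query = select_sql.strip().rstrip(";")
    return f"CREATE TABLE {quote_path(target_path)} AS {query}"


statement = build_ctas(
    ["s3_exports", "reports", "orders_2018"],  # hypothetical CTAS-enabled source
    'SELECT * FROM "sales"."orders" WHERE order_year = 2018;',
)
print(statement)
```

The statement would then be submitted through whichever SQL client is already in use (UI, ODBC, or JDBC).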
Enhanced ALTER PDS SQL Command
The ALTER PDS REFRESH METADATA command has been updated to support additional, optional behaviors for managing metadata. See Dataset SQL Statements for more information.
Enhanced Connector Framework
With this release, Dremio uses improved relational connectors for SQL Server and Postgres. This provides enhanced performance and extensive push-down capabilities. This logic also considers UDTs (User Defined Types) to be unknown types and skips them.
Wikis and Tags
The Wiki feature allows you to add rich content (text, images, etc…) for a Space (and its datasets) or a Source (and its datasets). The Tag feature allows you to create and assign tags to all datasets. See Data Curation for more information.
Dataset Catalog tab
The Dataset Catalog tab provides a single location for you to manage all dataset context and metadata. It allows you to create and manage Wiki content and Tags for all datasets. See Data Curation for more information.
Apache Ranger Support (Enterprise only)
Dremio now offers Ranger-based authorization for Hive. This authorization method checks the Ranger policy permissions for the end user logged into Dremio and then allows or disallows access as defined by the Ranger policy. See Ranger authorization in Hive for more information.
You enable the Hive authorization client when you add a new Hive Source to Dremio.
ODBC / JDBC Wire Encryption (Enterprise only)
Dremio now supports using TLS/SSL for encrypting communication between ODBC and JDBC clients and server (coordinator). You enable TLS/SSL on coordinators via the configuration file. The clients use connection properties to enable TLS/SSL. See Using Wire Encryption for more information.
Intra-cluster Wire Encryption (Enterprise only)
Dremio now supports using TLS/SSL for encrypting communication between nodes. You enable TLS/SSL on all coordinators and executors via the configuration file. See Using Wire Encryption for more information.
AWS S3 IAM Role-based Access
Dremio now supports IAM role-based access to S3 buckets. On top of using access key/secret, S3 sources can now use IAM roles from EC2 instance metadata for access.
Preview-only features are disabled by default.
The Gandiva feature supports efficient evaluation of arbitrary SQL expressions on Arrow buffers using runtime code generation in LLVM. It uses LLVM tools to generate and compile code that makes optimal use of underlying CPU architecture. By combining LLVM with Apache Arrow libraries, Gandiva can perform low-level operations on Arrow in-memory buffers that are highly optimized for specific runtime environments. See Gandiva-based Execution for more information.
- Improved resource utilization
- Faster, lower-cost operations of analytical workloads
LLVM tools are a set of modular compiler tools that deal with code generation. They are used to compile and execute arbitrary expressions efficiently (instead of interpreting them). In the Dremio context, this is useful for generating code at runtime for the two SQL operators that deal with arbitrary user expressions: Project and Filter.
To request access to this feature, please send an email to email@example.com
Workload Management (Preview and Enterprise only)
The Workload Management feature improves workload management via user-defined job queues. These queues are associated with different resource constraints and with flexible rules for assigning user jobs to them.
Workload Management is displayed in the Dremio UI on the Admin console, where the Queues and Rules sections allow you to manage your queues and rules.
To request access to this feature, please send an email to firstname.lastname@example.org
As of Dremio 3.0, when you issue queries against files or folders, the default behavior is to not auto-promote them to datasets. Prior to version 3.0, Dremio's default behavior for filesystem-based sources (HDFS, S3, NAS, etc.) was to auto-promote a file or folder to a dataset when you ran a query on it.
Multi-Role Nodes in Cluster Deployment
Configuring Dremio in C/E mode (coordinator and executor instances on the same node) is deprecated for cluster deployments. Multi-roles are only supported in single-node installations.
Starting with Dremio 3.0, any use of Hive scalar functions is deprecated.
Reflection refresh reattempts can cause duplication or incomplete datasets.
Fixed some situations where a reattempt of a reflection refresh could cause duplicated or incomplete datasets in rare cases.
Query reattempts can cause duplication or incomplete datasets.
Fixed some situations where a reattempt of a query run from the UI could cause duplicated or incomplete datasets in rare cases.
Clicking the 'Edit Original SQL' link gave an error when the requested dataset was renamed or moved after it was copied.
Fixed. Copying a dataset now starts with a clean history.
Incorrect comparison of CHAR values.
String literals in SQL queries are now treated as VARCHAR rather than CHAR types. This allows consistent behavior when dealing with string literals in CASE statements and equality filters.
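To see why the literal type matters, the sketch below models fixed-length CHAR semantics in plain Python. It is an illustration under stated assumptions, not Dremio's implementation: it assumes that a CASE over CHAR literals yields the widest CHAR type (blank-padding shorter values) and that comparisons are made character by character without trimming. The helper names are hypothetical.

```python
def as_char(value, length):
    """Model a fixed-length CHAR(n) value: blank-padded to n characters."""
    return value.ljust(length)


def case_result_char(branches):
    """Model CASE over CHAR literals: every branch is widened to the longest one."""
    width = max(len(b) for b in branches)
    return [as_char(b, width) for b in branches]


# Literals typed as CHAR: 'no' is widened to CHAR(3) and becomes 'no '.
char_values = case_result_char(["yes", "no"])
print(char_values)              # ['yes', 'no ']
print(char_values[1] == "no")   # False: the padded value no longer matches

# Literals typed as VARCHAR: no padding, so the comparison behaves as expected.
varchar_values = ["yes", "no"]
print(varchar_values[1] == "no")  # True
```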
Apache Parquet logs are written to a different location and not configurable.
Resolved by integrating the Parquet logs with the internal logging system. This results in all logs being centrally located.
When a reflection job is cancelled, the temporary directory created for materialization is not removed.
Resolved the cleanup process when cancelling reflections jobs.
Edit Original SQL button sometimes hangs or causes confusion regarding the dataset version.
The logic for the Edit Original SQL button has been fixed. If users have unsaved changes in their current query when they click this button, they will be prompted to save or abandon their changes.
The Preview button is obscured by the dataset history tooltip in certain situations.
Fixed the user interface.
The Never Refresh checkbox for reflections doesn't display as checked in the Community Edition UI.
The Never Refresh checkbox for reflections now works correctly.
Loading a canceled job gave a 'doesn't exist' error, which was not descriptive enough.
Fixed by providing a new error message: "Could not load results as the query was canceled".
Previewing JSON files with union types causes a NullPointerException.
Previewing JSON files with a schema that includes a union of different types (for example: string and integer) exposes an issue in the underlying Arrow UnionReader, causing a NullPointerException in Dremio.
Updating virtual dataset definitions via the REST API causes metadata issues.
Updating the SQL of a VDS would not correctly refresh the metadata of the VDS. For example, the list of fields would not update.
Hive queries occasionally fail when all partitions are pruned.
A valid query against a Hive table could cause multiple planner exceptions and the query to fail if all partitions are pruned during planning.
The Dremio Hive source setting that controls zerocopy is not taken into account when enabling or disabling ORC zerocopy.
This setting is now applied correctly.
Hive ORC transactional tables that have not been compacted will report an incorrect row count.
In particular, if the table has never been compacted, it will report 0 records, which results in sub-optimal query plans.
Running a preview query against an Oracle table might take longer than expected.
Fixed by pushing down the limit clause to the Oracle source.
Dremio is unable to connect to an Oracle source if the password contains special characters.
When saving the source, users would previously get an "Invalid Oracle URL specified" error message.
Index out-of-bound exceptions occur with Parquet files.
Sometimes an index out-of-bounds exception occurs when reading Parquet files that contain decimal columns and have missing columns.
The error message, "Self-suppression not permitted" occurs when establishing HDFS connection fails.
When the HDFS client queues requests and a connection fails to be established, all of the requests receive the same exception instance: "Self-suppression not permitted". Fixed the "Self-suppression not permitted" issue.
When running COUNT on a SQL Server dataset that is larger than 2,147,483,647 rows, the source returns an arithmetic overflow error.
Resolved by pushing down the COUNT() aggregate function as COUNT_BIG() in SQL Server.
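The threshold in this issue is the largest signed 32-bit integer: in SQL Server, COUNT returns an int while COUNT_BIG returns a bigint. The short check below works through the arithmetic; the row count used is a made-up value for illustration.

```python
INT32_MAX = 2**31 - 1   # range limit of SQL Server's int type
INT64_MAX = 2**63 - 1   # range limit of SQL Server's bigint type


def fits(value, max_value):
    """Return True if a non-negative count can be represented within max_value."""
    return 0 <= value <= max_value


row_count = 3_000_000_000  # hypothetical table size, larger than INT32_MAX

print(INT32_MAX)                   # 2147483647
print(fits(row_count, INT32_MAX))  # False: an int result would overflow
print(fits(row_count, INT64_MAX))  # True: a bigint result has room
```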
When retrieving a folder ID with REST API catalog/by-path, the returned ID sometimes utilized quotes incorrectly. Resolved by normalizing the ID during validation.
With SQL Server, if one of the source tables is very small and a query with a join is performed, the reflection does not substitute properly.
Fixed by improving the multi-join normalization operation.
If Dremio runs out of memory, an exception occurs in the PartitionedCollector.
The buffers already allocated are not released which causes a memory leak.
Resolved by implementing auto rollback and closing all buffers after running out of memory.
Upgrading fails if a dataset name contains the forward slash (/).
This occurred because Dremio used the forward slash to delineate between the dataset version string and the path.
Fixed the upgrade so that forward slashes in dataset names are no longer an issue.
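The ambiguity behind this issue can be shown with a small sketch. The key layout used here (`path/version`) and the function name are assumptions made for illustration; these notes do not describe the actual internal format or how the upgrade was changed.

```python
def parse_key(key):
    """Split a hypothetical 'path/version' key at the first forward slash."""
    path, _, version = key.partition("/")
    return path, version


# A dataset name without a slash parses as intended.
print(parse_key("space.orders/0123abc"))   # ('space.orders', '0123abc')

# A dataset named 'a/b' is cut in the wrong place: part of the name
# ends up in the version string.
print(parse_key("space.a/b/0123abc"))      # ('space.a', 'b/0123abc')
```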
With SQL Server sources, aggregate reflections containing UNIQUEIDENTIFIER columns fail with "The JDBC storage plugin failed while trying to setup the SQL query".
UNIQUEIDENTIFIER columns now resolve to Dremio VARBINARY types instead of VARCHAR types to resolve this error.
There is a CentOS6 issue due to incompatibility with the JNA library (4.4.0).
Resolved compatibility issue.
With MapR sources, aggregate reflections fail with the "Failure while reading sort spilling files" error.
Underlying issue resolved.
For a non-partitioned Hive table, an incorrect split key is generated. Resolved by showing the correct information in the exception message when an error occurs.
When MySQL returns large amounts of data in response to a query, the connection times out.
This is because there is a property 'net_wait_timeout' that defaults to 30 seconds, unless set by the JDBC connection.
Resolved by adding the ability to set net write timeouts on the MySQL JDBC connection.
In some circumstances, changing the setting for DREMIO_MAX_MEMORY_SIZE_MB causes Dremio to fail to start. Resolved the issue.
When refreshing table metadata, sometimes a reflection's table row information is incorrect.
Resolved by updating the accelerator schema.
Dremio's schema learning fails when doing dataset formatting previews (when converting files/folders to physical datasets).
If you submit a query using the JDBC driver and cancel it from the UI, the query will appear to have successfully completed with no warning or exception.
When starting Dremio (after upgrading to 3.0), most reflections enqueue a refresh job (typically, this happens only once).
This refresh occurs even if the reflection's refresh interval isn't due or the reflection has "never refresh" set. Thereafter, if refresh intervals were set, all reflections resume their usual refresh cycle.
The datetimeoffset data type in SQL Server incorrectly gets the COLLATE clause applied to it.
Set dremio.jdbc.mssql.push-collation.disable to true to use this field.
3.0.1 Release Notes
Improved Hive Transactional Table Performance
Dremio uses a vectorized reader for splits in Hive-partitioned ORC transaction tables which have no deltas. Splits that have deltas will continue to use the non-vectorized reader.
Encrypted Postgres Connections
Dremio supports encrypted connections to Postgres using SSL. To enable SSL, check the "Encrypt connection" box when creating the source. For further configuration, navigate to the Advanced Options tab.
Support for Using Mixed Types (including complex types) in SQL CASE Statements
Dremio supports complex sub-expressions in CASE statements. For example:
CASE WHEN t.x.y = 'me' THEN t.z ELSE 'no' END.
See the Dealing with Mixed Types section in Datasets for more information.
Streaming aggregation is not grouping properly within the Window Aggregate query.
This occurs when data is sorted on multiple columns and one of the sort columns is dropped in the plan. This results in dependent columns retaining the original sort order. This issue was resolved by removing the sort order on the dependent columns.
When running a query against an ADLS source, the error "DATA_READ ERROR: Error reading data from response stream in positioned read() for file" occurs.
This issue is resolved by upgrading the Azure data lake store SDK to version 2.3.2.
Unable to use complex functions on columns of union type with a complex subtype.
Dremio now supports ASSERT_STRUCT and ASSERT_LIST functions to handle complex subtype in a column of union type. For example:
Under certain circumstances, when the number of keys in Unpivots is more than 32, an IndexOutOfBoundsException failure may occur.
Fixed by accessing bit buffers directly and using getNullByteOffset and getNullBitOffset as the offsets to access validities and values.
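For readers unfamiliar with validity bitmaps, the sketch below shows the usual byte-offset and bit-offset arithmetic for an Arrow-style bitmap (one bit per slot, least-significant bit first). It assumes that getNullByteOffset and getNullBitOffset correspond to `index / 8` and `index % 8`; the Python function names are stand-ins, and this is not Dremio's code.

```python
def null_byte_offset(index):
    """Byte that holds the validity bit for slot `index`."""
    return index >> 3


def null_bit_offset(index):
    """Position of the validity bit inside that byte (LSB first)."""
    return index & 7


def set_valid(validity, index):
    """Mark slot `index` as non-null."""
    validity[null_byte_offset(index)] |= 1 << null_bit_offset(index)


def is_valid(validity, index):
    """Return True if slot `index` is marked non-null."""
    byte = validity[null_byte_offset(index)]
    return ((byte >> null_bit_offset(index)) & 1) == 1


# 40 slots need 5 bytes; slots past index 31 live in the fifth byte.
validity = bytearray(5)
for slot in (0, 33, 39):
    set_valid(validity, slot)

print(null_byte_offset(33), null_bit_offset(33))   # 4 1
print(is_valid(validity, 33), is_valid(validity, 32))  # True False
```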
A non-administrator (unauthorized user) sees an "Unexpected Error Occurred" error message while navigating to the Jobs page.
Fixed the UI behavior and subsequent error message for unauthorized users.
When viewing a folder in a file system source that has more than 1024 files, the "maxClauseCount is set to 1024" error is displayed.
This error occurs because Dremio has an internal limit for retrieving tags associated with files in a folder. Resolved by retrieving tags for 200 files maximum and displaying a notification, "Tags are only shown inline for the first 200 items.", when that maximum is reached.
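The capping behavior described above can be sketched as follows. The function name and file names are hypothetical; only the limit of 200 and the notification text come from the note.

```python
MAX_INLINE_TAGS = 200
NOTICE = "Tags are only shown inline for the first 200 items."


def items_for_tag_lookup(items, limit=MAX_INLINE_TAGS):
    """Return the items whose tags should be fetched, plus an optional notice."""
    if len(items) > limit:
        return items[:limit], NOTICE
    return list(items), None


files = [f"file_{i}.parquet" for i in range(1500)]  # hypothetical folder listing
selected, notice = items_for_tag_lookup(files)

print(len(selected))  # 200
print(notice)
```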
In the Dremio UI Wiki, tables do not display properly.
Fixed the issue so that the Markdown tables display properly.
If the Dremio app is killed within the first minute of startup, an "Unknown source INFORMATION_SCHEMA" error occurs.
This error happens when the internal index is out of sync with the internal store. Dremio now partially reindexes the uncommitted updates.
Unable to save the Dremio UI Wiki for source file system folders.
The Dremio UI Wiki is unavailable for file system folders.
When reading Hive tables in ORC format, heap memory runs out for small tables.
Resolved by setting default options on Hive through the hive-site.xml file.
After adding a JSON file to MongoDB, a SCHEMA LEARNING error occurs when running a query on it.
Fixed so that the reported schema of MongoDB is consistent with the record reader schema when there are complex references.
Joins following an aggregation using a min/max of a string column might cause query failures.
The query failure is usually manifested as an array out of bounds exception. In rare cases, the failure might cause the executor node itself to fail.
An IOException error occurs when starting up the Dremio web server with SSL.
When the web server is configured to use custom certificates, the truststore was previously optional. This behavior regressed in 3.0.0, where the truststore became required. The truststore is optional again in 3.0.1.
3.0.5 Release Notes
Amazon Elasticsearch Service support.
Dremio now supports querying Amazon Elasticsearch Service (version 5.x, 6.0, 6.2, and 6.3).
Improved query performance when using LIMIT on large queries.
Dremio now more efficiently cancels the rest of query execution once the number of rows required by the LIMIT clause has been returned.
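The general idea of stopping upstream work once a limit is satisfied can be illustrated with a lazy pipeline. This is an analogy in plain Python, not a description of Dremio's execution engine; the `scan` function and the row counts are made up for the example.

```python
from itertools import islice


def scan(total_rows, produced):
    """Lazily produce rows, recording each one that is actually generated."""
    for i in range(total_rows):
        produced.append(i)
        yield i


produced = []
# Equivalent in spirit to LIMIT 5 over a large scan: islice stops pulling
# rows after the fifth one, so the remaining rows are never generated.
result = list(islice(scan(1_000_000, produced), 5))

print(result)         # [0, 1, 2, 3, 4]
print(len(produced))  # 5
```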
Improved CPU resource utilization and balancing on execution nodes.
Execution logic has been improved to better handle longer running queries when the system load is not high.
Dremio doesn't work properly when one or more Parquet files in a directory have zero (0) records.
Dremio now works correctly when some files have zero (0) records and others contain records.
When running a CTAS command against a filesystem configured to use impersonation, the files created by Dremio executors are owned by the same user as the Dremio process, and not by the user who ran the query.
This issue is resolved by ensuring that during directory creation time, table and directory ownership are correct.
By default, the PostgreSQL JDBC driver caches the entire query results into memory.
This means that when doing a table scan on large tables, it is easy to run out of memory.
Resolved this issue by setting auto-commit to off when creating a connection to PostgreSQL so that the JDBC driver properly limits the amount of memory being used by the fetch size.
For Hive, ACID tables cannot be read when hive.exec.orc.zerocopy is enabled.
Resolved this issue by fixing an improper byte starting position in Hive when a slice covers two (2) or more zerocopy buffers.
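The kind of offset bookkeeping involved when a slice spans several buffers can be sketched as below. This is a generic illustration, not the Hive ORC reader: the function name and the sample buffers are hypothetical. The point is that the starting position is non-zero only in the first buffer the slice touches and must be zero in every later one.

```python
def read_slice(buffers, start, length):
    """Copy `length` bytes beginning at absolute offset `start` from a list of buffers."""
    out = bytearray()
    offset = 0  # absolute offset of the current buffer's first byte
    for buf in buffers:
        end = offset + len(buf)
        if start < end and len(out) < length:
            # First buffer touched: begin at start - offset. Later buffers: begin at 0.
            begin = max(start - offset, 0)
            take = min(len(buf) - begin, length - len(out))
            out += buf[begin:begin + take]
        offset = end
    if len(out) != length:
        raise ValueError("slice extends past the available buffers")
    return bytes(out)


buffers = [b"abcd", b"efgh", b"ijkl"]
print(read_slice(buffers, 2, 8))  # b'cdefghij' (spans all three buffers)
print(read_slice(buffers, 5, 2))  # b'fg' (contained in one buffer)
```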
Slow join performance when doing joins using decimal keys. Joins using decimal keys are now handled through our vectorized join operation, resulting in higher performance.
Excessive heap memory usage when working with large, complex queries.
In certain cases with large or complex queries, the query plan instructions that coordinators send to the executor nodes could result in excessive heap memory usage. This mechanism has been improved to be more heap efficient.
AssertionError: Relational expression rel# error when joining a VALUES table (on the left side) and a JDBC table.
When running a JOIN between a VALUES table (on the left side) and a JDBC table, planning will fail. This issue is now fixed.
Excessive planning time when working with Hive tables when the query has many threads.
This would happen in a reasonably large cluster (many cores) when working against Hive tables that have many partitions. Resolved with improved memory management logic across threads.
Query scans extra columns when there are window functions.
Planning logic has been updated to only scan the relevant columns.
Direct memory wouldn’t get cleaned up after completing a query on Hive ORC tables.
Memory allocation and clean-up logic has been updated to correctly handle this scenario.
When a window is resized, at some point double vertical scrollbars appear on the Users page.
At a certain height of the window, the scrollbars might cause page flickering.
Fixed the issue.
3.0.6 Release Notes
Gathering schema and table information from relational sources takes too long.
Fetching schema and table information is now more efficient and takes less time when adding new relational sources.
When viewing a folder in a non-filesystem source that has more than 1024 files, the "maxClauseCount is set to 1024" error is displayed.
This has been resolved for all affected sources.
Pushdowns into Oracle with identifier names longer than 30 characters would fail.
Queries would fail with error: “The JDBC storage plugin failed while trying setup the SQL query”. Dremio now rewrites aliases longer than 30 characters for Oracle to avoid errors when pushing queries to Oracle.
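One way such an alias rewrite could work is sketched below: aliases over the limit are truncated and given a short hash suffix so that distinct long aliases stay distinct. The scheme and the function name are assumptions for illustration; these notes do not describe how Dremio actually rewrites aliases.

```python
import hashlib

ORACLE_MAX_IDENTIFIER = 30  # identifier length limit referenced in the note above


def shorten_alias(alias, limit=ORACLE_MAX_IDENTIFIER):
    """Return `alias` unchanged if it fits, else a truncated form with a hash suffix."""
    if len(alias) <= limit:
        return alias
    digest = hashlib.sha1(alias.encode("utf-8")).hexdigest()[:8]
    # Keep a readable prefix, then '_' and 8 hex characters, for `limit` characters total.
    return alias[: limit - len(digest) - 1] + "_" + digest


long_alias = "total_revenue_per_customer_per_calendar_quarter"  # hypothetical alias
short = shorten_alias(long_alias)

print(short, len(short))
print(shorten_alias("order_id"))  # short aliases pass through unchanged
```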
Format previews did not work when a directory has 'hidden' files (files starting with an underscore in the file name).
Resolved by ignoring files whose names start with a period or an underscore when performing format previews.
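A minimal sketch of this kind of filter, assuming hidden files are identified purely by a leading period or underscore in the file name. The function name and the sample listing are hypothetical.

```python
def visible_files(names):
    """Drop files whose names begin with a period or an underscore."""
    return [n for n in names if not n.startswith((".", "_"))]


listing = ["_SUCCESS", ".part-0000.crc", "part-0000.parquet", "data_2018.csv"]
print(visible_files(listing))  # ['part-0000.parquet', 'data_2018.csv']
```

Note that an underscore elsewhere in the name, as in data_2018.csv, does not cause the file to be skipped.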