2.1 Release Notes
Enhancements
Acceleration
Approximate count distinct acceleration
Dremio now supports accelerating count distinct queries using an approximation algorithm, HyperLogLog (HLL). This provides a faster and more memory-efficient way of computing distinct counts, which is especially useful in high-cardinality scenarios with very large datasets. Internally, this capability uses Dremio’s NDV() function.
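To illustrate why an HLL-based count is memory efficient, here is a minimal Python sketch of the general HyperLogLog technique. It is not Dremio's NDV() implementation: the class name, the SHA-1 hash, and the register count are assumptions chosen for illustration only.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (hypothetical helper)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value) -> None:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash of the value
        idx = h >> (64 - self.p)                     # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")   # duplicates do not change the estimate
    print(round(hll.count()))
```

The sketch keeps only 4,096 small integers regardless of how many rows are scanned, trading exactness for a relative error of a few percent.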
Ability to define “Selected Measures” for aggregation reflections
Dremio now supports the ability to set/unset measures per field when setting up aggregation reflections. Previously, Dremio would include COUNT, AVG, SUM, MIN and MAX as set measures by default for any field. You can now set/unset these (including Approximate count distinct) separately per field, with the COUNT and SUM measures set by default.
External reflection support for RDBMS data sources
External reflections are now supported for all Dremio sources, including RDBMSs.
Datasets can now be configured to have associated reflections never expire
Previously, you needed to set a really high expiration value to avoid reflections expiring.
A “Never Expire” option was added to make this more explicit.
Web Application and APIs
Space creation is now admin only by default
Added a new support setting, ui.space.allow-manage, that overrides the default, allowing non-admin users to manage spaces.
REST API-based security management
Improved the Dremio Catalog API to allow managing security on all endpoints.
Improved source form UI experience
Improved workflow in the UI to streamline defining new data sources. Added logically separated sections that are intended to make the experience of adding new data sources easy.
Allow resetting of advanced options in the admin UI
Added ability to reset advanced options to their defaults in the UI.
Allow disabling of file uploads
Added support to disable the file upload functionality. This allows admins to monitor space usage and enforce quotas around it.
Job Profiles now show physical bytes read
Job Profiles now show physical bytes read instead of Arrow in-memory representation size.
Cmd+click and middle button click on New Query opens query editor in a new browser tab
Fixed an issue where “cmd+click”, which is supposed to open a new query screen in a new window, was incorrectly opening it in the same window.
Coordination and Metadata
Query Planner Changes
Depending on the query, performance may be affected in two main ways:
- Correlated queries (subquery comparison to a top query): Dremio may now do fewer joins and table scans, which should improve performance.
- Join optimizations: Dremio now groups multiple joins together and orders them in the most efficient way, allowing us to capture more use cases. Please note that the order of joins in your plan may be different. In many cases this will improve performance, but it could result in less-than-optimal performance for some edge cases.
New Dremio YARN Autobundler: The YARN bundle is now auto-generated at runtime.
- The installation tarball is now half the size of what it used to be and therefore it should be quicker to download.
- Any file added to Dremio’s Classpath will automatically be copied to YARN now.
Allow experimental use of transitive join predicate optimization
A filter on a join key is now pushed down to both sides of the join, minimizing the records read before the join is performed. This can be enabled with the support key planner.experimental.transitivejoin.
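The idea behind a transitive join predicate can be sketched in a few lines of Python. This is only a conceptual model, not Dremio's planner: the function names and the in-memory lists of dicts are hypothetical. Because the join condition equates the key on both sides, a filter written against one side can be applied to the other side before joining.

```python
def hash_join(left, right, key):
    """Inner-join two lists of dicts on `key` using a hash index on the right side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def filtered_join(left, right, key, wanted, transitive):
    """Join with a filter `left[key] == wanted`; optionally mirror it to the right."""
    left_in = [row for row in left if row[key] == wanted]
    # Since left[key] == right[key] after the join, the same predicate is
    # safe to apply to the right input as well.
    right_in = [row for row in right if row[key] == wanted] if transitive else right
    rows_read = len(left_in) + len(right_in)
    return hash_join(left_in, right_in, key), rows_read


if __name__ == "__main__":
    orders = [{"cust_id": i % 3, "order": i} for i in range(6)]
    customers = [{"cust_id": i, "name": f"c{i}"} for i in range(3)]
    plain, plain_read = filtered_join(orders, customers, "cust_id", 1, transitive=False)
    pushed, pushed_read = filtered_join(orders, customers, "cust_id", 1, transitive=True)
    print(plain == pushed, plain_read, pushed_read)
```

Both variants return the same joined rows; the transitive variant simply feeds fewer records into the join.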
Source Adapters
Elasticsearch 6
Support for Elasticsearch version 6, which is currently the latest version.
Hive ORC Performance
Improved performance and memory characteristics when working with ORC files that reside in Hive sources. Dremio now also better handles predicate pushdowns.
Support for SSL encryption when connecting to Oracle
Added support for connecting to Oracle over TLS using the JDBC thin driver. This is certified to work with Oracle Database 12c Release 1 (12.1) and 11.x.0.x. It is not certified to work with older, unsupported database releases such as 10.2.x, 10.1.x, 9.2.x, and 9.0.1.x.
Added support for S3 buckets in AWS GovCloud
Added support for S3 buckets in AWS GovCloud as a source and as distributed storage for reflections.
Compression support for Elasticsearch traffic
Added support for compressing Elasticsearch responses to minimize network traffic.
Improved stats estimation for Hive tables
Due to frequent problems with bad or outdated table stats on Hive tables, the default is now to rely on Dremio’s own table statistics estimates rather than Hive’s.
Execution
Improved small file performance on S3/ADLS
Enhanced performance for small files on high latency stores like S3 & ADLS.
Simplified memory configuration via DREMIO_MAX_MEMORY_SIZE_MB in dremio-env
Setting this configuration allows Dremio to automatically determine the best allocation between HEAP and DIRECT memory depending on the node type.
Bug Fixes
Coordination and Metadata
Cannot re-add a previously deleted user
A previously created user that is subsequently deleted can now be re-created.
Join optimization not working as expected in certain scenarios
Join optimization has been improved in scenarios where projects or filters exist between joins.
Filters specified as a part of <t> JOIN <t> ON [...] clause would not be evaluated before join operation
Fixed scenario where a filter condition on an inner join was not pushed below the join.
Hive tables with a large number of partitions would take up too much space in Dremio’s metastore
Improved metadata handling for Hive tables with a large number of partitions.
ODBC slowness with Tableau in some situations
Improved performance of schema retrieval for Tableau users.
Some changes to source configuration cause contained datasets to lose formatting
Resolved an issue where an invalid source config caused existing datasets to lose formatting.
Incorrect join output record estimation causes query to be over-parallelized
Resolved issues with join planning that caused queries to run longer and with excessive resource requirements.
Date/time functions are not pushed down as expected
Improved handling of CURRENT_DATE, CURRENT_TIMESTAMP, and CURRENT_TIME to ensure optimal pushdowns of date/time functions.
Acceleration
Reflection job completion is not visible in reflection history
After a reflection refresh finishes successfully it now shows up in the reflection’s job history with the proper reflection id.
UI incorrectly shows non-bigint fields as options for incremental update columns
Previously, the UI suggested incremental refresh on date and timestamp fields for Mongo and Elastic sources, but the associated reflections then failed to refresh with an error.
Creating a reflection on a table with decimal fields will fail
Dremio no longer fails when expanding the materialization while creating a reflection on a table with decimal fields.
Reflection matching would occasionally fail with a silent StackOverflowError
Fixed an issue where a StackOverflowError would get generated when executing a query using the RANK function, if a matching reflection is present.
Acceleration page might appear hung without an indication of work-in-progress
Fixed issue where the reflections screen has no progress icon while loading reflections.
Reflections are incorrectly marked as invalid despite successful creation
Some reflections were marked as invalid due to an NPE in the JoinAnalyzer code path, despite being created successfully. This issue has been fixed.
Fixed issue where reflection matching sometimes doesn’t match valid joins
Improved JOIN reflection substitution matching to consider a wider set of alternatives.
Failures when matching a reflection affect other reflections for that query
Resolved an issue where a failure in post-substitution processing resulted in all substitutions failing to be registered with the planner. Now, if the failure occurs while processing a single candidate, only that candidate is excluded.
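The behavior described here, isolating a failure to the one candidate that caused it, follows a common pattern that can be sketched in Python. The function and parameter names below are hypothetical and do not correspond to Dremio's internal planner APIs.

```python
def register_substitutions(candidates, post_process):
    """Post-process each candidate independently so one failure excludes only itself.

    `post_process` is a hypothetical callable that may raise for a bad candidate.
    """
    registered, excluded = [], []
    for candidate in candidates:
        try:
            registered.append(post_process(candidate))
        except Exception as exc:  # broad on purpose: any failure excludes this candidate only
            excluded.append((candidate, exc))
    return registered, excluded


if __name__ == "__main__":
    def post_process(name):
        if name == "bad":
            raise ValueError("post-substitution processing failed")
        return name.upper()

    ok, skipped = register_substitutions(["r1", "bad", "r2"], post_process)
    print(ok, [name for name, _ in skipped])
```

Wrapping the loop as a whole in a single try block would reproduce the earlier behavior, where one failing candidate prevented every substitution from being registered.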
Some queries using ANSI JOIN syntax may not be accelerated
Resolved an issue where a query against a table with an INNER JOIN and no corresponding WHERE clause was not accelerated.
Web Application and APIs
UI incorrectly showed wrong port when creating HDFS sources
Changed the default port for HDFS showing up in the UI from 9000 to 8020, which is the default port for HDFS namenode.
Changing refresh policy shows incorrect warning as part of source creation
Changing refresh policy no longer shows a warning as part of source creation.
No error shown when failing to edit a source
Users are now presented with the correct error message.
Data lineage graph is formatted incorrectly
Fixed an issue where the lineage graph wasn’t being rendered correctly.
Cannot download query if using a relative path (via query context)
Fixed an issue where using relative paths when downloading a CSV/JSON file resulted in a “Path not found” error.
If the PDS a VDS is based on is no longer available, user cannot edit original SQL
Fixed an issue where the user couldn’t edit the original SQL if the PDS that a VDS is based on was no longer available.
UI slowness when displaying counts for sources with 100K+ objects
UI behavior has been updated to show up to 500 datasets. If the number of datasets in a source is 500 or more, the UI will display “500+” or “-”.
Slow list rendering in the UI with sources that have many objects
Removed the descendant and dataset count columns in dataset listings to improve UI performance. Both are still accessible per dataset by clicking on the dataset icon to the left of each dataset.
Massive dataset listings in the UI could use too much HEAP memory
Improved performance of massive dataset listings for sources and spaces.
Cannot “Edit Original SQL” when there are errors when querying VDS in UI
The “Edit Original SQL” button should now be visible when a VDS returns an error because a PDS no longer has certain fields available.
Renaming a VDS would sometimes cause error when trying to edit definition
Fixed an issue where altering a VDS definition and then renaming the VDS broke the VDS and produced an error.
UI doesn’t show error when it cannot reach Dremio coordinators
UI now correctly shows a message when it cannot connect to Dremio coordinators.
Source Adapters
Dremio incorrectly ignores cast to varchar on datetime fields against SQL Server
Dremio will now keep cast to varchar intact when pushing down to SQL Server.
SSL-based Elasticsearch sources become unusable if Dremio crashes
Fixed an issue with Dremio connecting to Elasticsearch when self-signed certificates are used in the SSL connection.
Fixed timestamp/date/time comparison in Elasticsearch Painless scripts
Fixed literal comparison for timestamp/date/time types.
Redshift source connections occasionally become stale
Fixed a scenario where we occasionally received a stale connection when using Redshift causing dataset refreshes to fail and return no tables.
Truncate function is incorrectly translated when working with Oracle sources
Translation of the TRUNCATE function in Oracle is now fixed.
Sub-optimal memory characteristics when retrieving aggregation results from Elasticsearch
The Elasticsearch reader has been optimized to leverage off-heap memory when reading aggregation results.
SQL Server date/time field pushdowns lose precision
There was an issue with SQL Server pushdown losing milliseconds precision. These are now handled correctly.
Execution
DAYOFWEEK(…) and its variants incorrectly return Monday as 1, etc.
Previously, DAYOFWEEK in Dremio returned 1=Monday,…,7=Sunday. It now returns 1=Sunday,…,7=Saturday.
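Stored results or downstream logic that relied on the earlier numbering may need to be remapped. The following Python sketch models the two conventions; the function names are hypothetical, and Python's date.isoweekday() (1=Monday through 7=Sunday) stands in for the earlier behavior.

```python
from datetime import date


def dayofweek_old(d: date) -> int:
    """Earlier numbering: 1=Monday ... 7=Sunday (same as ISO weekday)."""
    return d.isoweekday()


def dayofweek_new(d: date) -> int:
    """Current numbering: 1=Sunday ... 7=Saturday."""
    return d.isoweekday() % 7 + 1


def old_to_new(n: int) -> int:
    """Convert a value produced under the earlier numbering to the current one."""
    if not 1 <= n <= 7:
        raise ValueError("day of week must be in 1..7")
    return n % 7 + 1


if __name__ == "__main__":
    sunday = date(2018, 7, 1)
    print(dayofweek_old(sunday), dayofweek_new(sunday))
```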
Start index of 3-arg POSITION() function is treated as zero-based
Fixed start index argument of POSITION function to be 1-based.
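The difference between a zero-based and a 1-based start index is easy to see in a small Python model of a three-argument position lookup. The helper below is hypothetical and only illustrates the 1-based convention; edge-case behavior in Dremio (for example, a start index beyond the end of the string) is not specified here and is an assumption of the sketch.

```python
def position(substr: str, s: str, start: int = 1) -> int:
    """Return the 1-based index of `substr` in `s`, searching from 1-based `start`.

    Returns 0 when the substring is not found.
    """
    if start < 1:
        raise ValueError("start is 1-based and must be >= 1")
    # str.find is zero-based and returns -1 when nothing is found,
    # so shift by one on the way in and on the way out.
    return s.find(substr, start - 1) + 1


if __name__ == "__main__":
    print(position("an", "banana"))      # search from the first character
    print(position("an", "banana", 2))   # start at the second character, 'a'
    print(position("an", "banana", 3))   # start at the third character, 'n'
```

With a 1-based start, a start of 2 begins at the first "a" of "banana" and still finds the match at position 2. Treating that same value as zero-based would skip that match and report position 4.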
Query failure when columns of a dataset being used downstream are changed
Fixed an exception raised when a VDS depends on another VDS and the columns of the underlying VDS are changed.
Schema learning would not function as expected on Parquet datasets
Fixed an issue where the schema learning wasn’t working as expected on some Parquet datasets.
“Failure while reading sort spilling files” error when doing sort spilling
Fixed a memory leak that occurred when reading sort spilling files failed.
Issue with reading parquet page blocks of size > 1024
Dremio can now read Parquet page blocks of size > 1024, such as the ones generated by fastparquet.
Filter pushdown issues when using Impala written Parquet files
Dremio now specifically accommodates varbinary & varchar fields written by Impala during filter pushdowns.
2.1.4 Release Notes
Bug Fixes
SQL Server sources cannot be removed under some scenarios and would result in Dremio hanging
Dremio now allows SQL Server sources to be removed successfully, even after a restart.
If the user had any unsaved changes “Edit Original SQL” would sometimes not work as expected
‘Edit Original SQL’ should now go into edit mode for the latest version of the dataset.
Issue on Hive 1.2.1 with SQL Standard-based Authorization enabled where trying to fetch the functions ‘get_functions’ would fail
Dremio now avoids fetching the get_function function via modification to the Hive code.
A JOIN query with FILTER that gets pushed down to Oracle data sources can take a very long time to execute
Joins with filter now complete in a more reasonable amount of time.
Dremio cluster can become unstable while upgrading to 2.0.11 or 2.1.2
Improved stability by resolving an issue with the size of the KV store growing unnecessarily.
Accessing Hive transaction tables in ORC format that are bucketed and are type transactional creates unreadable files in HDFS
Dremio can now read and query Hive transactional tables with bucketing enabled.
Dremio community edition for Windows is unable to execute queries
Fixed an issue with the communication between coordinator and executor nodes where the query planning would complete, but execution would never begin.
Creating external reflection would give “unexpected error occurred” message
Dremio now allows external reflections to be successfully created.
Agg reflections can fail with error “Failure while reading sort spilling files, on mapr cluster”
Fixed Agg Reflections failure due to sort spill file I/O issue on MapR.
2.1.6 Release Notes
Bug Fixes
RAW reflections in version 2.1.4 were not being matched properly
Raw reflections are now successfully matched and used for acceleration.
Agg reflections in version 2.1.4 were not being substituted as expected
Agg reflections are now successfully substituted as part of the query plan.
Dragging the mouse in the Histogram UI screen doesn’t behave properly
Moving the mouse on the Histogram screen now allows the chart to be moved as expected.
If a VDS name contains ‘/’, the upgrade from 2.0.5 to 2.1.4 will fail.
Fixed an upgrade issue caused by VDS names containing ‘/’.
Compatibility with CentOS 6 in Dremio v2.1 is broken
Fixed CentOS 6 compatibility issue with a jna library built with GLIBC2.14.
Reindexing on startup after a crash was slow
Improved the performance of reindexing on startup after a crash.
Rpc Exception system errors were being caused by out-of-memory
Resolved an issue due to a memory leak.