2.0 Release Notes
When creating reflections on datasets with joins, Dremio now keeps statistics and detects relationships for each join (e.g. 1-1, many-1). If the joins are non-expanding, Dremio can leverage this property to accelerate a larger set of queries. For example, if a user creates a reflection on a dataset that joins a fact table with three dimension tables, given that this reflection meets the above criteria, Dremio can accelerate queries that include any subset of these joins (e.g. fact table joined with just one of the dimension tables), without having to define multiple reflections.
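To make "non-expanding" concrete, here is a minimal sketch, assuming a join is non-expanding when every fact-side key matches at most one dimension row. The function name and sample data are hypothetical, and this only illustrates the property, not how Dremio detects it.

```python
from collections import Counter


def is_non_expanding(fact_keys, dim_keys):
    """Return True if no fact key matches more than one dimension row,
    so joining cannot multiply the fact rows."""
    dim_counts = Counter(dim_keys)
    return all(dim_counts[k] <= 1 for k in fact_keys)


# Hypothetical data: order rows referencing a customer dimension.
fact_customer_ids = [1, 2, 2, 3]
unique_dim_ids = [1, 2, 3]        # many-to-1: each key appears once
duplicated_dim_ids = [1, 2, 2, 3]  # key 2 appears twice: join would expand

print(is_non_expanding(fact_customer_ids, unique_dim_ids))      # True
print(is_non_expanding(fact_customer_ids, duplicated_dim_ids))  # False
```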
Dremio now supports external reflections: the ability to leverage summary tables or other digests built in external systems within Dremio’s reflection framework. Once defined using SQL commands, datasets in any of the data sources that Dremio supports may be leveraged by Dremio’s cost-based optimizer to accelerate queries.
New Reflection Management Engine
The new reflection management engine provides improved scalability, debuggability and resilience. The engine automatically optimizes the prioritization, ordering and queueing of reflection refreshes, and provides sophisticated error recovery. The new reflection management framework also features:
- Better handling of missing dependencies: Dremio is now resilient to scenarios where a reflection fails to refresh because the data source is down or because the reflection is on an empty table.
- Better handling of cyclical dependencies: If the user creates two reflections that can substitute for one another, Dremio ensures that only one substitutes for the other.
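One way to picture the cyclical-dependency rule is as keeping a substitution graph acyclic. The sketch below is only an illustration under that assumption: the function name and the (user, provider) pair representation are hypothetical and not taken from Dremio.

```python
def acyclic_substitutions(candidates):
    """Keep (user, provider) substitution pairs in order, skipping any pair
    that would let two reflections substitute for each other in a cycle."""
    graph = {}

    def reaches(src, dst, seen):
        # Depth-first search over the substitutions accepted so far.
        if src == dst:
            return True
        seen.add(src)
        return any(
            reaches(nxt, dst, seen)
            for nxt in graph.get(src, ())
            if nxt not in seen
        )

    kept = []
    for user, provider in candidates:
        if reaches(provider, user, set()):
            continue  # accepting this pair would close a cycle
        graph.setdefault(user, set()).add(provider)
        kept.append((user, provider))
    return kept


# Reflections A and B could each substitute for the other; only one pair survives.
print(acyclic_substitutions([("A", "B"), ("B", "A")]))  # [('A', 'B')]
```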
New reflection system tables
Reflection information can now be programmatically accessed using the following tables:
Reflections now pick up schema updates to underlying datasets in many cases
For example, Dremio will automatically update reflection definitions when column types change or when columns not referenced in a reflection are dropped.
Reflection REST API
Users can now create and manage reflections using the new Reflection REST API.
Improved reflection statuses
Individual reflections now offer more detailed status information. This includes whether a reflection is ready to be used for acceleration, status of refreshes associated with that reflection, information about refresh failures and whether there were schema changes to the underlying dataset after a reflection was created.
Support for enabling/disabling individual reflections
Reflections can now be enabled/disabled individually. The reflection administration page has also been overhauled and now lists reflections individually, grouped by dataset.
Manual reflection refresh
Users can request an immediate refresh of all reflections that depend on a given dataset from the UI and REST API.
Propagation of reflection refresh interval changes
Datasets in a data source now inherit the refresh interval settings from their parent source whenever the settings for that source change. This behavior is disabled for a dataset if users change the settings on that dataset directly.
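The inheritance rule can be sketched as a dataset that reads its interval from the parent source unless it carries its own override. The class and field names here are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    refresh_interval_hours: int


@dataclass
class Dataset:
    source: Source
    # Set only when a user changes the setting on the dataset directly.
    override_hours: Optional[int] = None

    @property
    def refresh_interval_hours(self) -> int:
        if self.override_hours is not None:
            return self.override_hours
        return self.source.refresh_interval_hours


source = Source(refresh_interval_hours=24)
inherited = Dataset(source)
customized = Dataset(source, override_hours=1)

source.refresh_interval_hours = 6  # change the setting on the source
print(inherited.refresh_interval_hours)   # 6
print(customized.refresh_interval_hours)  # 1
```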
Ability to skip reflection recommendations
Reflection recommendation generation can now be skipped.
Mixed type fields can now be used as part of reflections
Dremio now allows users to use mixed data type fields as part of reflection definitions.
Improved reflection suggestions based on data profile
Reflection suggestion logic has been updated to provide better recommendations for a variety of data profiles.
Reflections no longer rely on moves to support atomic updates
Reflection creation logic has been updated to not rely on moves to support atomic updates. This also affects CTAS queries.
Web Application and APIs
Error highlighting in SQL Editor
The SQL editor in Dremio’s UI now includes error highlighting capabilities.
SQL REST API
Users can now execute queries in Dremio using the new SQL REST API. Once submitted, queries can be polled for progress and once completed their results can be retrieved.
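The submit, poll and retrieve flow might look like the sketch below. The endpoint paths, the `id` and `jobState` field names and the terminal state names are assumptions that are not stated in these notes; check them against the API reference for your version. The `post` and `get` callables are hypothetical helpers that the caller supplies to perform authenticated HTTP requests and return parsed JSON.

```python
import time

# Assumed terminal job states; not taken from these release notes.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def run_query(sql, post, get, poll_seconds=1.0, sleep=time.sleep):
    """Submit a query, poll the job until it finishes, then return its results.

    post(path, body) and get(path) are caller-supplied (hypothetical) helpers;
    the paths used below are assumptions.
    """
    job_id = post("/api/v3/sql", {"sql": sql})["id"]
    while True:
        state = get(f"/api/v3/job/{job_id}")["jobState"]
        if state in TERMINAL_STATES:
            break
        sleep(poll_seconds)  # wait before polling for progress again
    if state != "COMPLETED":
        raise RuntimeError(f"job {job_id} ended in state {state}")
    return get(f"/api/v3/job/{job_id}/results")
```

Passing the HTTP helpers in as arguments keeps authentication and transport details out of the polling logic and makes the flow easy to exercise with stubs.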
Catalog REST API
Users can now browse the Dremio catalog as well as create and modify sources, spaces, folders and datasets using the new Catalog API.
Job REST API
Users can now get status and results for a specific job using the new Job API.
Dataset Votes (EE only)
Admins now have the ability to see all Datasets that have votes to understand acceleration demand and dataset popularity.
Jobs Page now lists enqueued status
The Jobs page now lists enqueued, planning and running statuses separately, instead of listing them all as running.
Coordination and Metadata
Improved ODBC/JDBC metadata performance
Metadata calls from ODBC and JDBC clients should now perform better due to caching and retrieval optimizations.
Improved INFORMATION_SCHEMA retrieval performance
INFORMATION_SCHEMA queries should now perform better due to caching and retrieval optimizations.
Improved handling of system, Java and environment variables for YARN deployments
Users can now separately specify system, Java and environment variables as a part of deploying executors via YARN. Previously, environment variables could not be passed in YARN deployments.
Ability to start Dremio in the foreground
Dremio’s startup script has been updated to support running in the foreground. This can be accessed using the dremio start-fg command.
Improved systemd support
Dremio now packages a tmpfiles.d file in its RPM installation, which ensures that systemd re-creates /var/run/dremio after a system restart.
Improved partition pruning planning
During query planning, partition pruning evaluation is cached for improved planning performance.
Faster S3 source connections
The mechanism for cataloging S3 buckets has been improved to provide faster initial S3 connection times.
Ability to use LDAP group RDNs as simple names
Dremio can now use group RDNs as simple names when using LDAP integration.
LDAP group query optimization
Admins can now optionally enable the pruneUnreachableGroups option under Group Attributes in the LDAP configuration file (ad.json) to optimize LDAP group queries.
Improved metadata change probing
Better detection of metadata changes that require full metadata updates results in faster metadata update times, due to a decreased need for larger metadata operations.
Java 8 is now required to run Dremio
Dremio now requires Java 8. Java 7 is no longer supported.
In cluster deployments a given node may only have a single role
Multiple roles per node are not supported in cluster deployments. A node has to be either a coordinator or an executor.
Amazon S3 upload performance and memory improvements
Performance and memory usage when creating reflections on Amazon S3 have been greatly improved.
Improved source configuration change impact warnings
Dremio now warns users when they make configuration changes to a source that will cause existing reflection, format and sharing settings to be cleared. Dremio also avoids additional operations if the changes do not impact metadata, leading to faster update times.
Add support for ALTER SOURCE REFRESH STATUS
Users can now trigger the refresh of a source’s status through a SQL command.
Automated source status monitoring and recovery
Dremio now monitors the status of data sources, reports on problematic states and frequently attempts to reconnect to the source.
Option to ignore Elasticsearch scroll result count mismatches
Occasionally, Elasticsearch returns an incorrect number of reported hits. By default, Dremio fails such queries to protect against incorrect results. There is now an option to disable this behavior for Elasticsearch sources.
Automatic query retry when Elasticsearch alias definition changes
Dremio now monitors the validity of Elasticsearch aliases and automatically updates its metadata at query time if a change is found. Dremio then re-executes the query with the new metadata.
Automatic query retry when RDBMS schema changes are detected
Dremio now monitors changes to RDBMS schemas and automatically updates its metadata at query time if a change is found. Dremio then re-executes the query with the new metadata.
Optimized IN clause handling
IN clause performance has been greatly improved through both execution-level optimizations and better query planning for queries that include IN clauses.
Executor internal CPU containerization
Dremio can now limit the number of CPU cores that an executor can access. This is available for all deployment models (YARN, bare metal, etc.).
Upgraded to latest version of Arrow (0.9)
Upgraded to latest version of Arrow (0.9). This also moves our decimal memory format to little endian. We’ve also added backwards compatibility support for clients that don’t have support for the latest version of Arrow.
Dictionary encoding is disabled by default for reflections and $scratch tables
To ensure optimal heap usage, dictionary encoding for reflections and $scratch tables has been turned off by default. This option can be controlled using the store.parquet.enable_dictionary_encoding support key in Admin > Advanced Settings.
Adaptive batch sizing depending on dataset width
By default, Dremio now dynamically adjusts batch sizes for intra-node communication. This reduces memory usage for wide tables.
Query profiles now include additional planning information
Default query profiles have been updated to include additional planning information.
Improved diagnostic reporting when queries are canceled due to memory limits
We’ve improved the way Dremio accounts for memory, and Dremio now records a wider set of telemetry, including node memory usage details in addition to the existing query memory usage.
Improved early query termination support
To optimize and minimize resource usage, Dremio may terminate queries early in some cases. For example, if one side of a join is evaluated to be empty, Dremio will not continue to process the other side.
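The join case can be illustrated with a small hash join that never reads its second input when the first turns out to be empty. This is a minimal sketch of the idea, not Dremio's operator; the function and row format are hypothetical.

```python
def inner_join(build_rows, probe_rows, key):
    """Inner hash join that skips the probe side when the build side is empty."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    if not table:
        return  # an inner join with an empty side cannot produce rows
    for row in probe_rows:
        for match in table.get(row[key], ()):
            yield {**match, **row}


customers = [{"id": 1, "name": "Ada"}]
orders = [{"id": 1, "total": 30}, {"id": 2, "total": 5}]
print(list(inner_join(customers, orders, "id")))
# [{'id': 1, 'name': 'Ada', 'total': 30}]
```

Because `probe_rows` is only iterated after the emptiness check, a lazily produced probe side is never evaluated when the build side has no rows.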
Restart of master node causes reflection refresh to be started
In some cases, restarting the master/coordinator node used to cause reflections that were not due for refresh to be refreshed prematurely.
Deleting datasets may leave orphan reflections
Deleting datasets with reflections would, in certain cases, not remove the reflections associated with that dataset.
Reflections are sometimes matched but not used when working with joins
In some cases, where portions of a query can be accelerated, Dremio would match reflections but would not use them. Planning logic has been updated to fix this.
Occasional sub-optimal query plans when aggregation reflections are available for a dataset
When working against datasets with aggregation reflections, query performance would sometimes degrade due to bad plan choices. This issue is now fixed.
Reflection suggestion analysis would sometimes fail
Reflection suggestion analysis jobs would sometimes fail when encountering non-UTF8 characters. This is now fixed.
Creating a reflection sometimes fires more than one job for the same reflection
Dremio now guarantees that no more than a single refresh job will be running at a time for a particular reflection.
Coordination and Metadata
INFORMATION_SCHEMA and JDBC/ODBC metadata user-level filtering
Metadata returned by both INFORMATION_SCHEMA tables and JDBC/ODBC calls is now filtered to only include items that the user has access to view.
Improved isolation of bad data sources
Problematic data sources are now identified faster and do not impact metadata retrieval for other data sources in the system.
Incorrect username case handling when using LDAP
If a user tried to log in for the first time using their username in the wrong case, that user’s home space would fail to initialize. This is now correctly handled.
Excessive planning time for queries with many joins when relevant reflections are available
Queries that included many joins, on datasets with reflections defined, could have excessively long planning times. This issue is now fixed.
Distributed storage paths are now only controlled by the master node configuration
Previously, distributed storage paths (i.e. paths.dist) were determined by the last node launched in the cluster.
Metadata issues when accessing datasets from Microsoft Power BI
Sometimes, if a dataset hadn’t been queried from Dremio before, trying to access it from Microsoft Power BI caused an error. This behavior is now fixed.
Multiple window functions on virtual datasets would cause failures
Queries on virtual datasets that included multiple window functions would cause “400 - Bad request” from Dremio’s server. This issue is now addressed.
Filter simplification optimization causes query planning failures for complex queries
In cases where Dremio’s planner simplifies a set of filters to false (e.g. x < 10 AND x > 10), query planning would fail when the query plan includes multiple phases. This issue is now addressed.
Provisioning screen would sometimes throw Version of submitted Cluster does not match stored exception
The provisioning screen would sometimes throw a Version of submitted Cluster does not match stored exception. This is now fixed.
JDBC clients would get blocked indefinitely when receiving invalid message
In rare occurrences when JDBC clients received invalid messages, they were blocked indefinitely. This behavior has been fixed.
dremio.conf references were layered incorrectly
In cases where an option in dremio.conf is referenced by another option, the default value for the referenced option would be used instead of the user-defined version. This issue has been fixed.
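The corrected layering order can be sketched as: apply user-defined values over the defaults first, then resolve references. The resolver below is a simplified illustration that only handles values consisting entirely of a `${key}` reference; the function and the `base` and `derived` keys are hypothetical, not real dremio.conf options.

```python
def resolve_config(defaults, user):
    """Layer user settings over defaults, then resolve ${key} references."""
    merged = {**defaults, **user}  # user-defined values win before resolution

    def lookup(key, seen):
        raw = merged[key]
        is_ref = isinstance(raw, str) and raw.startswith("${") and raw.endswith("}")
        if not is_ref:
            return raw
        target = raw[2:-1]
        if target in seen:
            raise ValueError(f"circular reference at {key}")
        return lookup(target, seen | {target})

    return {key: lookup(key, {key}) for key in merged}


defaults = {"base": "/var/lib/default", "derived": "${base}"}
user = {"base": "/data/custom"}
print(resolve_config(defaults, user))
# {'base': '/data/custom', 'derived': '/data/custom'}
```

Resolving before merging would instead leave `derived` pointing at the default value of `base`, which is the behavior described as incorrect above.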
NOT (a IN …) syntax would cause query failures
Queries using the NOT (a IN …) syntax now run without issues.
Compatibility issue between JRuby and JDBC driver
When trying to use Dremio’s JDBC driver from JRuby, users would get a NoClassDefFoundError. This issue has been fixed.
Frequent KVStore flushes cause master/coordinator node performance degradation
Excessive KVStore flushes caused master/coordinator node performance degradation due to increased I/O. The flush logic has been updated to be more conservative.
Dremio startup command might pick the wrong Java binary
In some cases, even if the JAVA_HOME variable was set, the Dremio startup command would pick the wrong Java binary. This logic has been updated to always first check whether $JAVA_HOME/bin/java/ is available before searching for alternatives.
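The lookup order described above could be sketched as follows. This is an illustration in Python rather than the actual startup script; the function name is hypothetical, and the fallback to a PATH search is an assumption about what "searching for alternatives" means.

```python
import os
import shutil


def find_java(env=None):
    """Prefer $JAVA_HOME/bin/java when it exists and is executable,
    otherwise fall back to searching PATH (assumed fallback)."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("java", path=env.get("PATH"))


print(find_java())  # path to the selected java binary, or None if none is found
```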
Dataset metadata is marked as expired after coordinator node restart
This behavior caused performance degradation for the initial set of queries after a restart due to cost for in-line metadata fetch. This issue has been addressed.
Restart of master node may cause failed jobs to be marked as “In Progress”
These types of jobs are now correctly marked as “Failed”.
$_dremio_$_update_$ field shows up incorrectly in INFORMATION_SCHEMA and ODBC/JDBC metadata calls
For datasets based on file-system sources, the $_dremio_$_update_$ field may show up incorrectly as part of dataset metadata in INFORMATION_SCHEMA and ODBC/JDBC metadata calls. This would cause query failures.
Exception when using OVER clauses in RDBMS sources
Using OVER clauses and window functions with RDBMS sources would previously sometimes fail with the error Cannot convert RexNode to equivalent Dremio expression. These types of queries now succeed.
CASE statements on Elasticsearch sources would sometimes handle NULL values incorrectly
Queries including CASE statements on Elasticsearch sources would sometimes handle NULL values incorrectly. This issue is now fixed.
Issue working with conflicting types for the same field when using an Elasticsearch alias
Working with nested fields with the same name of different types in an Elasticsearch alias used to cause queries to fail. Dremio now correctly ignores such nested fields.
Incorrect handling of NULLs when pushing down not-equals expression to Elasticsearch
Not-equals expressions would previously handle NULLs incorrectly when pushed down to Elasticsearch sources. Pushdown logic has been updated to address this.
Performance issues with Elasticsearch queries including a LIMIT clause
Previously, Dremio would keep fetching until Dremio’s internal batch size was reached. This logic has been updated to terminate once the user-requested limit has been reached.
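A minimal sketch of limit-aware fetching, assuming results arrive page by page from a caller-supplied function. `fetch_page` is a hypothetical stand-in for a scroll request and is not an Elasticsearch or Dremio API.

```python
def fetch_with_limit(fetch_page, limit):
    """Pull pages until `limit` rows are collected or the source is exhausted."""
    rows = []
    while len(rows) < limit:
        page = fetch_page()
        if not page:
            break  # source has no more rows
        rows.extend(page[: limit - len(rows)])
    return rows


pages = iter([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(fetch_with_limit(lambda: next(pages, []), limit=4))  # [1, 2, 3, 4]
```

With a limit of 4 and pages of 3 rows, only two pages are requested; the third is never fetched.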
Elasticsearch queries are sometimes under-parallelized
Planning logic for Elasticsearch queries has been updated for a variety of use-cases to ensure optimal parallelization.
Avoid cancelling queries due to out of memory in some large sort queries
When spilling data in external sort, there were cases where we were allocating more memory than we had reserved. We tuned the memory allocation algorithm in the sort code to be more adaptive by stepping down the requirement.
Failures related to running FLATTEN with reflections enabled
When using FLATTEN with variable width data, there were cases where we were using an incorrect length for the variable width data. This resulted in internal failures related to over-allocation of memory. The problem is now fixed.
Handle reference count failures in reflection materialization
Some reflection materialization jobs were failing due to improper handling of references to internal buffers. The problem is now fixed.
Query failure when joining maps of maps
When working with nested map type data (a map inside a map), the memory for the inner data was being allocated twice. The problem is now fixed.
Limit result records for queries “Run” in the UI
When querying large data sets through the UI, Dremio now limits the number of records being returned. The job details indicate whether the limit has been reached.
Better handle deeply nested lists in JSON
When handling lists in JSON data, our heuristics for allocating memory for nested lists were not optimal, and we ended up making memory allocation requests larger than allowed by the OS/JVM. We made a few changes in our logic to handle this better.
Using KVGEN function causes ClassCastException
KVGEN function may fail with a ClassCastException. This issue is now fixed.
If all executors associated with an active job are disconnected, Dremio accidentally marks the job as completed
If all executors associated with an active job are disconnected, Dremio accidentally marks the job as completed. This behavior has been improved to better monitor executor statuses and correctly mark the job as “Failed”.
Incorrect handling of negative decimal values from Parquet files
When working with negative decimal values from Parquet files, Dremio would fail to interpret them correctly. This issue is now fixed.
Sort operations do not release partially allocated resources when a failure happens
Sort operation logic has been improved to better handle failures and release memory as needed.
Prepared statements cause excessive heap overhead
Prepared statement caching logic has been updated to clear prepared statement handles in case of heap memory pressure.
NULL or empty string values are not handled in reflection and $scratch table partitioning
Partitioning logic has been updated to substitute DREMIO_DEFAULT_NULL_PARTITION__ for NULL partition values and DREMIO_DEFAULT_EMPTY_VALUE_PARTITION__ for empty partition strings.
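The substitution rule can be written as a small mapping function. The two placeholder strings are spelled exactly as they appear in these notes; the function name is hypothetical, and converting other values with `str` is an assumption made for illustration.

```python
# Placeholder names copied as written in the release notes.
NULL_PARTITION = "DREMIO_DEFAULT_NULL_PARTITION__"
EMPTY_PARTITION = "DREMIO_DEFAULT_EMPTY_VALUE_PARTITION__"


def partition_value(value):
    """Map a partition column value to the string used for partitioning."""
    if value is None:
        return NULL_PARTITION
    if value == "":
        return EMPTY_PARTITION
    return str(value)  # assumption: other values are used as-is


print(partition_value(None))  # DREMIO_DEFAULT_NULL_PARTITION__
print(partition_value(""))    # DREMIO_DEFAULT_EMPTY_VALUE_PARTITION__
print(partition_value("US"))  # US
```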
Extended blocking operations cause unbalanced thread scheduling
When tasks become unblocked, Dremio will now better evaluate alternative threads that might be able to complete the work more quickly.
New lines are not rendered correctly in query previews and runs
New lines are now correctly handled and displayed.
Dremio UI incorrectly caches folder contents after name changes
If a folder, source or space is deleted and re-created with the same name under the same path, Dremio would show previous listings instead of current listings. This behavior is fixed.
Admin > Administrators page sometimes does not load
When using LDAP, the Admin > Administrators page would sometimes not load. Error handling logic has been updated to minimize the impact of an individual problematic record.
Cannot remove given permissions
In certain cases, administrators would not be able to remove permissions they’ve given to users or groups. This is now addressed.
Issue running queries without FROM clause in the UI
Dremio UI now correctly handles running queries without a FROM clause, for example a SELECT 1 query.