1.2.2 Release Notes
Compression support for spill operations
Dremio now supports Snappy compression for spill operations. This greatly decreases the disk space needed when spilling.
Improved CONCAT performance
CONCAT operations have been improved to have better memory and performance characteristics.
Issue with skipping pages when reading Parquet files
In a few cases, queries would return an exception when reading Parquet files when attempting to skip pages. The reading logic is now updated to be more robust.
Issue with number of open files during spilling operations
Handling of open files has been improved for all spilling operations. Previously, spilling operations could cause queries to fail with out of memory issues in certain scenarios.
Acceleration of join queries containing aggregations
Join queries containing aggregations will now leverage aggregation reflections on either of the queried datasets by default. Previously, this could lead to slower query times due to sub-optimal planning.
Queries on reflections stored locally are under parallelized
Planning logic is improved to address this issue. These queries will now be planned with the expected parallelization.
Issue with reflections generating excessive number of files
In certain cases, Dremio would generate an excessive number of files when generating reflections. This is now fixed.
Issue with reflections not matching non-ANSI join syntax
Reflection matching logic is updated to work with all supported join syntaxes.
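For readers unfamiliar with the term, the sketch below shows an ANSI join and its non-ANSI (comma plus WHERE) equivalent side by side. It runs against an in-memory SQLite database purely to illustrate the two SQL shapes. The table and column names are made up, and nothing here exercises Dremio or its reflection matcher.

```python
import sqlite3

# Hypothetical tables, used only to show the two join spellings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    INSERT INTO customers VALUES (10, 'Ann'), (20, 'Bob');
""")

# ANSI syntax: explicit JOIN ... ON.
ansi_join = """
    SELECT o.id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.id
    ORDER BY o.id
"""

# Non-ANSI syntax: comma-separated tables with the join condition in WHERE.
non_ansi_join = """
    SELECT o.id, c.name
    FROM orders o, customers c
    WHERE o.customer_id = c.id
    ORDER BY o.id
"""

ansi_rows = conn.execute(ansi_join).fetchall()
non_ansi_rows = conn.execute(non_ansi_join).fetchall()
```

Both queries describe the same join, which is why a matcher is expected to treat them alike.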
Issue with count(*) queries on Mongo DB sources
Count(*) queries on Mongo DB sources now get pushed down correctly.
1.2.1 Release Notes
New reflection matching algorithm
New reflection matching algorithm. Main improvements:
- Efficient search: More efficient and performant search for matching reflections as queries are submitted.
- All exploring: Explore a larger search space to maximize utilization of existing reflections.
- Better complex query coverage: Better matching of complex query patterns, including combinations of multiple types of joins, unions, and aggregations.
Reflection dependency graph
Dremio keeps track of all reflection dependencies and maximizes use of other reflections when materializing a given reflection. Dependent reflections are scheduled to materialize in sequences to minimize scans of the underlying data source as well as optimize resource utilization in Dremio. Reflection materialization jobs are also more resilient to one-off failures with an intelligent re-attempt mechanism.
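The idea of materializing dependent reflections in sequence can be pictured as a topological ordering of a dependency graph. The following is a minimal sketch using Python's standard graphlib module (Python 3.9+). The reflection names and the dependency map are hypothetical, and this is not a description of Dremio's actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical reflections: each key depends on the reflections in its set.
dependencies = {
    "raw_sales": set(),
    "agg_sales_by_day": {"raw_sales"},
    "agg_sales_by_month": {"agg_sales_by_day"},
    "raw_customers": set(),
}

def materialization_order(deps):
    """Return an order in which every reflection follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = materialization_order(dependencies)
```

With this ordering, a monthly aggregate could be built from the daily aggregate rather than from another scan of the source.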
Reflection materialization workload management
Administrators can now limit concurrency and memory for different types of reflection materialization workloads based on workload cost. This ensures optimal performance for other workloads, such as ad-hoc and reporting jobs, when materializing reflections and also provides flexibility to accommodate various types of scenarios.
Reflection materialization strategies
Reflections now have two reflection materialization strategies: minimize number of files produced or minimize refresh time. With this added option, reflection materializations can now be optimized for different types of usage goals.
Acceleration refresh policy enhancements
Acceleration refresh policy now supports expiration time. Reflections can now be served until expiration regardless of the refresh interval. The separation of refresh interval from expiration time provides flexibility to serve older reflections in cases of source/refresh related issues.
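The separation between refresh interval and expiration time can be sketched as two independent checks on the age of a materialization. This is an illustration only: the function name, the parameter names, and the assumption that both timers start at the last successful refresh are not taken from Dremio.

```python
from datetime import datetime, timedelta

def reflection_state(last_refresh, now, refresh_interval, expire_after):
    """Illustrative only: both timers are assumed to start at last_refresh."""
    age = now - last_refresh
    return {
        "refresh_due": age >= refresh_interval,  # a new refresh should be attempted
        "servable": age < expire_after,          # the old copy may still answer queries
    }

# Arbitrary example values.
refreshed = datetime(2017, 10, 1, 0, 0)
policy = dict(refresh_interval=timedelta(hours=1), expire_after=timedelta(hours=3))

stale_but_servable = reflection_state(refreshed, refreshed + timedelta(hours=2), **policy)
expired = reflection_state(refreshed, refreshed + timedelta(hours=4), **policy)
```

In the first case a refresh is overdue (for example because the source is unreachable) but the older reflection can still be served. In the second case it can no longer be served.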
Improved acceleration diagnostics
Improved diagnostics about all considered, matched and chosen reflections for a given job for easier troubleshooting. This also includes optional tracing that provides fine-grained details for all considered reflections.
Ability to name reflections
Reflections can now be named for easier troubleshooting and identification.
Improved large record performance
Dremio now adaptively resizes the number of records it processes per unit of work based on the number of columns, the depth of complex fields, and value sizes. This greatly improves performance and memory characteristics when working with very large records.
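As a rough illustration of adaptive batch sizing, the sketch below derives a record count from an estimated record size and a target batch size. The formula, the defaults, and the function name are all assumptions made for this example. It ignores the depth of complex fields and is not Dremio's actual heuristic.

```python
def records_per_batch(column_count, avg_value_bytes,
                      target_batch_bytes=1_000_000,
                      min_records=1, max_records=4096):
    """Hypothetical heuristic: fit as many records as the byte budget allows."""
    estimated_record_bytes = max(1, column_count * avg_value_bytes)
    wanted = target_batch_bytes // estimated_record_bytes
    # Clamp so tiny records do not create huge batches and huge records
    # still make progress one record at a time.
    return max(min_records, min(max_records, wanted))

narrow = records_per_batch(column_count=10, avg_value_bytes=8)     # small records
wide = records_per_batch(column_count=500, avg_value_bytes=200)    # very large records
```

Under these assumed numbers, narrow records hit the upper cap while wide records are processed only a handful at a time.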
Smart spill management
Dremio now analyzes space utilization for all given spill directories and provides configuration for utilization thresholds across all disks. Dremio does not spill to any of the disks above this threshold and returns an exception if all are over-utilized.
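The threshold behavior described above can be sketched as a filter over the configured spill directories. The function names, the use of shutil.disk_usage, and the example paths are assumptions for illustration and do not reflect Dremio's implementation or configuration keys.

```python
import shutil

def disk_utilization(path):
    """Fraction of the disk holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def usable_spill_dirs(spill_dirs, max_utilization, measure=disk_utilization):
    """Keep directories at or below the threshold; fail if none qualify."""
    usable = [d for d in spill_dirs if measure(d) <= max_utilization]
    if not usable:
        raise RuntimeError("all spill directories exceed the utilization threshold")
    return usable

# Hypothetical utilization figures, injected instead of touching real disks.
fake_usage = {"/spill/a": 0.95, "/spill/b": 0.40}
chosen = usable_spill_dirs(fake_usage, 0.80, measure=fake_usage.get)
```

Passing the measurement function as a parameter keeps the sketch testable without real disks.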
Support for union and union all in subqueries
Union and union all operators are now supported in subqueries.
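A query of the kind this enables might look like the one below. The example runs on an in-memory SQLite database only to show the query shape. The tables are hypothetical, and the snippet does not connect to Dremio.

```python
import sqlite3

# Hypothetical tables, used only to show UNION ALL inside a subquery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders_2016 (customer_id INTEGER);
    CREATE TABLE orders_2017 (customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders_2016 VALUES (1);
    INSERT INTO orders_2017 VALUES (3);
""")

query = """
    SELECT name FROM customers
    WHERE id IN (
        SELECT customer_id FROM orders_2016
        UNION ALL
        SELECT customer_id FROM orders_2017
    )
    ORDER BY name
"""
names = [row[0] for row in conn.execute(query)]
```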
Improved Parquet writer performance
Parquet writer memory management has been improved to better handle low-memory conditions. This provides an improved memory footprint when materializing reflections. Also, additional logic was introduced to avoid generating parquet files that are too large or too small, which improves performance of queries that use reflections.
Parquet reader enhancements
Parquet reader will now utilize more direct memory instead of heap space whenever possible to reduce heap usage and garbage collection frequency and therefore improve performance. Also, parallelization is now better managed depending on the number of columns.
Sorting memory management enhancements
Sorting operations are now more memory efficient. Dremio now cleans up temporary files more frequently to reduce the overall footprint.
Coordination and metadata
Source metadata caching enhancements
Improved metadata update and caching behavior for data sources. This includes separation of Dataset Discovery (table, schema and dataset names) and Dataset Details (fields, types, shards information, statistics) to provide better query performance with the ability to lazily load detailed metadata for data sources, especially with large catalogs. Also, the new ‘expire after’ parameter specifies the time after the last successful metadata refresh that queries will stop using the old metadata -- after such time, metadata will be fetched in-line with the query.
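The combination of lazy loading and an 'expire after' window can be pictured as a small cache that fetches dataset details on first use and refetches them in-line once they are too old. The class, its parameters, and the loader callback below are hypothetical and only sketch the idea.

```python
import time

class DatasetDetailsCache:
    """Illustrative cache: details load lazily and expire after a fixed window."""

    def __init__(self, load_details, expire_after_s, clock=time.monotonic):
        self._load = load_details        # hypothetical callback that queries the source
        self._expire_after_s = expire_after_s
        self._clock = clock
        self._entries = {}               # dataset name -> (details, fetched_at)

    def get(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is None or now - entry[1] >= self._expire_after_s:
            # Missing or expired: fetch in-line with the query.
            entry = (self._load(name), now)
            self._entries[name] = entry
        return entry[0]

cache = DatasetDetailsCache(lambda name: {"dataset": name}, expire_after_s=3600)
details = cache.get("sales")
```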
Improved S3 metadata performance
Optimized and reduced metadata calls when working with S3 sources. This greatly improves metadata retrieval and planning performance when working with datasets that consist of a large number of files.
Improved performance on large-scale clusters
Optimized and improved node communication when working with hundreds of nodes. This substantially improves query startup times and memory footprint.
Improved planning performance when working with many files
Improved performance when working with datasets that have a very large number of files (100K+).
Support for configurable ZooKeeper timeout
Ability to configure ZooKeeper timeout for Dremio nodes. This provides flexibility in scenarios where cluster nodes have high CPU utilization or low available memory.
Support for managed Elasticsearch deployments
Ability to whitelist the Elasticsearch nodes that Dremio will connect to and send requests to. This enables working only with gateway nodes, such as in managed Elasticsearch deployment scenarios.
Support for querying across Elasticsearch indexes with type conflicts for the same field
Dremio now supports querying across multiple indexes that have conflicting field types for the same field name. Such fields will be ignored.
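The effect of ignoring conflicting fields can be sketched as a merge over per-index mappings that drops any field whose type differs between indexes. The index names, field names, and the function are hypothetical, and this is not Dremio's actual mapping logic.

```python
def queryable_fields(index_mappings):
    """Merge field types across indexes, dropping fields with conflicting types."""
    seen = {}
    conflicted = set()
    for mapping in index_mappings.values():
        for field, es_type in mapping.items():
            if field in seen and seen[field] != es_type:
                conflicted.add(field)
            seen.setdefault(field, es_type)
    return {f: t for f, t in seen.items() if f not in conflicted}

# Hypothetical mappings: "status" is an integer in one index, a keyword in the other.
mappings = {
    "logs-2017-09": {"status": "integer", "message": "text"},
    "logs-2017-10": {"status": "keyword", "message": "text", "host": "keyword"},
}
fields = queryable_fields(mappings)
```

Here the conflicting "status" field is left out, while "message" and "host" remain queryable.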
Configurable Elasticsearch scroll size
Scroll size for Dremio's requests to Elasticsearch can now be configured.
Improved date/time field pushdowns into Elasticsearch
Support for pushing down MAX/MIN aggregations on date/time columns into Elasticsearch.
High CPU utilization on master node startup due to reflection related tasks
The logic for handling reflections on master node startup has been updated to prevent excessive work on reflection-related tasks. This also avoids building the reflection materialization dependency graph multiple times.
If an incremental materialization fails, then previous unexpired materialization becomes unavailable as well
If a reflection had a valid, non-expired materialization and a refresh failed, Dremio treated the reflection as failed even though the non-expired materialization was still available. This is now fixed.
Incremental refresh issues detecting newly created files
Issue with detection of newly added files in certain cases when using incremental refresh for reflections. Updated logic ensures all newly added files are included.
If a reflection expires while the cluster is down it may not be rematerialized again once the cluster starts
When a cluster started up, Dremio only considered valid reflections for refresh, so a reflection that had already expired would never be refreshed. This logic is now updated to support out-of-date reflections.
Acceleration of join queries containing aggregations
In this release, join queries containing aggregations will not leverage aggregation reflections on either of the queried datasets by default. This behavior can be enabled using the support flag accelerator.enable_agg_join. In some cases, enabling this flag may cause sub-optimal query plans, resulting in longer query times.
Coordination and Metadata
Slow planning time and query startup time for some JDBC sources
Reduced the number of calls needed to retrieve metadata from some JDBC sources. This improves planning and overall query times.
ODBC client compatibility layer leaking memory in some cases
When an ODBC client does not consume the result set for a query completely and drops the connection, memory may be leaked as data in the send buffer on the server side is not released. The server now releases memory correctly.
Updates to S3 external bucket sources are not reflected in Dremio
When an S3 source was updated by adding or removing a bucket, the change was not shown in the browser, because data was not refreshed after the dataset was created even when the metadata refresh policy called for a refresh.
Fixes for behaviour of preview cache of query results
The preview of query results was cached in a way that did not reflect changes in the underlying data, and the cached data was keyed solely on the query rather than on both the query and the user.
Hostname changes are causing Dremio cluster configuration issues
Due to affinity settings in Dremio, changes to the hostname could cause issues when querying datasets uploaded to Dremio's distributed cache. Dremio now returns error messages that guide users to a solution.
Job gets stuck in running state after coordinator restart
On coordinator restart, any jobs that were in the running state would not have their states updated before shutdown, resulting in inaccessible job profiles for jobs that appeared to be stuck in the running state.
Slow count(*) performance
Optimized planning and execution of count(*) queries across all supported sources.
Slow query startup time when a portion of a query is running on newly added executor
Fixed issue with excessive metadata retrieval from data source to improve query startup time when a query is fully or partially running on a newly added executor node.
Schema learning does not take into account multiple files.
When querying multiple JSON files within nested directories on the filesystem, the schema was learned from only the first file read. This caused fields present in only some of the files to be missing.
Cannot promote directory of JSON files with changing schema in S3 into Dremio physical dataset
When working with complex JSON files with changing schemas, FLOAT and VARCHAR columns were causing problems when creating physical datasets. This behavior is now fixed.
Excessive memory allocation when reading JSON
JsonRecordReader uses an in-memory, map-like container to manage records read from the input stream. The container internally has multiple data structures to store data for different fields. The fix allocates target structures of exactly the size needed by the batch of records being processed.
Memory Allocation and Reallocation fixes in Arrow
We fixed multiple problems related to memory re-allocation and over-allocation in our in-memory data structures for handling complex JSON schemas.
CSV files generated for download are not valid in some cases
Dremio now generates CSV files that comply with the CSV RFC. In particular, this includes the use of CRLF as the line delimiter and proper quoting of commas, LFs and CRLFs.
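To make the quoting rules concrete, the sketch below writes a few rows in the style of RFC 4180 using Python's standard csv module: records end with CRLF, and fields containing commas or line breaks are wrapped in double quotes. The helper name and sample rows are made up, and this is not the code Dremio uses to generate downloads.

```python
import csv
import io

def to_rfc4180(rows):
    """Serialize rows with CRLF record endings and minimal quoting."""
    buf = io.StringIO(newline="")  # keep line endings exactly as written
    writer = csv.writer(buf, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerows(rows)
    return buf.getvalue()

# Sample rows: one field contains a comma, another contains a line feed.
text = to_rfc4180([
    ["id", "note"],
    [1, "a,b"],
    [2, "line1\nline2"],
])

# Reading the text back recovers the original field values as strings.
parsed = list(csv.reader(io.StringIO(text, newline="")))
```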
Elasticsearch mapping types with slashes in the name cause query failures
Slashes are now supported in Elasticsearch mapping types.
Browser-specific rendering fixes
These include rendering of checkboxes in Firefox, window resizing with table data in MS Edge, and the missing Remove Reflection button in Internet Explorer.
Fix issue with dataset listing under Acceleration Management
Admin > Acceleration can now list more than 50 datasets that have been accelerated or for which acceleration has been requested.
SQL editor fixes
Fixed issues with syntax highlighting of String literals and columns. Made the SQL editor remember the size the user last set it to. The default size is also now taller.