Parquet File Best Practices
This topic provides general information and recommendations for Parquet files.
Reading Parquet Files
As of Dremio version 3.1.3, Dremio supports off-heap memory buffers for reading Parquet files from Azure Data Lake Store (ADLS).
As of Dremio version 3.2, Dremio provides enhanced cloud Parquet readers. The Parquet file readers were redesigned to deliver multiple improvements, including increased parallelism on columnar data, reduced latencies, and more efficient resource and memory usage.
The enhanced reader also improves the performance of reflections. It is implemented for ADLS and AWS S3.
Take the following limitations into consideration when generating and configuring Parquet files. Files that exceed these limits may cause errors when used with Dremio.
- Maximum nesting depth is 16 levels. Multiple structs may be defined, up to a total nesting level of 16. Exceeding this limit results in a failed query.
- Maximum number of elements in an array is 128. An array with more than 128 elements results in a query failure.
- Maximum footer size is 16 MB. The footer consists of metadata: the version of the format, the schema, extra key-value pairs, and metadata for the columns in the file. A footer that exceeds this size results in a query failure.
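The footer limit can be checked before a file reaches Dremio. The following is a minimal sketch, not a Dremio utility. It relies on the Parquet file layout, in which a file ends with a 4-byte little-endian footer length followed by the `PAR1` magic bytes. The function names and the `MAX_FOOTER_BYTES` constant are hypothetical. The sketch also assumes that "16 MB" means 16 × 1024 × 1024 bytes and that the limit applies to the footer length recorded in the file trailer.

```python
import struct

PARQUET_MAGIC = b"PAR1"
# Assumption: "16 MB" is read as 16 MiB; adjust if your deployment counts differently.
MAX_FOOTER_BYTES = 16 * 1024 * 1024


def parquet_footer_size(path):
    """Return the footer metadata length, in bytes, recorded at the end of a Parquet file."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # move to the end of the file to learn its size
        file_size = f.tell()
        # Smallest possible layout: 4-byte header magic + 4-byte length + 4-byte trailer magic.
        if file_size < 12:
            raise ValueError("file is too small to be a Parquet file")
        f.seek(file_size - 8)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 trailer")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", tail[:4])
    return footer_len


def footer_within_limit(path, limit=MAX_FOOTER_BYTES):
    """Return True if the recorded footer length does not exceed the limit."""
    return parquet_footer_size(path) <= limit
```

Calling `footer_within_limit("some_file.parquet")` on each generated file, for example in a validation step after the writer finishes, flags oversized footers early. The check reads only the last 8 bytes of the file, so it stays cheap even for large files.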
When using other tools to generate Parquet files for consumption in Dremio, we recommend the following configuration:
Implement your row groups using the following:
Note: By default, Dremio uses 256 MB row groups for the Parquet files that it generates.
Implement your pages using the following:
| Statistics | Use a recent Parquet library to avoid bad statistics issues. |
| Dictionary Encoding | Do not use. By default, Dremio does not use dictionary encoding for the Parquet files that it generates. |