Parquet File Best Practices

This topic provides general information and recommendations for Parquet files.

Reading Parquet Files

As of Dremio version 3.1.3, Dremio supports off-heap memory buffers for reading Parquet files from Azure Data Lake Store (ADLS).

As of Dremio version 3.2, Dremio provides enhanced cloud Parquet readers. The Parquet file readers were redesigned to deliver multiple improvements, including increased parallelism on columnar data, reduced latencies, and more efficient resource and memory usage. Additionally, the enhanced readers improve the performance of reflections. They are implemented for ADLS and AWS S3.

When using other tools to generate Parquet files for consumption in Dremio, we recommend the following configuration:

Row Groups
  • Use a single row group per file.
  • Target 1 MB-25 MB column stripes for most datasets (ideally).
  Note: By default, Dremio uses 256 MB row groups for the Parquet files that it generates.

Pages
  • Use Snappy compression.
  • Target a page size of ~100 KB.

Statistics
  • Use a recent Parquet library to avoid issues with bad statistics.

Dictionary Encoding
  • Do not use. By default, Dremio does not use dictionary encoding for the Parquet files that it generates.
