Hello,
I am writing a program that compacts N parquet files (where N = 40). Each source parquet file is about ~6 to ~8 MB and is Zstd-compressed. The files are combined to produce a single larger parquet file (~220 to ~250 MB). It appears that we need as much as ~24 GB of memory for the compaction to succeed. This gist https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138 (export_with_datafusion) is the interesting bit: it lists all files in a directory, takes N of them, and compacts them. I tried hinting to the optimizer that the sources are already sorted, but it doesn't seem to help. The row group size is set to 1M rows.
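For context, here is a minimal sketch of what the compaction does (not the exact gist code; the input/output paths and the sort column `ts` are placeholders, and the exact `write_parquet` signature varies a bit across DataFusion versions):

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Scan all source files in the directory as one table; DataFusion
    // unions them into a single logical scan.
    let df = ctx
        .read_parquet("input_dir/", ParquetReadOptions::default())
        .await?
        // Placeholder sort key; the real program hints that the sources
        // are already sorted on this column.
        .sort(vec![col("ts").sort(true, false)])?;

    // Match the settings described above: Zstd output, 1M-row row groups.
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .set_max_row_group_size(1_000_000)
        .build();

    df.write_parquet(
        "output/compacted.parquet",
        DataFrameWriteOptions::new(),
        Some(props),
    )
    .await?;

    Ok(())
}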
With less memory (e.g. 12 or 16 GB), I run into the issue below:
Caused by:
Resources exhausted: Failed to allocate additional 2.0 MB for ExternalSorterMerge[4] with 49.8 MB already allocated for this reservation - 1826.2 KB remain available for the total pool
I am trying to understand why the spill is not happening efficiently (I am relatively new to DataFusion) and am looking for any help/hints to reduce memory utilization.
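For reference, these are the knobs I have been experimenting with (a sketch under assumptions, not a verified fix: as I understand it, FairSpillPool spreads the budget across spilling consumers, and sort_spill_reservation_bytes pre-reserves memory for the sort's merge phase, which is the ExternalSorterMerge reservation that fails above):

use std::sync::Arc;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::*;

fn build_context(memory_limit_bytes: usize) -> datafusion::error::Result<SessionContext> {
    // FairSpillPool divides the budget evenly among spillable consumers
    // (e.g. ExternalSorter); the GreedyMemoryPool that with_memory_limit
    // installs can let earlier consumers starve the final merge.
    let runtime = RuntimeEnv::new(
        RuntimeConfig::new()
            .with_memory_pool(Arc::new(FairSpillPool::new(memory_limit_bytes))),
    )?;

    let mut config = SessionConfig::new();
    // Pre-reserve more memory for each sort's merge phase (the default is 10 MB).
    config.options_mut().execution.sort_spill_reservation_bytes = 64 * 1024 * 1024;

    Ok(SessionContext::new_with_config_rt(config, Arc::new(runtime)))
}

If anyone can confirm whether raising that reservation actually lets the merge spill instead of erroring out, that would help.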