Hello,
I am writing a program that compacts N parquet files (where N = 40). Each source parquet file is about ~6 to ~8 MB and is Zstd-compressed. The files are combined to produce a single larger parquet file (~220 to ~250 MB). It appears that we need as much as ~24 GB of memory for the compaction to succeed. This gist https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138 (export_with_datafusion) is the interesting bit: it lists all files in a directory, takes N of them, and compacts them. I tried hinting to the optimizer that the sources are already sorted, but it doesn't seem to help. The row group size is set to 1M rows.
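For context, here is a minimal sketch of what the compaction does (not the exact gist code; the input/output paths and the sort column `ts` are placeholders, and the exact `write_parquet` signature varies a bit across DataFusion versions):

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Scan all source files in the directory as one table; DataFusion
    // unions them into a single logical scan.
    let df = ctx
        .read_parquet("input_dir/", ParquetReadOptions::default())
        .await?
        // Placeholder sort key; the real program hints that the sources
        // are already sorted on this column.
        .sort(vec![col("ts").sort(true, false)])?;

    // Match the settings described above: Zstd output, 1M-row row groups.
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .set_max_row_group_size(1_000_000)
        .build();

    df.write_parquet(
        "output/compacted.parquet",
        DataFrameWriteOptions::new(),
        Some(props),
    )
    .await?;

    Ok(())
}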
With less memory (e.g. 12 or 16 GB), I run into the issue below:
Caused by:
Resources exhausted: Failed to allocate additional 2.0 MB for ExternalSorterMerge[4] with 49.8 MB already allocated for this reservation - 1826.2 KB remain available for the total pool
I am trying to understand why the spill is not happening efficiently (I am relatively new to DataFusion) and am looking for any help/hints to reduce memory utilization.
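For reference, these are the knobs I have been experimenting with (a sketch under assumptions, not a verified fix: as I understand it, FairSpillPool spreads the budget across spilling consumers, and sort_spill_reservation_bytes pre-reserves memory for the sort's merge phase, which is the ExternalSorterMerge reservation that fails above):

use std::sync::Arc;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::*;

fn build_context(memory_limit_bytes: usize) -> datafusion::error::Result<SessionContext> {
    // FairSpillPool divides the budget evenly among spillable consumers
    // (e.g. ExternalSorter); the GreedyMemoryPool that with_memory_limit
    // installs can let earlier consumers starve the final merge.
    let runtime = RuntimeEnv::new(
        RuntimeConfig::new()
            .with_memory_pool(Arc::new(FairSpillPool::new(memory_limit_bytes))),
    )?;

    let mut config = SessionConfig::new();
    // Pre-reserve more memory for each sort's merge phase (the default is 10 MB).
    config.options_mut().execution.sort_spill_reservation_bytes = 64 * 1024 * 1024;

    Ok(SessionContext::new_with_config_rt(config, Arc::new(runtime)))
}

If anyone can confirm whether raising that reservation actually lets the merge spill instead of erroring out, that would help.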