Fix massive spill files for StringView/BinaryView columns II#21633
Open
adriangb wants to merge 9 commits intoapache:mainfrom
Open
Fix massive spill files for StringView/BinaryView columns II#21633adriangb wants to merge 9 commits intoapache:mainfrom
adriangb wants to merge 9 commits intoapache:mainfrom
Conversation
martin-g
reviewed
Apr 15, 2026
|
|
||
| [dependencies] | ||
| arrow = { workspace = true } | ||
| arrow-data = { workspace = true } |
Member
There was a problem hiding this comment.
This dependency seems unused.
The only occurrence of arrow_data is at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463 for a test helper.
| // on top of the sliced size for views buffer. This matches the intended semantics of | ||
| // "bytes needed if we materialized exactly this slice into fresh buffers". | ||
| // This is a workaround until https://github.com/apache/arrow-rs/issues/8230 | ||
| if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() { |
Member
There was a problem hiding this comment.
The same is needed for BinaryViewArray, no ?
martin-g
reviewed
Apr 15, 2026
martin-g
approved these changes
Apr 16, 2026
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files. Changes: - Add gc_view_arrays() function to apply GC on view arrays - Integrate GC into InProgressSpillFile::append_batch() - Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size) Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers. Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
- Replace row count heuristic with 10KB memory threshold - Improve documentation and add inline comments - Remove redundant test_exact_clickbench_issue_19414 - Maintains 96% reduction in spill file sizes
The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant. Changes: - Remove organize_stringview_arrays() call and function from sort.rs - Use batch.clone() for early return (cheaper than creating new batch) - Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant - Update comment in spill_manager.rs to reference gc_view_arrays()
Address review comments from PR apache#19444: - Replace row count heuristic with 10KB memory threshold - Add comprehensive documentation explaining GC rationale and mechanism - Use direct array parameter for better type safety - Maintain early return optimization for non-view arrays The GC now triggers based on actual buffer memory usage rather than row counts, providing more accurate and efficient garbage collection for sliced StringView/BinaryView arrays during spilling. Tests confirm 80%+ reduction in spill file sizes for pathological cases like ClickBench (820MB -> 33MB).
- Return post-GC sliced size from append_batch so callers use the correct post-GC size for memory accounting (fixes cetra3's CHANGES_REQUESTED: max_record_batch_size was measured pre-GC in sort.rs and spill_manager.rs) - Fix incorrect comment claiming Arrow gc() is a no-op; it always allocates new compact buffers - Add comment in should_gc_view_array explaining why we sum data_buffers directly instead of using get_buffer_memory_size() - Enhance append_batch doc comment with GC rationale per reviewer request - Reduce row counts in heavy GC tests
Address PR review: avoid duplicating data-buffer size calculation by deriving it from get_buffer_memory_size minus the views buffer.
bd7fa4c to
1530ba7
Compare
Contributor
Author
|
cc @alamb in case you want to re-review before merging. otherwise I plan to merge this in a day or so |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces #19444 which seems stuck.