Fix massive spill files for StringView/BinaryView columns II by adriangb · Pull Request #21633 · apache/datafusion

adriangb · 2026-04-14T22:35:57Z

Replaces #19444, which seems stuck.

martin-g · 2026-04-15T06:02:56Z


 [dependencies]
 arrow = { workspace = true }
+arrow-data = { workspace = true }


This dependency seems unused.
The only occurrence of arrow_data is in a test helper, at https://github.com/apache/datafusion/pull/21633/changes#diff-1f7d15c867929af294664ebbde4e8c9038186222cbb95ed86e527406cf066e84R463.

martin-g · 2026-04-15T06:16:48Z

            // on top of the sliced size for views buffer. This matches the intended semantics of
            // "bytes needed if we materialized exactly this slice into fresh buffers".
            // This is a workaround until https://github.com/apache/arrow-rs/issues/8230
            if let Some(sv) = array.as_any().downcast_ref::<StringViewArray>() {


The same is needed for BinaryViewArray, no?

Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers. Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
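To make the failure mode in this commit message concrete, here is a minimal sketch using a toy model rather than the arrow-rs types. `ToyViewArray` and its methods are hypothetical names introduced only for illustration; the model ignores details of the real layout such as inlined short strings and multiple data buffers. It only shows that a zero-copy slice keeps the whole backing buffer alive, while a GC pass copies just the bytes the remaining views point at.

```rust
use std::sync::Arc;

// Toy model of a view array, NOT the arrow-rs API: each view is an
// (offset, len) pair into one shared data buffer.
#[derive(Clone)]
struct ToyViewArray {
    data: Arc<Vec<u8>>,         // shared backing buffer
    views: Vec<(usize, usize)>, // (offset, len) per row
}

impl ToyViewArray {
    /// Build an array by appending every value to one data buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((data.len(), v.len()));
            data.extend_from_slice(v.as_bytes());
        }
        Self { data: Arc::new(data), views }
    }

    /// Zero-copy slice: only the views shrink; the buffer is still shared.
    fn slice(&self, start: usize, len: usize) -> Self {
        Self {
            data: Arc::clone(&self.data),
            views: self.views[start..start + len].to_vec(),
        }
    }

    /// Bytes a naive writer would emit: the entire referenced buffer.
    fn referenced_bytes(&self) -> usize {
        self.data.len()
    }

    /// Copy only the bytes the remaining views point at into a fresh buffer.
    fn gc(&self) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(off, len) in &self.views {
            views.push((data.len(), len));
            data.extend_from_slice(&self.data[off..off + len]);
        }
        Self { data: Arc::new(data), views }
    }

    /// Read row `i` back as a string slice.
    fn value(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.data[off..off + len]).expect("valid UTF-8")
    }
}
```

In this model, slicing one row out of a large array leaves `referenced_bytes()` unchanged, and calling `gc()` on the slice shrinks it to the length of that one row. This is the same shape as the 820MB versus 33MB gap reported above, although the real numbers depend on the workload.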

- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes

The SpillManager now handles GC for StringView/BinaryView arrays internally via gc_view_arrays(), making the organize_stringview_arrays() function in external sort redundant.

Changes:
- Remove organize_stringview_arrays() call and function from sort.rs
- Use batch.clone() for early return (cheaper than creating new batch)
- Use arrow_data::MAX_INLINE_VIEW_LEN constant instead of custom constant
- Update comment in spill_manager.rs to reference gc_view_arrays()

Address review comments from PR apache#19444:
- Replace row count heuristic with 10KB memory threshold
- Add comprehensive documentation explaining GC rationale and mechanism
- Use direct array parameter for better type safety
- Maintain early return optimization for non-view arrays

The GC now triggers based on actual buffer memory usage rather than row counts, providing more accurate and efficient garbage collection for sliced StringView/BinaryView arrays during spilling. Tests confirm 80%+ reduction in spill file sizes for pathological cases like ClickBench (820MB -> 33MB).

- Return post-GC sliced size from append_batch so callers use the correct post-GC size for memory accounting (fixes cetra3's CHANGES_REQUESTED: max_record_batch_size was measured pre-GC in sort.rs and spill_manager.rs)
- Fix incorrect comment claiming Arrow gc() is a no-op; it always allocates new compact buffers
- Add comment in should_gc_view_array explaining why we sum data_buffers directly instead of using get_buffer_memory_size()
- Enhance append_batch doc comment with GC rationale per reviewer request
- Reduce row counts in heavy GC tests
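The first item above (returning the post-GC size from append_batch) can be sketched roughly as follows. This is a toy model under stated assumptions: `ToyBatch`, `ToySpillFile`, and `spill_all` are hypothetical stand-ins, and the real append_batch signature in DataFusion is not shown in this conversation. The point is only that the caller tracks its maximum batch size from the value returned after compaction, not from a measurement taken before it.

```rust
/// Hypothetical stand-in for a record batch: bytes the rows actually need,
/// plus extra bytes still referenced through shared buffers.
struct ToyBatch {
    live_bytes: usize,
    garbage_bytes: usize,
}

impl ToyBatch {
    /// Size as seen by memory accounting.
    fn size(&self) -> usize {
        self.live_bytes + self.garbage_bytes
    }

    /// Compaction drops everything the rows no longer reference.
    fn gc(self) -> ToyBatch {
        ToyBatch { live_bytes: self.live_bytes, garbage_bytes: 0 }
    }
}

/// Hypothetical stand-in for an in-progress spill file.
#[derive(Default)]
struct ToySpillFile {
    bytes_written: usize,
}

impl ToySpillFile {
    /// Compacts the batch, "writes" it, and returns the post-GC size so the
    /// caller does not need to measure the batch itself.
    fn append_batch(&mut self, batch: ToyBatch) -> usize {
        let compact = batch.gc();
        let size = compact.size();
        self.bytes_written += size;
        size
    }
}

/// Caller-side accounting: the maximum is taken over post-GC sizes.
fn spill_all(file: &mut ToySpillFile, batches: Vec<ToyBatch>) -> usize {
    let mut max_record_batch_size: usize = 0;
    for batch in batches {
        let post_gc = file.append_batch(batch);
        max_record_batch_size = max_record_batch_size.max(post_gc);
    }
    max_record_batch_size
}
```

With a pre-GC measurement, a batch holding 100 live bytes and 900 garbage bytes would set the maximum to 1000 even though only 100 bytes reach the file; using the returned value keeps the two numbers consistent.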

Address PR review: avoid duplicating data-buffer size calculation by deriving it from get_buffer_memory_size minus the views buffer.
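As a rough illustration of the calculation described above, the sketch below derives the data-buffer size by subtracting the views buffer from a total and compares it against the 10KB threshold. The function names, the `>=` comparison, and the exact constant are assumptions; only the 16-byte view width comes from the Arrow view layout. The real get_buffer_memory_size may report allocated capacity rather than exact lengths, so the subtraction saturates at zero in this sketch.

```rust
/// Each view is a fixed 16-byte entry in the views buffer (Arrow view layout).
const VIEW_SIZE_BYTES: usize = 16;

/// The 10KB threshold named in the commit messages; the exact constant and
/// comparison used by the PR are assumed here.
const GC_THRESHOLD_BYTES: usize = 10 * 1024;

/// Data-buffer bytes = total buffer bytes minus the views buffer.
/// Saturates at zero in case the total is smaller than expected.
fn data_buffer_bytes(total_buffer_bytes: usize, num_views: usize) -> usize {
    total_buffer_bytes.saturating_sub(num_views * VIEW_SIZE_BYTES)
}

/// Hypothetical heuristic: compact only when the data buffers are large
/// enough for the copy to be worth it.
fn should_gc(total_buffer_bytes: usize, num_views: usize) -> bool {
    data_buffer_bytes(total_buffer_bytes, num_views) >= GC_THRESHOLD_BYTES
}
```

Deriving the figure from one total avoids summing the data buffers a second time, which is the duplication the review comment asked to remove.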

adriangb · 2026-04-16T14:14:35Z

cc @alamb in case you want to re-review before merging.

Otherwise I plan to merge this in a day or so.

github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 14, 2026

adriangb mentioned this pull request Apr 14, 2026

Fix massive spill files for StringView/BinaryView columns #19444

Open

adriangb changed the title ~~fix: gc StringView/BinaryView arrays before spilling to prevent write amplification~~ Fix massive spill files for StringView/BinaryView columns rev2 Apr 14, 2026

adriangb changed the title ~~Fix massive spill files for StringView/BinaryView columns rev2~~ Fix massive spill files for StringView/BinaryView columns II Apr 14, 2026

adriangb requested a review from alamb April 14, 2026 22:40

martin-g reviewed Apr 15, 2026


Comment thread datafusion/physical-plan/src/spill/mod.rs Outdated

martin-g approved these changes Apr 16, 2026


EeshanBembi and others added 9 commits April 16, 2026 08:54

Address PR review feedback for StringView/BinaryView GC

cd8ecda


Apply cargo fmt

06b0df1

fix: remove unused import Array to fix clippy

7fd21a5

Use get_buffer_memory_size in should_gc_view_array

92a286b


Address review feedback on spill view GC

1530ba7

adriangb force-pushed the fix-stringview-spill-gc-2 branch from bd7fa4c to 1530ba7 Compare April 16, 2026 13:54

adriangb wants to merge 9 commits into apache:main from pydantic:fix-stringview-spill-gc-2
