[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

ding-young · 2025-05-16T14:37:55Z

Is your feature request related to a problem or challenge?

To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.

This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.

Describe the solution you'd like

1. Stabilize Larger-Than-Memory Queries

User Experience & Testing

enable TrackConsumersPool by default in datafusion-cli
improve err msg to guide how to adjust parameters
Enable sort query fuzzing with limited memory #15517
migrate tests to insta

Sort

Aggregate

Integrate ExternalSorter

Join

Memory limited nest loop join #15760

2. Optimize Spill File Format

TBD

Describe alternatives you've considered

While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.

Additional context

Related work:

The text was updated successfully, but these errors were encountered:

2010YOUY01 · 2025-05-17T08:52:25Z

Welcome aboard! We're excited to collaborate with you for this GSoC project 😄

Regarding the plan, I can see the following sub-tasks:

Stabilize external sort and aggregate.
Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.
Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
(And also improve UX/performance along the way)

I plan to open separate issues for each sub-task to better describe the problems and outline the approaches.

Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internal, are there any stream reader/writer–related tasks we could also consider? @alamb

alamb · 2025-05-19T17:24:05Z

Stabilize external sort and aggregate.

In my opinion, I suggest starting and finishing with external sort -- having a robust and performance external sort can be a key building block for other algorithms (like sort-merge-join and potentially aggregation)

So in my mind "robust and performant" means:

It is possible to sort any sized data, subject to temp space requirements
The time to sort data once spilling continues to grow as O(N*ln(N)) as the size of the data increases

I would suggest holding off on Aggregation until @Rachelint has completed designing the blocked approach to memory management (e.g. #15591) as I think that blocked approach will be directly related to spilling (aka spilling individual blocks, etc)

Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.

Instead of a new join algorithm, I suggest you look into the existing MergeJoin that @comphead and others have invested much effort in optimizing.

The basic idea is that for any buffering join (like NLJ or HJ) if memory is exhausted, switch at runtime to Sort-Merge-Join (that is sort the in memory buffer, sort the remaining other input, and then use sort merge join to merge)

While this will not be as fast as reusing the hash buffers, I think it will be far easier to implement (especially after you have a robust implementation of spilling sort)

Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
(And also improve UX/performance along the way)
Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internal, are there any stream reader/writer–related tasks we could also consider? @alamb

Some other potential things to consider if we don't already is:

permit using compression
investigate writing/reading runs with mmap to avoid copying the data again

ding-young added the enhancement New feature or request label May 16, 2025

ding-young mentioned this issue May 18, 2025

Set TrackConsumersPool as default in datafusion-cli #16081

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

ding-young commented May 16, 2025 •

edited

Loading

2010YOUY01 commented May 17, 2025 •

edited

Loading

alamb commented May 19, 2025

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

Comments

ding-young commented May 16, 2025 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

1. Stabilize Larger-Than-Memory Queries

User Experience & Testing

Sort

Aggregate

Join

2. Optimize Spill File Format

Describe alternatives you've considered

Additional context

2010YOUY01 commented May 17, 2025 • edited Loading

alamb commented May 19, 2025

ding-young commented May 16, 2025 •

edited

Loading

2010YOUY01 commented May 17, 2025 •

edited

Loading