[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065
Comments
Welcome aboard! We're excited to collaborate with you for this GSoC project 😄 Regarding the plan, I can see the following sub-tasks:
I plan to open separate issues for each sub-task to better describe the problems and outline the approaches. Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internals; are there any stream reader/writer-related tasks we could also consider? @alamb
I suggest starting and finishing with external sort. A robust and performant external sort can be a key building block for other algorithms (like sort-merge join and potentially aggregation). So in my mind "robust and performant" means:
I would suggest holding off on Aggregation until @Rachelint has completed designing the blocked approach to memory management (e.g. #15591), as I think that blocked approach will be directly related to spilling (i.e. spilling individual blocks, etc.).
Instead of a new join algorithm, I suggest you look into the existing MergeJoin that @comphead and others have invested much effort in optimizing. The basic idea is that for any buffering join (like NLJ or HJ), if memory is exhausted, switch at runtime to sort-merge join: sort the in-memory buffer, sort the remaining other input, and then use sort-merge join to merge them. While this will not be as fast as reusing the hash buffers, I think it will be far easier to implement (especially after you have a robust implementation of spilling sort).
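To make the fallback idea concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion code: `Row`, `sort_merge_join`, and `join_with_fallback` are hypothetical names, both inputs are plain in-memory vectors, and the "sort the remaining input" step would in practice be the spilling sort rather than an in-memory `sort_by_key`.

```rust
use std::cmp::Ordering;

// Hypothetical row type for illustration: (join key, payload).
type Row = (i64, &'static str);

/// Inner equi-join of two inputs that are ALREADY sorted by key.
fn sort_merge_join(left: &[Row], right: &[Row]) -> Vec<(i64, &'static str, &'static str)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(&right[j].0) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left[i].0;
                // Find the run of rows sharing this key on each side.
                let i_end = i + left[i..].iter().take_while(|r| r.0 == key).count();
                let j_end = j + right[j..].iter().take_while(|r| r.0 == key).count();
                // Emit the cross product of the two runs.
                for l in &left[i..i_end] {
                    for r in &right[j..j_end] {
                        out.push((key, l.1, r.1));
                    }
                }
                i = i_end;
                j = j_end;
            }
        }
    }
    out
}

/// Hypothetical fallback path: called once the buffering join has run out of
/// memory. Sorts what was buffered so far and the remaining input by key,
/// then merges. A real implementation would use an external (spilling) sort.
fn join_with_fallback(
    mut buffered: Vec<Row>,
    mut remaining: Vec<Row>,
) -> Vec<(i64, &'static str, &'static str)> {
    buffered.sort_by_key(|r| r.0);
    remaining.sort_by_key(|r| r.0);
    sort_merge_join(&buffered, &remaining)
}
```

The merge step only needs a cursor into each sorted input plus the current run of equal keys, which is why it works well when both inputs are read back from sorted spill files.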
Some other things to consider, if we don't do them already:
Is your feature request related to a problem or challenge?
To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.
This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.
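For readers new to the topic, the basic shape of a spilling (external) sort can be sketched as follows. This is an illustration using only the Rust standard library, not DataFusion's `SortExec`: the function names, the text-based run files, and the row-count limit `max_rows_in_memory` (standing in for a byte-based memory limit) are all assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Sort the buffered values, write them to a run file, and clear the buffer.
fn spill_run(buffer: &mut Vec<i64>, dir: &Path, id: usize) -> io::Result<PathBuf> {
    buffer.sort_unstable();
    // File name is an assumption for this sketch; concurrent callers in the
    // same process and directory would need a better naming scheme.
    let path = dir.join(format!("spill-sketch-{}-{}.txt", std::process::id(), id));
    let mut writer = BufWriter::new(File::create(&path)?);
    for v in buffer.iter() {
        writeln!(writer, "{v}")?;
    }
    writer.flush()?;
    buffer.clear();
    Ok(path)
}

fn parse_value(line: &str) -> io::Result<i64> {
    line.parse::<i64>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Sort `input` while buffering at most `max_rows_in_memory` values.
/// Sorted values are passed to `emit`; returns the number of spilled runs.
fn external_sort(
    input: impl Iterator<Item = i64>,
    max_rows_in_memory: usize,
    spill_dir: &Path,
    mut emit: impl FnMut(i64),
) -> io::Result<usize> {
    assert!(max_rows_in_memory > 0);
    let mut runs: Vec<PathBuf> = Vec::new();
    let mut buffer: Vec<i64> = Vec::with_capacity(max_rows_in_memory);

    // Phase 1: build sorted runs, spilling whenever the buffer is full.
    for value in input {
        buffer.push(value);
        if buffer.len() == max_rows_in_memory {
            let id = runs.len();
            runs.push(spill_run(&mut buffer, spill_dir, id)?);
        }
    }
    if !buffer.is_empty() {
        let id = runs.len();
        runs.push(spill_run(&mut buffer, spill_dir, id)?);
    }

    // Phase 2: k-way merge of the runs using a min-heap of (value, run index).
    let mut readers = Vec::new();
    for path in &runs {
        readers.push(BufReader::new(File::open(path)?).lines());
    }
    let mut heap = BinaryHeap::new();
    for (idx, reader) in readers.iter_mut().enumerate() {
        if let Some(line) = reader.next() {
            heap.push(Reverse((parse_value(&line?)?, idx)));
        }
    }
    while let Some(Reverse((value, idx))) = heap.pop() {
        emit(value);
        // Refill the heap from the run that just produced a value.
        if let Some(line) = readers[idx].next() {
            heap.push(Reverse((parse_value(&line?)?, idx)));
        }
    }

    // Clean up the spill files.
    for path in &runs {
        fs::remove_file(path)?;
    }
    Ok(runs.len())
}
```

During the merge phase only one value per run is held in memory, so peak memory is bounded by the buffer size in phase 1 and by the number of runs in phase 2. A production implementation would additionally need a compact spill format and a multi-pass merge when there are too many runs to open at once.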
Describe the solution you'd like
1. Stabilize Larger-Than-Memory Queries
User Experience & Testing
TrackConsumersPool by default in datafusion-cli
insta
Sort
SortExec #16042
Aggregate
Join
2. Optimize Spill File Format
TBD
Describe alternatives you've considered
While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.
Additional context
Related work: