Skip to content

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
9 tasks
ding-young opened this issue May 16, 2025 · 2 comments
Open
9 tasks

[Epic]: Google Summer of Code 2025 Improving Spilling Execution #16065

ding-young opened this issue May 16, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@ding-young
Copy link
Contributor

ding-young commented May 16, 2025

Is your feature request related to a problem or challenge?

To support queries that exceed available memory, DataFusion must spill intermediate results to disk. As a continuation of the community effort on external query execution, this epic aims to improve the robustness of spilling execution and explore further performance optimizations.

This includes tracking which queries fail under specific memory limits, fixing bugs in external query execution, and addressing inefficiencies in the current implementation. An additional goal is to explore the feasibility of applying experimental optimizations proposed in academic papers, such as adaptive compression.

Describe the solution you'd like

1. Stabilize Larger-Than-Memory Queries

User Experience & Testing

Sort

Aggregate

  • Integrate ExternalSorter

Join

2. Optimize Spill File Format

TBD

Describe alternatives you've considered

While spilling for window functions and CTEs is currently not a focus, they remain potential areas for improvement.

Additional context

Related work:

@ding-young ding-young added the enhancement New feature or request label May 16, 2025
@2010YOUY01
Copy link
Contributor

2010YOUY01 commented May 17, 2025

Welcome aboard! We're excited to collaborate with you for this GSoC project 😄

Regarding the plan, I can see the following sub-tasks:

  1. Stabilize external sort and aggregate.
  2. Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.
  3. Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
    (And also improve UX/performance along the way)

I plan to open separate issues for each sub-task to better describe the problems and outline the approaches.

Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internal, are there any stream reader/writer–related tasks we could also consider? @alamb

@alamb
Copy link
Contributor

alamb commented May 19, 2025

Stabilize external sort and aggregate.

In my opinion, I suggest starting and finishing with external sort -- having a robust and performance external sort can be a key building block for other algorithms (like sort-merge-join and potentially aggregation)

So in my mind "robust and performant" means:

  1. It is possible to sort any sized data, subject to temp space requirements
  2. The time to sort data once spilling continues to grow as O(N*ln(N)) as the size of the data increases

I would suggest holding off on Aggregation until @Rachelint has completed designing the blocked approach to memory management (e.g. #15591) as I think that blocked approach will be directly related to spilling (aka spilling individual blocks, etc)

Implement a memory-limited nested loop join. Some non-equality joins may only be supported by NLJ.

Instead of a new join algorithm, I suggest you look into the existing MergeJoin that @comphead and others have invested much effort in optimizing.

The basic idea is that for any buffering join (like NLJ or HJ) if memory is exhausted, switch at runtime to Sort-Merge-Join (that is sort the in memory buffer, sort the remaining other input, and then use sort merge join to merge)

While this will not be as fast as reusing the hash buffers, I think it will be far easier to implement (especially after you have a robust implementation of spilling sort)

Optimize the spill format, likely building on top of Arrow's IPC stream reader/writer.
(And also improve UX/performance along the way)
Are there any other tasks worth exploring? I'm not very familiar with Arrow IPC internal, are there any stream reader/writer–related tasks we could also consider? @alamb

Some other potential things to consider if we don't already is:

  1. permit using compression
  2. investigate writing/reading runs with mmap to avoid copying the data again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants