-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: add multi level merge for sorting #15608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add multi level merge for sorting #15608
Conversation
// TODO - add a check to see the actual number of available blocking threads | ||
let max_blocking_threads = max_blocking_threads.unwrap_or(128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently tokio
don't expose the value of maximum number of blocking threads, and you have no way to know if you reached the limit as spawn_blocking
doesn't return an indication that the current pool is full.
// Only in-memory stream | ||
(0, 1) => Ok(sorted_streams.into_iter().next().unwrap()), | ||
|
||
// Only single sorted spill file so stream it | ||
(1, 0) => self | ||
.spill_manager | ||
.read_spill_as_stream(self.sorted_spill_files.drain(..).next().unwrap()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add metrics in these cases
pub fn with_spill_manager(mut self, spill_manager: SpillManager) -> Self { | ||
self.spill_manager = Some(spill_manager); | ||
self | ||
} | ||
|
||
pub fn with_sorted_spill_files( | ||
mut self, | ||
sorted_spill_files: Vec<RefCountedTempFile>, | ||
) -> Self { | ||
self.sorted_spill_files = sorted_spill_files; | ||
self | ||
} | ||
|
||
pub fn with_max_blocking_threads(mut self, max_blocking_threads: usize) -> Self { | ||
self.max_blocking_threads = Some(max_blocking_threads); | ||
self | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a better API than this, this is confusing as they are dependent
return Ok(Box::pin(MultiLevelSortPreservingMergeStream::new( | ||
spill_manager.unwrap(), | ||
schema.clone().unwrap(), | ||
sorted_spill_files, | ||
streams, | ||
expressions.clone(), | ||
metrics.clone().unwrap(), | ||
batch_size.unwrap(), | ||
reservation.unwrap(), | ||
max_blocking_threads, | ||
fetch, | ||
enable_round_robin_tie_breaker, | ||
)?)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove all the unwrap. first want to know what do you think about the solution
I think this is a similar problem as #14692, will check this out soon |
I saw the pr from @ashdnazg and it is a better solution in my opinion as it does not need to know the number of available tokio tasks, and just make the read spill work and although multi level merge sort should be implemented, @2010YOUY01 implementation (#15610) with adaption for
datafusion/datafusion/physical-plan/src/aggregates/row_hash.rs Lines 1053 to 1085 in 784df33
|
Which issue does this PR close?
Rationale for this change
To be able to sort any amount spill files without getting over the tokio blocking thread limit
What changes are included in this PR?
Currently the PR is in draft as I want your opinions about the design.
SortPreservingMergeStream
StreamingMergeBuilder
to use the new stream when having spill files and spill managerAre these changes tested?
Existing tests
Are there any user-facing changes?
Yes, kinda