Vendor a fork of async_executor to allow for deeper integration with bevy_tasks #20331
Conversation
You added a new feature but didn't update the readme. Please run
    F: Future + 'a,
    F::Output: 'a,
{
    // SAFETY: Original implementation missing safety documentation
What does this safety comment mean?
This was copied over from the other edge_executor spawn_* variants; it was otherwise missing a safety justification.
As I'm starting my review of this, I want to leave some broad observations.
I think I'll be focusing on the vendored code going forward. The other changes all look pretty reasonable.
NthTensor
left a comment
This is a lot of work, and is a substantial improvement to both bevy_tasks and imo the original async_executor code. The safety comments all look correct to me, and I can't see any issues.
The original async_executor leaves something to be desired in terms of documentation. I'd prefer if we documented all the types and functions in the vendored module, but I won't block on that.
Uh, for some reason my review comments don't seem to be loading. Not sure why that is. Anyone else seeing them?
I only see one comment on the code.
Looks like GitHub ate them. I can still see them in codespaces, so I guess I'll have to copy them over? Kind of annoying. Give me a sec.
// `Waker`.
let (runnable, task) = unsafe {
    Builder::new()
        .propagate_panic(true)
For some reason I thought this was the default behavior, but I looked it up and I was wrong.
This is correct, and I'll have to adjust forte to match.
fn steal<T>(src: &ConcurrentQueue<T>, dest: &ConcurrentQueue<T>) {
    // Half of `src`'s length rounded up.
    let mut count = (src.len() + 1) / 2;

    if count > 0 {
        if let Some(capacity) = dest.capacity() {
            // Don't steal more than fits into the queue.
            count = count.min(capacity - dest.len());
        }

        // Steal tasks.
        for _ in 0..count {
            let Ok(val) = src.pop() else { break };
            assert!(dest.push(val).is_ok());
        }
    }
}
We may want to consider using a queue built for work-stealing, like crossbeam_deque or st3.
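For reference, this is roughly what the crossbeam_deque version would look like (adapted from that crate's documentation example; a sketch only, not code from this PR — `find_task` and its parameters are illustrative names):

```rust
use core::iter;
use crossbeam_deque::{Injector, Steal, Stealer, Worker};

fn find_task<T>(local: &Worker<T>, global: &Injector<T>, stealers: &[Stealer<T>]) -> Option<T> {
    // Pop a task from the local queue, if not empty.
    local.pop().or_else(|| {
        // Otherwise, look for a task elsewhere.
        iter::repeat_with(|| {
            // Try stealing a batch of tasks from the global (injector) queue...
            global
                .steal_batch_and_pop(local)
                // ...or try stealing a task from one of the other workers.
                .or_else(|| stealers.iter().map(|s| s.steal()).collect())
        })
        // Loop while no task was stolen and some steal operation needs to be retried.
        .find(|s| !s.is_retry())
        // Extract the stolen task, if there is one.
        .and_then(|s| s.success())
    })
}
```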
I agree we should try testing them out, but I do think this should be reserved for a later PR.
Agreed.
// Try the local queue.
if let Ok(r) = self.local_state.stealable_queue.pop() {
    return Some(r);
}

// Try stealing from the global queue.
if let Ok(r) = self.state.queue.pop() {
    steal(&self.state.queue, &self.local_state.stealable_queue);
    return Some(r);
}

// Try stealing from other runners.
if let Ok(stealer_queues) = self.state.stealer_queues.try_read() {
    // Pick a random starting point in the iterator list and rotate the list.
    let n = stealer_queues.len();
    let start = _rng.usize(..n);
    let iter = stealer_queues
        .iter()
        .chain(stealer_queues.iter())
        .skip(start)
        .take(n);

    // Remove this runner's local queue.
    let iter =
        iter.filter(|local| !core::ptr::eq(**local, &self.local_state.stealable_queue));

    // Try stealing from each local queue in the list.
    for local in iter {
        steal(*local, &self.local_state.stealable_queue);
        if let Ok(r) = self.local_state.stealable_queue.pop() {
            return Some(r);
        }
    }
Nit, since this is from the original implementation: I somewhat disagree with the ordering here. I'm pretty sure rayon takes from the local queue, then from peers, then from the global queue, whereas this is local-global-peers. The rationale is that it's better to finish your current work (stuff already queued on workers) before you start something new (by pulling from the injector queue).
Can we run some benchmarks with alternative orderings at some point?
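For illustration, the rayon-style ordering (local, then peers, then global) could look something like the sketch below. This is not the PR's code; `next_task` and its parameters are made up, and it assumes the `concurrent-queue` and `fastrand` crates the vendored executor appears to use:

```rust
use concurrent_queue::ConcurrentQueue;

fn next_task<T>(
    local: &ConcurrentQueue<T>,
    peers: &[&ConcurrentQueue<T>],
    global: &ConcurrentQueue<T>,
    rng: &mut fastrand::Rng,
) -> Option<T> {
    // 1. Drain work we already own.
    if let Ok(task) = local.pop() {
        return Some(task);
    }

    // 2. Steal from a random peer before touching the injector queue.
    if !peers.is_empty() {
        let start = rng.usize(..peers.len());
        for peer in peers.iter().cycle().skip(start).take(peers.len()) {
            if let Ok(task) = peer.pop() {
                return Some(task);
            }
        }
    }

    // 3. Only then pull new work from the global (injector) queue.
    global.pop().ok()
}
```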
A set of identical comments has been made on #20649, maybe that's where they went?
Ah shoot, you're right.
hymm
left a comment
Looks pretty good. I have some questions about some of the code I don't quite understand and still want to do some manual testing before I approve.
crates/bevy_app/src/app.rs (Outdated)
    #[cfg(not(all(target_arch = "wasm32", feature = "web")))]
    bevy_tasks::tick_global_task_pools_on_main_thread();
}
while app.plugins_state() == PluginsState::Adding {}
this should probably have https://doc.rust-lang.org/std/hint/fn.spin_loop.html
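Something like this minimal sketch, where the hypothetical `is_ready` closure stands in for the `plugins_state()` check:

```rust
fn wait_until(is_ready: impl Fn() -> bool) {
    while !is_ready() {
        // Hint to the CPU that this is a busy-wait so it can back off
        // (reduce power / yield pipeline resources).
        core::hint::spin_loop();
    }
}
```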
Hmm, might need to revert the tick_global_task_pools_on_main_thread changes; some of the examples are failing. May need to implement a block_on based on Executor::run here.
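A sketch of what that could look like, using upstream async_executor's `run` plus `futures_lite::future::block_on` (assumptions only; the vendored fork's API may differ):

```rust
use async_executor::Executor;
use futures_lite::future;

fn block_on_and_tick<T>(ex: &Executor<'_>, fut: impl core::future::Future<Output = T>) -> T {
    // `ex.run(fut)` keeps polling the executor's queues until `fut` completes,
    // so blocking on it drives other tasks instead of just spinning.
    future::block_on(ex.run(fut))
}
```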
I investigated why we needed the tick_global_task_pools_on_main_thread in the runners. It started in #8336 and was added to run the IoTask for initializing the renderer. This works in native single-threaded builds because we run the tasks inline in spawn. On web it works because we detach the task and poll for the RenderDevice in finish. So the one case I can think of where this might matter is if we're running multithreaded but don't spawn any worker threads. I'd be a bit surprised if that's what's happening in CI, since I thought we spawned at least one thread in each pool by default.
    #[cfg(not(all(target_arch = "wasm32", feature = "web")))]
    bevy_tasks::tick_global_task_pools_on_main_thread();
}
while app.plugins_state() == PluginsState::Adding {}
this should use hint::spin_loop
    #[cfg(not(all(target_arch = "wasm32", feature = "web")))]
    bevy_tasks::tick_global_task_pools_on_main_thread();
}
while app.plugins_state() == PluginsState::Adding {}
will this be able to make progress on spawned tasks in single-threaded mode? I thought that was why tick_global_task_pools_on_main_thread was added originally.
/// See [`Self::scope`] for more details in general about how scopes work.
pub fn scope_with_executor<'env, F, T>(
    &self,
    tick_task_pool_executor: bool,
This option was originally added to prevent main thread systems from blocking the render thread scope from finishing, and vice versa (when false). Is removing this going to be a regression?
I think the ideal would be that the thread that the scope runs on would only run the tasks spawned on the scope, but not sure how feasible that would be. (Alternatively we could move to using async scope and systems to allow finishing the scope on any thread.)
This may very well be a regression there. The run call below would add the main/render thread to the list of executor threads without any restriction on what can run on them. Overall though, I don't think it's a great idea to block those threads from running ComputeTaskPool tasks: constrained systems with fewer cores (4 or fewer) would likely see underutilization of their hardware. Your suggestion of moving to async schedule runners to keep the owning thread from hard-blocking might be a better approach. This may require "merging" adjacent schedules though.
I think the ideal would be that the thread that the scope runs on would only run the tasks spawned on the scope, but not sure how feasible that would be. (Alternatively we could move to using async scope and systems to allow finishing the scope on any thread.)
With this PR, there's a softer prioritization. Tasks will be scheduled locally to the current thread, if possible, so assuming no other thread steals those tasks, the scope will be the one to execute them.
If we merge this, we should make an issue to look into fixing this, as it will increase frame variance.
hymm
left a comment
This is the histogram for the First schedule. Yellow is this PR. The tail of bad yellow times is due to the main thread running render schedule systems.
This problem extends into the main schedule, but we don't see it at the frame level since we're render-bound.
The overall execution time hasn't changed much, but the distribution is tighter.
When the timing is broken down further, we do see improvements in schedule execution if we ignore the long-tail problems, so I think this is worth merging. If we made an async scope and a Schedule::run_async, the wins would be clearer (we might also need async systems to run schedules). That work is blocked on removing nonsend resources.
Approving, but the hang on CI needs to be fixed.
I think this is the right direction: we need more control over this area of the code base if we're going to reduce system / task overhead, and even these initial improvements are quite nice.
I intend to merge this as we have a consensus of experts, but I'll ping maintainers and check the temperature first.



Objective
Improve the performance of Bevy's async task executor and clean up the code surrounding `TaskPool::scope`. Fixes #20421.
Solution
- Vendor `async_executor`'s code.
- Merge `LocalExecutor` and `ThreadExecutor` into the core `Executor` implementation, and poll their queues in `Executor::run`.
- Use `!Send` types for local queues (`Mutex` -> `UnsafeCell`, `ConcurrentQueue` -> `VecDeque`).
- Avoid `Arc` clones by using `&'static` references to thread-local queues.

This is a breaking change:
- `ThreadExecutor` is now `ThreadSpawner` and forbids ticking and running the executor, only allowing spawning tasks onto the target thread.
- The `Executor` and `LocalExecutor` wrappers are no longer public.
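For a rough picture of the `!Send` local queue idea, here is a minimal sketch (illustrative only; `LocalQueue` and its layout are made up, not the PR's actual types):

```rust
use core::cell::UnsafeCell;
use core::marker::PhantomData;
use std::collections::VecDeque;

/// A queue that is pinned to one thread, so it needs no synchronization.
struct LocalQueue<T> {
    tasks: UnsafeCell<VecDeque<T>>,
    // Raw-pointer marker opts the type out of `Send`/`Sync`, encoding the
    // "never leaves its thread" invariant in the type system.
    _not_send: PhantomData<*const ()>,
}

impl<T> LocalQueue<T> {
    fn new() -> Self {
        Self {
            tasks: UnsafeCell::new(VecDeque::new()),
            _not_send: PhantomData,
        }
    }

    fn push(&self, task: T) {
        // SAFETY: `Self` is neither `Send` nor `Sync`, so `&self` is only ever
        // used from the owning thread, and no inner reference escapes.
        unsafe { (*self.tasks.get()).push_back(task) }
    }

    fn pop(&self) -> Option<T> {
        // SAFETY: as in `push`.
        unsafe { (*self.tasks.get()).pop_front() }
    }
}
```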
Testing
I've only tested this against a few examples: 3d_scene and many_foxes.
This current implementation does seem to break animation systems. EDIT: That seems to be #20383.
Performance Testing
All of the benchmarks, including a multithreaded schedule run, are included below. Overall, this seems to be a significant win whenever there is a very large number of tasks (i.e. the `empty_archetypes/par_for_each` benchmarks), with some being upwards of 2x faster due to lower contention.
Future Work
- Implement the `StaticExecutor` optimization for the static usages of the TaskPools. Would likely need a new implementation of #12990. Done in "Optimize TaskPools for use in static variables" (#20649).
- `rayon::spawn_broadcast`-style APIs for more aggressively waking up threads than what the current implementation allows.
- Replace `ThreadSpawner` with a more integrated `Executor::spawn_to_thread(ThreadId, Fut)` API instead, and use the ThreadIds tracked by `NonSend` resources to schedule those systems.
- Support for spawning plain functions (`FnOnce`) for lower-overhead execution; would avoid some extra atomics and state management.
- A `block_on` implementation that ticks the local and thread-locked queues.