
Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617

@maleadt

A single array may be used concurrently on different devices (when it's backed by unified memory), or just on different streams, in which case you don't want to synchronize the streams involved. For example (pseudocode):

a = cu(rand(N, 2))

@async begin
  @cuda kernel(a[:, 1])   # executes on this task's local stream
end

@async begin
  @cuda kernel(a[:, 2])   # executes on another task's local stream
end

Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream:

CUDA.jl/src/memory.jl

Lines 565 to 569 in a4a9166

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

This was identified in #2615, but note that the problem doesn't necessarily involve multiple GPUs: it would manifest just as well when attempting to overlap kernel execution on a single device.


It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:

  1. it would re-introduce the IMO surprising and hard-to-explain behavior of having to call synchronize() explicitly on each exit path of an @async block, just to make it possible to read the data validly outside of it;
  2. we cannot easily identify when the synchronization happens between different tasks, unless we also track which tasks operate on an array, which doesn't seem straightforward.

The first point is crucial to me: I don't want to have to tell users that they basically can't use CuArrays safely in an @async block, without also having to explain the asynchronous nature of GPU computing.
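
To make that concrete, here's a sketch of what user code would have to look like if we never synchronized automatically (reusing a and kernel from the example above):

t = @async begin
    @cuda kernel(a[:, 1])
    synchronize()   # required on every exit path; without it, the main task
                    # could observe a before the kernel has finished
end
wait(t)
Array(a)   # safe only because the task synchronized its stream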

To illustrate the second point:

device!(0)
a = cu(rand(N, 2))
@cuda kernel(a[:, 1])
device!(1)
@cuda kernel(a[:, 2])   # same task: synchronizing here is expected

# is detected the same as

device!(0)
a = cu(rand(N, 2))
@async begin
  device!(0)
  @cuda kernel(a[:, 1])
end
@async begin
  device!(1)
  @cuda kernel(a[:, 2])   # different task: synchronizing here needlessly serializes both tasks
end

In both cases, all the tracking machinery observes is the array being touched from a different stream and device; it has no notion of which task did so.

Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is to synchronize the task-local stream before the task ends, so that you can safely fetch values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead provide a GPU version of @async that inserts this synchronization automatically? Seems like that would hurt portability, though.
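
Concretely, such a macro could be a thin wrapper along these lines (a sketch; the name @gpu_async is made up for illustration):

using CUDA

# Hypothetical GPU-aware @async: synchronize the task-local stream on
# every exit path, so the task's results can always be fetched safely.
macro gpu_async(ex)
    quote
        @async begin
            try
                $(esc(ex))
            finally
                # synchronize only the task-local stream; this covers every
                # exit path (normal completion, early return, or exception)
                synchronize()
            end
        end
    end
end

Usage would mirror @async, e.g. t = @gpu_async @cuda kernel(a[:, 1]), after which fetch(t) is safe without any manual synchronize(). The try/finally is what provides the "each exit path" guarantee from point 1, and since synchronize() only waits on the task-local stream, it wouldn't serialize unrelated tasks.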

Note that this also wouldn't entirely obviate the tracking mechanism: we still need to know which stream was last used by an array operation, so that we can free the array efficiently (synchronizing only that stream rather than the whole device). The same applies to tracking the owning device: we now automatically enable P2P access when memory is accessed from another device.


Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time, or by toggling a flag. Seems a bit messy, but would be the simplest solution.
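
For example (all names here are hypothetical, just sketching possible shapes of the API):

# opting out at construction time:
a = CuArray{Float32}(undef, N, 2; track_streams=false)

# or a scoped toggle that disables the automatic synchronization
# for every array operation inside the block:
CUDA.no_auto_sync() do
    @cuda kernel(a[:, 1])
end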

cc @vchuravy

Labels: cuda array (Stuff about CuArray), good first issue (Good for newcomers), hard (This is difficult), speculative (Not sure about this one yet)
