
Ability to opt out of / improved automatic synchronization between tasks for shared array usage #2617

@maleadt

A single array may be used concurrently on different devices (when it's backed by unified memory), or just on different streams, in which case you don't want to synchronize the streams involved. For example (pseudocode):

a = cu(rand(N, 2))

@async begin
  @cuda kernel(a[:, 1])   # executes on this task's local stream
end

@async begin
  @cuda kernel(a[:, 2])   # executes on another task's local stream
end

Here, the second kernel may end up waiting for the first one to complete, because we automatically synchronize when accessing the array from a different stream:

CUDA.jl/src/memory.jl

Lines 565 to 569 in a4a9166

# accessing memory on another stream: ensure the data is ready and take ownership
if managed.stream != state.stream
    maybe_synchronize(managed)
    managed.stream = state.stream
end

This was identified in #2615, but note that the problem doesn't necessarily involve multiple GPUs: it would manifest just as well when attempting to overlap kernel execution on a single device.


It's not immediately clear to me how to best solve this. @pxl-th suggested never synchronizing automatically between different tasks, but that doesn't seem like a viable option to me:

  1. it would re-introduce the IMO surprising and hard-to-explain behavior of having to call synchronize() explicitly on each exit path of an @async block, just to make it possible to read the data validly outside of it;
  2. we cannot easily identify when the synchronization happens between different tasks, unless we also track which tasks operate on an array, which doesn't seem straightforward.

The first point is crucial to me: I don't want to have to tell users that they basically can't use CuArrays safely in an @async block, without also having to explain the asynchronous nature of GPU computing.
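
To make that concrete, here's a sketch of what user code would have to look like if we never synchronized automatically (reusing a and kernel from the example above):

t = @async begin
    @cuda kernel(a[:, 1])
    synchronize()   # required on every exit path; without it, the main task
                    # could observe a before the kernel has finished
end
wait(t)
Array(a)   # safe only because the task synchronized its stream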

To illustrate the second point:

device!(0)
a = cu(rand(N, 2))
@cuda kernel(a[:, 1])
device!(1)
@cuda kernel(a[:, 2])   # same task: synchronizing here is expected

# is detected the same as

device!(0)
a = cu(rand(N, 2))
@async begin
  device!(0)
  @cuda kernel(a[:, 1])
end
@async begin
  device!(1)
  @cuda kernel(a[:, 2])   # different task: synchronizing here needlessly serializes both tasks
end

In both cases, all the tracking machinery observes is the array being touched from a different stream and device; it has no notion of which task did so.

Without having put too much thought into it, I wonder if we can't solve this differently. Essentially, what we want is to synchronize the task-local stream before the task ends, so that you can safely fetch values from it. That isn't possible, so we opted for detecting when the fetched array is used on a different stream. I wonder if we should instead provide a GPU version of @async that inserts this synchronization automatically? Seems like that would hurt portability, though.
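
Concretely, such a macro could be a thin wrapper along these lines (a sketch; the name @gpu_async is made up for illustration):

using CUDA

# Hypothetical GPU-aware @async: synchronize the task-local stream on
# every exit path, so the task's results can always be fetched safely.
macro gpu_async(ex)
    quote
        @async begin
            try
                $(esc(ex))
            finally
                # synchronize only the task-local stream; this covers every
                # exit path (normal completion, early return, or exception)
                synchronize()
            end
        end
    end
end

Usage would mirror @async, e.g. t = @gpu_async @cuda kernel(a[:, 1]), after which fetch(t) is safe without any manual synchronize(). The try/finally is what provides the "each exit path" guarantee from point 1, and since synchronize() only waits on the task-local stream, it wouldn't serialize unrelated tasks.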

Note that this also wouldn't entirely obviate the tracking mechanism: we still need to know which stream was last used by an array operation, so that we can free the array efficiently (synchronizing only that stream rather than the whole device). The same applies to tracking the owning device: we now automatically enable P2P access when memory is accessed from another device.


Alternatively, we could offer a way to opt out of the automatic behavior, either at array construction time, or by toggling a flag. Seems a bit messy, but would be the simplest solution.
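
For example (all names here are hypothetical, just sketching possible shapes of the API):

# opting out at construction time:
a = CuArray{Float32}(undef, N, 2; track_streams=false)

# or a scoped toggle that disables the automatic synchronization
# for every array operation inside the block:
CUDA.no_auto_sync() do
    @cuda kernel(a[:, 1])
end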

cc @vchuravy

Labels: cuda array (Stuff about CuArray), good first issue (Good for newcomers), hard (This is difficult), speculative (Not sure about this one yet)
