Support disabling implicit synchronization #2662
Conversation
Your PR requires formatting changes to meet the project's style guidelines. The suggested changes:

diff --git a/src/array.jl b/src/array.jl
index bb5fc8d1d..1a40e8327 100644
--- a/src/array.jl
+++ b/src/array.jl
@@ -493,7 +493,7 @@ different tasks. This function allows to enable or disable this behavior.
it is recommended to figure out a better model instead and file an issue or pull request.
For more details see [this discussion](https://github.com/JuliaGPU/CUDA.jl/issues/2617).
"""
-function enable_synchronization!(arr::CuArray, enable::Bool=true)
+function enable_synchronization!(arr::CuArray, enable::Bool = true)
arr.data[].synchronizing = enable
return arr
end
diff --git a/src/memory.jl b/src/memory.jl
index 6cce8a3ac..b30a899cd 100644
--- a/src/memory.jl
+++ b/src/memory.jl
@@ -503,8 +503,8 @@ mutable struct Managed{M}
# which stream is currently using the memory.
stream::CuStream
- # whether accessing this memory can cause implicit synchronization
- synchronizing::Bool
+ # whether accessing this memory can cause implicit synchronization
+ synchronizing::Bool
# whether there are outstanding operations that haven't been synchronized
dirty::Bool
@@ -512,11 +512,13 @@ mutable struct Managed{M}
# whether the memory has been captured in a way that would make the dirty bit unreliable
captured::Bool
- function Managed(mem::AbstractMemory; stream = CUDA.stream(), synchronizing = true,
- dirty = true, captured = false)
+ function Managed(
+ mem::AbstractMemory; stream = CUDA.stream(), synchronizing = true,
+ dirty = true, captured = false
+ )
# NOTE: memory starts as dirty, because stream-ordered allocations are only
# guaranteed to be physically allocated at a synchronization event.
- new{typeof(mem)}(mem, stream, synchronizing, dirty, captured)
+ return new{typeof(mem)}(mem, stream, synchronizing, dirty, captured)
end
end
@@ -528,7 +530,7 @@ function synchronize(managed::Managed)
managed.dirty = false
end
function maybe_synchronize(managed::Managed)
- if managed.synchronizing && (managed.dirty || managed.captured)
+ return if managed.synchronizing && (managed.dirty || managed.captured)
synchronize(managed)
end
end
diff --git a/test/base/array.jl b/test/base/array.jl
index f6959fffe..bc8a5dc24 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -52,10 +52,10 @@ using ChainRulesCore: add!!, is_inplaceable_destination
end
@testset "synchronization" begin
- a = CUDA.zeros(2, 2)
- synchronize(a)
- CUDA.enable_synchronization!(a, false)
- CUDA.enable_synchronization!(a)
+ a = CUDA.zeros(2, 2)
+ synchronize(a)
+ CUDA.enable_synchronization!(a, false)
+ CUDA.enable_synchronization!(a)
end
@testset "unsafe_wrap" begin |
I would widen the scope: what if you have a unified array that you're using on multiple devices? But basically that's what I was thinking of, yes.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
## master #2662 +/- ##
==========================================
+ Coverage 88.62% 88.78% +0.15%
==========================================
Files 153 153
Lines 13156 13154 -2
==========================================
+ Hits 11660 11679 +19
+ Misses 1496 1475 -21
CUDA.jl Benchmarks
Benchmark suite | Current: c1e04f2 | Previous: 3d42ca2 | Ratio
---|---|---|---
latency/precompile | 46474076514.5 ns | 46520878096 ns | 1.00
latency/ttfp | 6956156543 ns | 7008237160 ns | 0.99
latency/import | 3631584313 ns | 3631961055 ns | 1.00
integration/volumerhs | 9624598.5 ns | 9623329 ns | 1.00
integration/byval/slices=1 | 147057 ns | 146619 ns | 1.00
integration/byval/slices=3 | 425280 ns | 424765 ns | 1.00
integration/byval/reference | 144868 ns | 144768 ns | 1.00
integration/byval/slices=2 | 286168 ns | 285820 ns | 1.00
integration/cudadevrt | 103450 ns | 103203 ns | 1.00
kernel/indexing | 14050.5 ns | 13905 ns | 1.01
kernel/indexing_checked | 14697 ns | 14547 ns | 1.01
kernel/occupancy | 631.9235294117647 ns | 666.4058823529411 ns | 0.95
kernel/launch | 1997.8 ns | 2007.2 ns | 1.00
kernel/rand | 18179 ns | 16740 ns | 1.09
array/reverse/1d | 19600 ns | 19411 ns | 1.01
array/reverse/2d | 23498.5 ns | 23115.5 ns | 1.02
array/reverse/1d_inplace | 10204 ns | 9745.333333333334 ns | 1.05
array/reverse/2d_inplace | 11681 ns | 11354 ns | 1.03
array/copy | 21035.5 ns | 20871 ns | 1.01
array/iteration/findall/int | 157996 ns | 157360 ns | 1.00
array/iteration/findall/bool | 138733 ns | 137944 ns | 1.01
array/iteration/findfirst/int | 153655 ns | 152667 ns | 1.01
array/iteration/findfirst/bool | 154472 ns | 154321 ns | 1.00
array/iteration/scalar | 72458.5 ns | 73050 ns | 0.99
array/iteration/logical | 212821 ns | 211424.5 ns | 1.01
array/iteration/findmin/1d | 40754 ns | 40711.5 ns | 1.00
array/iteration/findmin/2d | 93781 ns | 93411 ns | 1.00
array/reductions/reduce/1d | 34849 ns | 43368 ns | 0.80
array/reductions/reduce/2d | 40485 ns | 49343.5 ns | 0.82
array/reductions/mapreduce/1d | 32887 ns | 35682 ns | 0.92
array/reductions/mapreduce/2d | 40794 ns | 43698 ns | 0.93
array/broadcast | 20879 ns | 20818 ns | 1.00
array/copyto!/gpu_to_gpu | 13707 ns | 13676 ns | 1.00
array/copyto!/cpu_to_gpu | 208093 ns | 207803 ns | 1.00
array/copyto!/gpu_to_cpu | 244906 ns | 243284 ns | 1.01
array/accumulate/1d | 108414.5 ns | 108135 ns | 1.00
array/accumulate/2d | 79828 ns | 79400 ns | 1.01
array/construct | 1278.1 ns | 1294.3000000000002 ns | 0.99
array/random/randn/Float32 | 43467 ns | 43629.5 ns | 1.00
array/random/randn!/Float32 | 26256 ns | 26224 ns | 1.00
array/random/rand!/Int64 | 26918 ns | 27041 ns | 1.00
array/random/rand!/Float32 | 8736 ns | 8637.333333333334 ns | 1.01
array/random/rand/Int64 | 29891 ns | 37854 ns | 0.79
array/random/rand/Float32 | 13125 ns | 12706 ns | 1.03
array/permutedims/4d | 60821.5 ns | 60584 ns | 1.00
array/permutedims/2d | 54853 ns | 54607 ns | 1.00
array/permutedims/3d | 55989 ns | 55510 ns | 1.01
array/sorting/1d | 2776394.5 ns | 2777205.5 ns | 1.00
array/sorting/by | 3367046 ns | 3369336 ns | 1.00
array/sorting/2d | 1084585 ns | 1083969 ns | 1.00
cuda/synchronization/stream/auto | 1051.2 ns | 996.3846153846154 ns | 1.06
cuda/synchronization/stream/nonblocking | 6429.4 ns | 6421.1 ns | 1.00
cuda/synchronization/stream/blocking | 822.3030303030303 ns | 821.9183673469388 ns | 1.00
cuda/synchronization/context/auto | 1153 ns | 1154.2 ns | 1.00
cuda/synchronization/context/nonblocking | 6583.8 ns | 6536 ns | 1.01
cuda/synchronization/context/blocking | 897.1590909090909 ns | 899.2982456140351 ns | 1.00
This comment was automatically generated by workflow using github-action-benchmark.
The current proposition seems to fix the overlap (async execution) in Chmy (see PTsolvers/Chmy.jl#65). Unsure though if it's the best way to handle this, as besides having to call
That's not relevant here, as it's basically a constructor creating a new array. The idea is that this property is set on an array object, which is the only safe scope to do so.
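To illustrate the per-array scope, here is an illustrative sketch (not code from this PR; the two-task pattern is hypothetical):

```julia
using CUDA

a = CUDA.zeros(1024)                    # keeps implicit synchronization
b = CUDA.zeros(1024)
CUDA.enable_synchronization!(b, false)  # only `b` opts out

t = Threads.@spawn begin
    # Tasks get their own stream; reading `a` here implicitly waits
    # for any outstanding work on the stream that last used `a`.
    sum(a)
end
fetch(t)

# Accesses to `b` from other tasks will NOT synchronize implicitly;
# the user must order them manually, e.g.:
synchronize(b)
```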
@vchuravy would it be possible to resume this (and ideally bump a patch release upon merging)? Your current proposition makes it possible to work around CUDA.jl's implicit synchronization on our side, and PTsolvers/Chmy.jl#65 relies on the introduced support functions. Thanks!
Force-pushed from c1e04f2 to b85b5a6
Force-pushed from b85b5a6 to d1aa63b
Rebased and addressed review comments. @luraess, please verify this works.
Thanks, this works!
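For context, the downstream overlap pattern might look roughly like the following sketch (the array shapes, stream setup, and kernels are hypothetical, not taken from Chmy):

```julia
using CUDA

inner = CUDA.zeros(1024, 1024)
halo  = CUDA.zeros(1024, 16)
CUDA.enable_synchronization!(inner, false)
CUDA.enable_synchronization!(halo, false)

s1, s2 = CuStream(), CuStream()

# With implicit synchronization disabled, touching both arrays from a
# single task no longer serializes the two streams, so the bulk and
# boundary updates below can overlap.
CUDA.stream!(s1) do
    inner .+= 1f0   # bulk compute on stream 1
end
CUDA.stream!(s2) do
    halo .*= 2f0    # boundary work on stream 2
end

# Ordering is now explicit:
synchronize(s1)
synchronize(s2)
```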
because of JuliaGPU/CUDA.jl#2662
@maleadt is that what you had in mind for #2617?
One of the tricky things is whether we should flip the stream or not.
Since we are about to set the dirty bit, I think we must, but that of course means it is possible to "miss" logical synchronization events within a task.
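For readers following along, here is a condensed paraphrase of the bookkeeping under discussion, taken from the src/memory.jl diff above (not a verbatim copy of the source):

```julia
mutable struct Managed{M}
    mem::M
    stream::CuStream     # which stream is currently using the memory
    synchronizing::Bool  # whether access can cause implicit synchronization
    dirty::Bool          # whether there are outstanding, unsynchronized operations
    captured::Bool       # capture makes the dirty bit unreliable
end

# Only synchronize when this allocation opted in AND there may be
# pending work (dirty) or the dirty bit cannot be trusted (captured).
function maybe_synchronize(managed::Managed)
    if managed.synchronizing && (managed.dirty || managed.captured)
        synchronize(managed)
    end
end
```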
Closes #2617