
Support disabling implicit synchronization #2662

Merged: 3 commits into master from vc/unsafe_stream_switching on Apr 7, 2025

Conversation

@vchuravy (Member) commented on Feb 17, 2025

@maleadt is this what you had in mind for #2617?

One of the tricky things is whether we should flip the stream or not. Since we are about to set the dirty bit, I think we must, but that of course means it is possible to "miss" logical synchronization events within a task:

@spawn begin
    # operation A
    # task switch -- synchronize on a different task
    # operation B
end

Closes #2617
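
(For context, the scenario above could be reproduced with something like the following sketch. It assumes the standard CUDA.jl model where each task gets its own stream; the array name and workload are illustrative.)

using CUDA

a = CUDA.rand(1024)     # last used on the parent task's stream

t = Threads.@spawn begin
    # This task runs on its own stream. Touching `a` here would normally
    # trigger an implicit synchronization against the parent's stream and
    # re-tag `a` with this task's stream, marking it dirty again.
    sum(a)
end
fetch(t)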

@github-actions bot (Contributor) commented on Feb 17, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/array.jl b/src/array.jl
index bb5fc8d1d..1a40e8327 100644
--- a/src/array.jl
+++ b/src/array.jl
@@ -493,7 +493,7 @@ different tasks. This function allows to enable or disable this behavior.
     it is recommended to figure out a better model instead and file an issue or pull request.
     For more details see [this discussion](https://github.com/JuliaGPU/CUDA.jl/issues/2617).
 """
-function enable_synchronization!(arr::CuArray, enable::Bool=true)
+function enable_synchronization!(arr::CuArray, enable::Bool = true)
     arr.data[].synchronizing = enable
     return arr
 end
diff --git a/src/memory.jl b/src/memory.jl
index 6cce8a3ac..b30a899cd 100644
--- a/src/memory.jl
+++ b/src/memory.jl
@@ -503,8 +503,8 @@ mutable struct Managed{M}
   # which stream is currently using the memory.
   stream::CuStream
 
-  # whether accessing this memory can cause implicit synchronization
-  synchronizing::Bool
+    # whether accessing this memory can cause implicit synchronization
+    synchronizing::Bool
 
   # whether there are outstanding operations that haven't been synchronized
   dirty::Bool
@@ -512,11 +512,13 @@ mutable struct Managed{M}
   # whether the memory has been captured in a way that would make the dirty bit unreliable
   captured::Bool
 
-  function Managed(mem::AbstractMemory; stream = CUDA.stream(), synchronizing = true,
-                   dirty = true, captured = false)
+    function Managed(
+            mem::AbstractMemory; stream = CUDA.stream(), synchronizing = true,
+            dirty = true, captured = false
+        )
     # NOTE: memory starts as dirty, because stream-ordered allocations are only
     #       guaranteed to be physically allocated at a synchronization event.
-    new{typeof(mem)}(mem, stream, synchronizing, dirty, captured)
+        return new{typeof(mem)}(mem, stream, synchronizing, dirty, captured)
   end
 end
 
@@ -528,7 +530,7 @@ function synchronize(managed::Managed)
   managed.dirty = false
 end
 function maybe_synchronize(managed::Managed)
-  if managed.synchronizing && (managed.dirty || managed.captured)
+    return if managed.synchronizing && (managed.dirty || managed.captured)
     synchronize(managed)
   end
 end
diff --git a/test/base/array.jl b/test/base/array.jl
index f6959fffe..bc8a5dc24 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -52,10 +52,10 @@ using ChainRulesCore: add!!, is_inplaceable_destination
 end
 
 @testset "synchronization" begin
-  a = CUDA.zeros(2, 2)
-  synchronize(a)
-  CUDA.enable_synchronization!(a, false)
-  CUDA.enable_synchronization!(a)
+    a = CUDA.zeros(2, 2)
+    synchronize(a)
+    CUDA.enable_synchronization!(a, false)
+    CUDA.enable_synchronization!(a)
 end
 
 @testset "unsafe_wrap" begin

@maleadt (Member) commented on Feb 17, 2025

I would widen the scope: what if you have a unified array that you're using on multiple devices? But basically, that's what I was thinking of, yes.
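
(The multi-device case alluded to here might look roughly like this; an illustrative sketch that assumes two devices and CUDA.jl's unified-memory keyword on `cu`.)

using CUDA

# A unified buffer is accessible from every device in the system.
a = cu(rand(Float32, 1024); unified = true)

CUDA.device!(0)
a .+= 1    # `a` is now in use on device 0's stream

CUDA.device!(1)
sum(a)     # accessing `a` from device 1 raises the same implicit-sync question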

@codecov bot commented on Feb 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.78%. Comparing base (430b7d6) to head (38ebf8f).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2662      +/-   ##
==========================================
+ Coverage   88.62%   88.78%   +0.15%     
==========================================
  Files         153      153              
  Lines       13156    13154       -2     
==========================================
+ Hits        11660    11679      +19     
+ Misses       1496     1475      -21     


@github-actions bot (Contributor) left a comment


CUDA.jl Benchmarks

Benchmark suite Current: c1e04f2 Previous: 3d42ca2 Ratio
latency/precompile 46474076514.5 ns 46520878096 ns 1.00
latency/ttfp 6956156543 ns 7008237160 ns 0.99
latency/import 3631584313 ns 3631961055 ns 1.00
integration/volumerhs 9624598.5 ns 9623329 ns 1.00
integration/byval/slices=1 147057 ns 146619 ns 1.00
integration/byval/slices=3 425280 ns 424765 ns 1.00
integration/byval/reference 144868 ns 144768 ns 1.00
integration/byval/slices=2 286168 ns 285820 ns 1.00
integration/cudadevrt 103450 ns 103203 ns 1.00
kernel/indexing 14050.5 ns 13905 ns 1.01
kernel/indexing_checked 14697 ns 14547 ns 1.01
kernel/occupancy 631.9235294117647 ns 666.4058823529411 ns 0.95
kernel/launch 1997.8 ns 2007.2 ns 1.00
kernel/rand 18179 ns 16740 ns 1.09
array/reverse/1d 19600 ns 19411 ns 1.01
array/reverse/2d 23498.5 ns 23115.5 ns 1.02
array/reverse/1d_inplace 10204 ns 9745.333333333334 ns 1.05
array/reverse/2d_inplace 11681 ns 11354 ns 1.03
array/copy 21035.5 ns 20871 ns 1.01
array/iteration/findall/int 157996 ns 157360 ns 1.00
array/iteration/findall/bool 138733 ns 137944 ns 1.01
array/iteration/findfirst/int 153655 ns 152667 ns 1.01
array/iteration/findfirst/bool 154472 ns 154321 ns 1.00
array/iteration/scalar 72458.5 ns 73050 ns 0.99
array/iteration/logical 212821 ns 211424.5 ns 1.01
array/iteration/findmin/1d 40754 ns 40711.5 ns 1.00
array/iteration/findmin/2d 93781 ns 93411 ns 1.00
array/reductions/reduce/1d 34849 ns 43368 ns 0.80
array/reductions/reduce/2d 40485 ns 49343.5 ns 0.82
array/reductions/mapreduce/1d 32887 ns 35682 ns 0.92
array/reductions/mapreduce/2d 40794 ns 43698 ns 0.93
array/broadcast 20879 ns 20818 ns 1.00
array/copyto!/gpu_to_gpu 13707 ns 13676 ns 1.00
array/copyto!/cpu_to_gpu 208093 ns 207803 ns 1.00
array/copyto!/gpu_to_cpu 244906 ns 243284 ns 1.01
array/accumulate/1d 108414.5 ns 108135 ns 1.00
array/accumulate/2d 79828 ns 79400 ns 1.01
array/construct 1278.1 ns 1294.3000000000002 ns 0.99
array/random/randn/Float32 43467 ns 43629.5 ns 1.00
array/random/randn!/Float32 26256 ns 26224 ns 1.00
array/random/rand!/Int64 26918 ns 27041 ns 1.00
array/random/rand!/Float32 8736 ns 8637.333333333334 ns 1.01
array/random/rand/Int64 29891 ns 37854 ns 0.79
array/random/rand/Float32 13125 ns 12706 ns 1.03
array/permutedims/4d 60821.5 ns 60584 ns 1.00
array/permutedims/2d 54853 ns 54607 ns 1.00
array/permutedims/3d 55989 ns 55510 ns 1.01
array/sorting/1d 2776394.5 ns 2777205.5 ns 1.00
array/sorting/by 3367046 ns 3369336 ns 1.00
array/sorting/2d 1084585 ns 1083969 ns 1.00
cuda/synchronization/stream/auto 1051.2 ns 996.3846153846154 ns 1.06
cuda/synchronization/stream/nonblocking 6429.4 ns 6421.1 ns 1.00
cuda/synchronization/stream/blocking 822.3030303030303 ns 821.9183673469388 ns 1.00
cuda/synchronization/context/auto 1153 ns 1154.2 ns 1.00
cuda/synchronization/context/nonblocking 6583.8 ns 6536 ns 1.01
cuda/synchronization/context/blocking 897.1590909090909 ns 899.2982456140351 ns 1.00

This comment was automatically generated by a workflow using github-action-benchmark.

@luraess (Contributor) commented on Feb 17, 2025

The current proposition seems to fix the overlap (async execution) in Chmy (see PTsolvers/Chmy.jl#65). I'm unsure, though, whether it's the best way to handle this: besides having to call unsafe_disable_task_sync! for each array creation, one also has to call it after each resize! or unsafe_wrap, and potentially other operations, which makes the usage rather fragile and error-prone.

@maleadt (Member) commented on Feb 18, 2025

> one also has to call it after each resize!

resize! should propagate the flag.

> or unsafe_wrap

That's not relevant here, as unsafe_wrap is basically a constructor creating a new array. The idea is that this property is set on an array object, which is the only safe scope to do so.
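
(To make the distinction concrete, a hypothetical sketch using the enable_synchronization! name from the diff above; the resize! propagation is the behavior described here, not verified code.)

using CUDA

a = CUDA.zeros(Float32, 16)
CUDA.enable_synchronization!(a, false)

resize!(a, 32)    # same array object: the flag is expected to carry over

# unsafe_wrap constructs a *new* array object around existing memory,
# so the property has to be set again on the new object.
b = unsafe_wrap(CuArray, pointer(a), 32)
CUDA.enable_synchronization!(b, false)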

@luraess (Contributor) commented on Apr 2, 2025

@vchuravy, would it be possible to resume this (and ideally bump a patch release upon merging)? Your current proposition makes it possible to work around CUDA.jl's implicit synchronization on our side, and PTsolvers/Chmy.jl#65 relies on the introduced support functions. Thanks!

@maleadt force-pushed the vc/unsafe_stream_switching branch from c1e04f2 to b85b5a6 on April 4, 2025 at 12:55
@maleadt force-pushed the vc/unsafe_stream_switching branch from b85b5a6 to d1aa63b on April 4, 2025 at 12:57
@maleadt added the "cuda array" (Stuff about CuArray.) label on Apr 4, 2025
@maleadt (Member) commented on Apr 4, 2025

Rebased and addressed review comments. @luraess, please verify this works.

@maleadt changed the title from "Support disabling automatic sync on task switch" to "Support disabling implicit synchronization" on Apr 4, 2025
@luraess (Contributor) commented on Apr 5, 2025

> Rebased and addressed review comments. @luraess, please verify this works.

Thanks, this works!

@maleadt enabled auto-merge (squash) on April 6, 2025 at 06:52
@maleadt disabled auto-merge on April 7, 2025 at 07:19
@maleadt merged commit 07f67d7 into master on Apr 7, 2025 (1 of 3 checks passed)
@maleadt deleted the vc/unsafe_stream_switching branch on April 7, 2025 at 07:19
luraess added a commit to PTsolvers/Chmy.jl that referenced this pull request Apr 7, 2025
Labels: cuda array (Stuff about CuArray.), performance (How fast can we go?)

Successfully merging this pull request may close this issue:
#2617: Ability to opt out of / improved automatic synchronization between tasks for shared array usage

4 participants