Support disabling implicit synchronization #2662
Conversation
Your PR requires formatting changes to meet the project's style guidelines. The suggested changes:

diff --git a/src/array.jl b/src/array.jl
index bb5fc8d1d..1a40e8327 100644
--- a/src/array.jl
+++ b/src/array.jl
@@ -493,7 +493,7 @@ different tasks. This function allows to enable or disable this behavior.
it is recommended to figure out a better model instead and file an issue or pull request.
For more details see [this discussion](https://github.com/JuliaGPU/CUDA.jl/issues/2617).
"""
-function enable_synchronization!(arr::CuArray, enable::Bool=true)
+function enable_synchronization!(arr::CuArray, enable::Bool = true)
arr.data[].synchronizing = enable
return arr
end
diff --git a/src/memory.jl b/src/memory.jl
index 6cce8a3ac..b30a899cd 100644
--- a/src/memory.jl
+++ b/src/memory.jl
@@ -503,8 +503,8 @@ mutable struct Managed{M}
# which stream is currently using the memory.
stream::CuStream
- # whether accessing this memory can cause implicit synchronization
- synchronizing::Bool
+ # whether accessing this memory can cause implicit synchronization
+ synchronizing::Bool
# whether there are outstanding operations that haven't been synchronized
dirty::Bool
@@ -512,11 +512,13 @@ mutable struct Managed{M}
# whether the memory has been captured in a way that would make the dirty bit unreliable
captured::Bool
- function Managed(mem::AbstractMemory; stream = CUDA.stream(), synchronizing = true,
- dirty = true, captured = false)
+ function Managed(
+ mem::AbstractMemory; stream = CUDA.stream(), synchronizing = true,
+ dirty = true, captured = false
+ )
# NOTE: memory starts as dirty, because stream-ordered allocations are only
# guaranteed to be physically allocated at a synchronization event.
- new{typeof(mem)}(mem, stream, synchronizing, dirty, captured)
+ return new{typeof(mem)}(mem, stream, synchronizing, dirty, captured)
end
end
@@ -528,7 +530,7 @@ function synchronize(managed::Managed)
managed.dirty = false
end
function maybe_synchronize(managed::Managed)
- if managed.synchronizing && (managed.dirty || managed.captured)
+ return if managed.synchronizing && (managed.dirty || managed.captured)
synchronize(managed)
end
end
diff --git a/test/base/array.jl b/test/base/array.jl
index f6959fffe..bc8a5dc24 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -52,10 +52,10 @@ using ChainRulesCore: add!!, is_inplaceable_destination
end
@testset "synchronization" begin
- a = CUDA.zeros(2, 2)
- synchronize(a)
- CUDA.enable_synchronization!(a, false)
- CUDA.enable_synchronization!(a)
+ a = CUDA.zeros(2, 2)
+ synchronize(a)
+ CUDA.enable_synchronization!(a, false)
+ CUDA.enable_synchronization!(a)
end
@testset "unsafe_wrap" begin |
I would widen the scope: what if you have a unified array that you're using on multiple devices? But basically that's what I was thinking of, yes.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
## master #2662 +/- ##
==========================================
+ Coverage 88.62% 88.78% +0.15%
==========================================
Files 153 153
Lines 13156 13154 -2
==========================================
+ Hits 11660 11679 +19
+ Misses 1496 1475 -21
CUDA.jl Benchmarks
Benchmark suite | Current: c1e04f2 | Previous: 3d42ca2 | Ratio
---|---|---|---
latency/precompile | 46474076514.5 ns | 46520878096 ns | 1.00
latency/ttfp | 6956156543 ns | 7008237160 ns | 0.99
latency/import | 3631584313 ns | 3631961055 ns | 1.00
integration/volumerhs | 9624598.5 ns | 9623329 ns | 1.00
integration/byval/slices=1 | 147057 ns | 146619 ns | 1.00
integration/byval/slices=3 | 425280 ns | 424765 ns | 1.00
integration/byval/reference | 144868 ns | 144768 ns | 1.00
integration/byval/slices=2 | 286168 ns | 285820 ns | 1.00
integration/cudadevrt | 103450 ns | 103203 ns | 1.00
kernel/indexing | 14050.5 ns | 13905 ns | 1.01
kernel/indexing_checked | 14697 ns | 14547 ns | 1.01
kernel/occupancy | 631.9235294117647 ns | 666.4058823529411 ns | 0.95
kernel/launch | 1997.8 ns | 2007.2 ns | 1.00
kernel/rand | 18179 ns | 16740 ns | 1.09
array/reverse/1d | 19600 ns | 19411 ns | 1.01
array/reverse/2d | 23498.5 ns | 23115.5 ns | 1.02
array/reverse/1d_inplace | 10204 ns | 9745.333333333334 ns | 1.05
array/reverse/2d_inplace | 11681 ns | 11354 ns | 1.03
array/copy | 21035.5 ns | 20871 ns | 1.01
array/iteration/findall/int | 157996 ns | 157360 ns | 1.00
array/iteration/findall/bool | 138733 ns | 137944 ns | 1.01
array/iteration/findfirst/int | 153655 ns | 152667 ns | 1.01
array/iteration/findfirst/bool | 154472 ns | 154321 ns | 1.00
array/iteration/scalar | 72458.5 ns | 73050 ns | 0.99
array/iteration/logical | 212821 ns | 211424.5 ns | 1.01
array/iteration/findmin/1d | 40754 ns | 40711.5 ns | 1.00
array/iteration/findmin/2d | 93781 ns | 93411 ns | 1.00
array/reductions/reduce/1d | 34849 ns | 43368 ns | 0.80
array/reductions/reduce/2d | 40485 ns | 49343.5 ns | 0.82
array/reductions/mapreduce/1d | 32887 ns | 35682 ns | 0.92
array/reductions/mapreduce/2d | 40794 ns | 43698 ns | 0.93
array/broadcast | 20879 ns | 20818 ns | 1.00
array/copyto!/gpu_to_gpu | 13707 ns | 13676 ns | 1.00
array/copyto!/cpu_to_gpu | 208093 ns | 207803 ns | 1.00
array/copyto!/gpu_to_cpu | 244906 ns | 243284 ns | 1.01
array/accumulate/1d | 108414.5 ns | 108135 ns | 1.00
array/accumulate/2d | 79828 ns | 79400 ns | 1.01
array/construct | 1278.1 ns | 1294.3000000000002 ns | 0.99
array/random/randn/Float32 | 43467 ns | 43629.5 ns | 1.00
array/random/randn!/Float32 | 26256 ns | 26224 ns | 1.00
array/random/rand!/Int64 | 26918 ns | 27041 ns | 1.00
array/random/rand!/Float32 | 8736 ns | 8637.333333333334 ns | 1.01
array/random/rand/Int64 | 29891 ns | 37854 ns | 0.79
array/random/rand/Float32 | 13125 ns | 12706 ns | 1.03
array/permutedims/4d | 60821.5 ns | 60584 ns | 1.00
array/permutedims/2d | 54853 ns | 54607 ns | 1.00
array/permutedims/3d | 55989 ns | 55510 ns | 1.01
array/sorting/1d | 2776394.5 ns | 2777205.5 ns | 1.00
array/sorting/by | 3367046 ns | 3369336 ns | 1.00
array/sorting/2d | 1084585 ns | 1083969 ns | 1.00
cuda/synchronization/stream/auto | 1051.2 ns | 996.3846153846154 ns | 1.06
cuda/synchronization/stream/nonblocking | 6429.4 ns | 6421.1 ns | 1.00
cuda/synchronization/stream/blocking | 822.3030303030303 ns | 821.9183673469388 ns | 1.00
cuda/synchronization/context/auto | 1153 ns | 1154.2 ns | 1.00
cuda/synchronization/context/nonblocking | 6583.8 ns | 6536 ns | 1.01
cuda/synchronization/context/blocking | 897.1590909090909 ns | 899.2982456140351 ns | 1.00
This comment was automatically generated by workflow using github-action-benchmark.
The current proposition seems to fix the overlap (async execution) in Chmy (see PTsolvers/Chmy.jl#65). Unsure though if it's the best way to handle this, as besides having to call
That's not relevant here, as it's basically a constructor creating a new array. The idea is that this property is set on an array object, which is the only safe scope to do so.
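To illustrate the per-array scope, here is an illustrative sketch (not code from this PR; the two-task pattern is hypothetical):

```julia
using CUDA

a = CUDA.zeros(1024)                    # keeps implicit synchronization
b = CUDA.zeros(1024)
CUDA.enable_synchronization!(b, false)  # only `b` opts out

t = Threads.@spawn begin
    # Tasks get their own stream; reading `a` here implicitly waits
    # for any outstanding work on the stream that last used `a`.
    sum(a)
end
fetch(t)

# Accesses to `b` from other tasks will NOT synchronize implicitly;
# the user must order them manually, e.g.:
synchronize(b)
```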
@vchuravy would it be possible to resume this (and ideally bump a patch release upon merging)? Your current proposition makes it possible to work around CUDA.jl's implicit synchronization on our side, and PTsolvers/Chmy.jl#65 relies on the introduced support functions. Thanks!
Force-pushed from c1e04f2 to b85b5a6
Force-pushed from b85b5a6 to d1aa63b
Rebased and addressed review comments. @luraess, please verify this works.
Thanks, this works!
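For context, the downstream overlap pattern might look roughly like the following sketch (the array shapes, stream setup, and kernels are hypothetical, not taken from Chmy):

```julia
using CUDA

inner = CUDA.zeros(1024, 1024)
halo  = CUDA.zeros(1024, 16)
CUDA.enable_synchronization!(inner, false)
CUDA.enable_synchronization!(halo, false)

s1, s2 = CuStream(), CuStream()

# With implicit synchronization disabled, touching both arrays from a
# single task no longer serializes the two streams, so the bulk and
# boundary updates below can overlap.
CUDA.stream!(s1) do
    inner .+= 1f0   # bulk compute on stream 1
end
CUDA.stream!(s2) do
    halo .*= 2f0    # boundary work on stream 2
end

# Ordering is now explicit:
synchronize(s1)
synchronize(s2)
```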
because of JuliaGPU/CUDA.jl#2662
@maleadt is that what you had in mind for #2617?
One of the tricky things is whether we should flip the stream or not.
Since we are about to set the dirty bit, I think we must, but that of course means it is possible to "miss" logical synchronization events within a task.
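For readers following along, here is a condensed paraphrase of the bookkeeping under discussion, taken from the src/memory.jl diff above (not a verbatim copy of the source):

```julia
mutable struct Managed{M}
    mem::M
    stream::CuStream     # which stream is currently using the memory
    synchronizing::Bool  # whether access can cause implicit synchronization
    dirty::Bool          # whether there are outstanding, unsynchronized operations
    captured::Bool       # capture makes the dirty bit unreliable
end

# Only synchronize when this allocation opted in AND there may be
# pending work (dirty) or the dirty bit cannot be trusted (captured).
function maybe_synchronize(managed::Managed)
    if managed.synchronizing && (managed.dirty || managed.captured)
        synchronize(managed)
    end
end
```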
Closes #2617