Great! Seems like a solid improvement. I'll have a closer look soon; it would be nice if we could keep the launch configuration via a configuration function instead of manually having to call `cufunction`.
So it's still a factor of 4 behind the old tagged release (1.7.2) though, right?
@wongalvis14 and I discussed on Slack. Here are timings on my machine for the different versions:
(@v1.4) pkg> add CuArrays#v1.7.2
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Updating `~/.julia/environments/v1.4/Project.toml`
[3a865a2d] + CuArrays v1.7.2 #v1.7.2 (https://github.com/JuliaGPU/CuArrays.jl.git)
Updating `~/.julia/environments/v1.4/Manifest.toml`
[3895d2a7] + CUDAapi v3.1.0
[c5f51814] + CUDAdrv v6.0.0
[be33ccc6] + CUDAnative v2.10.2
[3a865a2d] + CuArrays v1.7.2 #v1.7.2 (https://github.com/JuliaGPU/CuArrays.jl.git)
[0c68f7d7] + GPUArrays v2.0.1
[929cbde3] + LLVM v1.3.4
[a759f4b9] + TimerOutputs v0.5.3
julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
┌ Warning: Incompatibility detected between CUDA and LLVM 8.0+; disabling debug info emission for CUDA kernels
└ @ CUDAnative ~/.julia/packages/CUDAnative/hfulr/src/CUDAnative.jl:114
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.
julia> function pi_mc_cu(nsamples)
xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end
pi_mc_cu (generic function with 1 method)
julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial:
memory estimate: 4.61 KiB
allocs estimate: 126
--------------
minimum time: 594.163 μs (0.00% GC)
median time: 658.573 μs (0.00% GC)
mean time: 671.493 μs (2.87% GC)
maximum time: 2.311 ms (55.14% GC)
--------------
samples: 7424
evals/sample: 1
(@v1.4) pkg> add https://github.com/JuliaGPU/CuArrays.jl.git#master
Updating git-repo `https://github.com/JuliaGPU/CuArrays.jl.git`
Resolving package versions...
Updating `~/.julia/environments/v1.4/Project.toml`
[3a865a2d] + CuArrays v2.0.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
Updating `~/.julia/environments/v1.4/Manifest.toml`
[3895d2a7] + CUDAapi v4.0.0
[c5f51814] + CUDAdrv v6.2.1
[be33ccc6] + CUDAnative v3.0.1
[f68482b8] + Cthulhu v1.0.0
[3a865a2d] + CuArrays v2.0.0 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
[0c68f7d7] + GPUArrays v3.1.0
[929cbde3] + LLVM v1.3.4
[dc548174] + TerminalMenus v0.1.0
[a759f4b9] + TimerOutputs v0.5.3
julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.
julia> function pi_mc_cu(nsamples)
xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end
pi_mc_cu (generic function with 1 method)
julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial:
memory estimate: 7.81 KiB
allocs estimate: 245
--------------
minimum time: 10.014 ms (0.00% GC)
median time: 10.159 ms (0.00% GC)
mean time: 10.198 ms (0.31% GC)
maximum time: 11.559 ms (9.85% GC)
--------------
samples: 491
evals/sample: 1
(@v1.4) pkg> add https://github.com/wongalvis14/CuArrays.jl.git#mapreduce
Updating git-repo `https://github.com/wongalvis14/CuArrays.jl.git`
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Updating `~/.julia/environments/v1.4/Project.toml`
[3a865a2d] + CuArrays v2.0.0 #mapreduce (https://github.com/wongalvis14/CuArrays.jl.git)
Updating `~/.julia/environments/v1.4/Manifest.toml`
[3895d2a7] + CUDAapi v4.0.0
[c5f51814] + CUDAdrv v6.2.1
[be33ccc6] + CUDAnative v3.0.1
[f68482b8] + Cthulhu v1.0.0
[3a865a2d] + CuArrays v2.0.0 #mapreduce (https://github.com/wongalvis14/CuArrays.jl.git)
[0c68f7d7] + GPUArrays v3.1.0
[929cbde3] + LLVM v1.3.4
[dc548174] + TerminalMenus v0.1.0
[a759f4b9] + TimerOutputs v0.5.3
julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.
julia> function pi_mc_cu(nsamples)
xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
end
pi_mc_cu (generic function with 1 method)
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial:
memory estimate: 11.58 KiB
allocs estimate: 357
--------------
minimum time: 7.527 ms (0.00% GC)
median time: 7.715 ms (0.00% GC)
mean time: 7.795 ms (0.52% GC)
maximum time: 10.703 ms (13.98% GC)
--------------
samples: 642
evals/sample: 1
I'm seeing marginal gains, but still a very large regression over 1.7.2.
@maleadt mentioned in #611 that it could be because …
Ohh, I see. Yeah, I missed that.
This implementation is faster than the old one on 1D array mapreduce (v1.7 vs. the new implementation; benchmark screenshots omitted).
Continuing the approach of this PR, which already improved performance by a good 25% (I can't reproduce @wongalvis14's timings with my GPU), I'm now selecting a launch configuration based on the recommended grid size as returned by the occupancy API. Together with #663 that brings us back to the original performance. Not sure how the old GPUArrays implementation did that though, as it launched multiple blocks without device-wide synchronization (i.e. it only used a single kernel)...

EDIT: ha, it did the reduction on the CPU, sneaky little bastard! https://github.com/JuliaGPU/GPUArrays.jl/blob/fc08102f999e999fd3c6ac176bda0af450925032/src/mapreduce.jl#L179-L180
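For context, this is roughly what the occupancy-driven launch configuration pattern looks like; a minimal sketch written against today's CUDA.jl API (the successor to CUDAdrv/CUDAnative), with a made-up toy kernel and array names rather than the actual mapreduce kernel from this PR:

```julia
using CUDA

# Toy elementwise kernel standing in for the actual mapreduce kernel.
function vadd_kernel!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(1_000_000); b = CUDA.rand(1_000_000); c = similar(a)

# Compile without launching, then ask the occupancy API for a recommended configuration.
kernel = @cuda launch=false vadd_kernel!(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads=threads, blocks=blocks)
```

`launch_configuration` suggests a block/thread count based on the compiled kernel's resource usage (registers, shared memory); clamping `threads` to the problem size and deriving `blocks` from it is the usual follow-up.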
bors r+ |
Build succeeded |
More than 3-fold improvement over the latest implementation. Benchmarking function from #611:
First stage: use the maximum number of threads a single block can hold as the number of blocks, and perform the reduction with serial iteration per thread where needed.
Second stage: reduce the per-block partial results within a single block, with no serial iteration.
This approach aims to strike an optimal balance between the workload of each thread, kernel launch overhead, and parallel resource exhaustion; a simplified sketch of the two stages follows.
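To make the two stages concrete, here is a heavily simplified, hypothetical sketch for a plain sum over one array, written with today's CUDA.jl API rather than the CuArrays/CUDAnative internals this PR actually modifies. Fixed 256-thread blocks and `+` with a zero identity are assumed, and the names `block_sum_kernel!` and `two_stage_sum` are made up for illustration:

```julia
using CUDA

# One kernel serves both stages: each thread first folds a strided slice of
# `src` serially, then the block combines the per-thread values in shared memory.
function block_sum_kernel!(dst, src)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x

    # Serial iteration: grid-stride loop over the input.
    acc = zero(eltype(dst))
    while i <= length(src)
        @inbounds acc += src[i]
        i += stride
    end

    # Tree reduction within the block (assumes blockDim().x is a power of two).
    shared = CuDynamicSharedArray(eltype(dst), blockDim().x)
    @inbounds shared[tid] = acc
    sync_threads()
    s = blockDim().x ÷ 2
    while s > 0
        if tid <= s
            @inbounds shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end

    # One partial result per block.
    if tid == 1
        @inbounds dst[blockIdx().x] = shared[1]
    end
    return
end

function two_stage_sum(src::CuArray{Float32})
    threads = 256  # simplified; the PR derives this from the device's maximum block size
    shmem = threads * sizeof(Float32)
    # Stage 1: up to `threads` blocks, each writing one partial result.
    blocks = clamp(cld(length(src), threads), 1, threads)
    partials = CUDA.zeros(Float32, blocks)
    @cuda threads=threads blocks=blocks shmem=shmem block_sum_kernel!(partials, src)
    # Stage 2: one block folds the partials; the grid-stride loop touches at most
    # one element per thread, so no further serial iteration is needed.
    out = CUDA.zeros(Float32, 1)
    @cuda threads=threads blocks=1 shmem=shmem block_sum_kernel!(out, partials)
    Array(out)[1]
end
```

The actual implementation is more general (arbitrary `f`/`op`, an `init` value, and multiple input arrays, as in the benchmark above), but the split is the same: cap the number of blocks in the first launch so the second launch can combine all partial results within a single block.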