Performance regression with mapreduce #611
Not really a bug, but the only two options when creating the issue were 'bug report' and 'feature request'.
Just an update: trying this again on the current master I get a further factor-of-10 performance regression:

```julia
julia> using BenchmarkTools, CuArrays
[ Info: Precompiling CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
WARNING: using CuArrays.BLAS in module Main conflicts with an existing identifier.

julia> function pi_mc_cu(nsamples)
           xs = CuArrays.rand(nsamples); ys = CuArrays.rand(nsamples)
           mapreduce((x, y) -> (x^2 + y^2) < 1.0, +, xs, ys, init=0) * 4/nsamples
       end
pi_mc_cu (generic function with 1 method)

julia> @benchmark pi_mc_cu(10000000)
[ Info: Building the CUDAnative run-time library for your sm_75 device, this might take a while...
BenchmarkTools.Trial:
  memory estimate:  17.08 KiB
  allocs estimate:  494
  --------------
  minimum time:     11.079 ms (0.00% GC)
  median time:      11.140 ms (0.00% GC)
  mean time:        11.188 ms (0.30% GC)
  maximum time:     13.158 ms (10.40% GC)
  --------------
  samples:          447
  evals/sample:     1
```
I had hoped #642 would fix this, but it doesn't do much. Maybe the serial fallback for small arrays, as used to exist with the old GPUArrays and CuArrays mapreduce implementations, is crucial in this situation. The input isn't particularly tiny here, though, so I'd need to profile properly first.
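For illustration, a small-array fallback could look roughly like the sketch below. This is hypothetical code, not the old GPUArrays/CuArrays implementation: the helper name and `threshold` are made up, and it assumes `f` and `op` are also valid on the CPU.

```julia
using CuArrays

# Hypothetical sketch: below some size, kernel launch overhead dominates, so
# copy to the host and let Base reduce serially there.
function mapreduce_with_fallback(f, op, xs::CuArray; init, threshold=2^14)
    if length(xs) < threshold
        mapreduce(f, op, Array(xs); init=init)   # serial CPU path
    else
        mapreduce(f, op, xs; init=init)          # regular GPU path
    end
end

# e.g. mapreduce_with_fallback(x -> x^2, +, CuArrays.rand(1_000); init=0.0f0)
```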
OK, one problem is the missing specialization for mapreduce with multiple containers, falling back to a separate call to map and reduce.
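In other words, without that specialization `mapreduce(f, op, xs, ys)` first materializes `map(f, xs, ys)` as a temporary array and then reduces it, instead of applying `f` on the fly inside the reduction. Conceptually (plain CPU arrays here, not the actual CuArrays kernels):

```julia
# Unfused fallback: allocates an intermediate array, then reduces it.
unfused(f, op, xs, ys; init) = reduce(op, map(f, xs, ys); init=init)

# Fused version: applies `f` while reducing, no temporary allocation.
fused(f, op, xs, ys; init) = mapreduce(((x, y),) -> f(x, y), op, zip(xs, ys); init=init)

xs, ys = rand(1_000), rand(1_000)
inside(x, y) = (x^2 + y^2) < 1.0
unfused(inside, +, xs, ys; init=0) == fused(inside, +, xs, ys; init=0)  # true
```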
Ahh, that makes sense.
646: Improve mapreduce performance r=maleadt a=wongalvis14

~More than 3-fold improvement over the latest implementation~

Benchmarking function from #611

First stage: Using the number of "max parallel threads a single block can hold" as the number of blocks, perform reduction with serial iteration if needed.
Second stage: Reduction in a single block, no serial iteration.

This approach aims to strike an optimal balance between workload of each thread, kernel launch overhead and parallel resource exhaustion.

```
New impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial:
  memory estimate:  16.98 KiB
  allocs estimate:  468
  --------------
  minimum time:     2.520 ms (0.00% GC)
  median time:      2.536 ms (0.00% GC)
  mean time:        2.584 ms (0.64% GC)
  maximum time:     15.600 ms (50.62% GC)
  --------------
  samples:          1930
  evals/sample:     1

Old recursion impl:
julia> @benchmark pi_mc_cu(10000000)
BenchmarkTools.Trial:
  memory estimate:  17.05 KiB
  allocs estimate:  472
  --------------
  minimum time:     4.059 ms (0.00% GC)
  median time:      4.076 ms (0.00% GC)
  mean time:        4.130 ms (0.64% GC)
  maximum time:     23.199 ms (63.12% GC)
  --------------
  samples:          1209
  evals/sample:     1

Latest serial impl:
BenchmarkTools.Trial:
  memory estimate:  7.81 KiB
  allocs estimate:  242
  --------------
  minimum time:     8.544 ms (0.00% GC)
  median time:      8.579 ms (0.00% GC)
  mean time:        8.622 ms (0.27% GC)
  maximum time:     26.172 ms (41.80% GC)
  --------------
  samples:          580
  evals/sample:     1
```

Co-authored-by: wongalvis14 <[email protected]>
Co-authored-by: Tim Besard <[email protected]>
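For readers skimming the thread, here is a rough CPU-side analogue of that two-stage scheme (hypothetical code, not the PR's kernel; it ignores intra-block parallelism and only models "one partial result per block, then a single final reduction"):

```julia
# Stage 1: each "block" folds a strided slice of the input serially into a
# partial result. Stage 2: one final reduction over the per-block partials.
function two_stage_mapreduce(f, op, xs; init, nblocks=256)
    partials = fill(init, nblocks)
    for b in 1:nblocks
        acc = init
        for i in b:nblocks:length(xs)   # grid-stride-style serial iteration
            acc = op(acc, f(xs[i]))
        end
        partials[b] = acc
    end
    return reduce(op, partials; init=init)
end

# sanity check on the CPU:
# xs = rand(10^6); two_stage_mapreduce(abs2, +, xs; init=0.0) ≈ mapreduce(abs2, +, xs)
```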
1.7.3:

current master:

Minor regression, but the reduction now always happens on the GPU, while old GPUArrays performed the second phase on the CPU (which is invalid when using GPU-specific functions).
Here's an example for me on the master branch:

and here's that same example on the latest tagged version:

As you can see, I lost around a factor of 3 in performance on the new master. I tested the master version with and without `JULIA_CUDA_USE_BINARYBUILDER=false`, so binary builder is not the problem. Likely due to #602.
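For reference, that environment variable is typically set before the package is loaded, e.g. from the shell or as below; this is just an illustration, and the exact behaviour depends on the CuArrays/CUDAnative version in use.

```julia
# Ask CuArrays/CUDAnative to use a locally installed CUDA toolkit instead of the
# BinaryBuilder-provided artifacts (set before the package initializes).
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using CuArrays
```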