Generic softmax
#96
Should we create GPUNNlib for that? |
Or maybe just make GPUArrayMath.jl a library of generic math functions implemented for GPUArrays and put it in there? That would also be a good place for other things like stencil implementations. |
Would the problem be solved if the naive softmax in https://github.com/JuliaGPU/CuArrays.jl/issues/45 were provided in NNLib for AbstractArrays rather than a CPU-specialized multithreaded implementation? That could then be overridden for |
Yeah, that's probably the way to do it, and would delay having to write a kernel for it. It's annoying because we can't dispatch on CPU-ness, so we have to opt in all types like |
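As an illustration of that opt-in approach, here is a rough sketch (hypothetical code, not what NNlib actually contains) of how the generic fallback and the specialized methods could be arranged:
# generic fallback in NNlib, works for any AbstractArray via broadcasting
function softmax!(y::AbstractArray, x::AbstractArray)
    y .= exp.(x .- maximum(x, 1))
    y ./= sum(y, 1)
    return y
end
softmax(x::AbstractArray) = softmax!(similar(x), x)

# the CPU-specialized multithreaded method would stay restricted to Array:
# softmax!(y::Array, x::Array) = ... existing threaded implementation ...

# and a GPU package could opt in its own array type with a dedicated method:
# softmax!(y::CuArray, x::CuArray) = ... cuDNN call or handwritten kernel ...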
@SimonDanisch perhaps PyTorch's version can be ported as a generic kernel? |
Can you help me with some tests?
HostSoftMaxForward(
    input::AbstractArray{T}, output::AbstractArray{T},
    outer_size::Int, dim_size::Int, inner_size::Int, dim::Int,
    Epilogue = SoftMaxForwardEpilogue
)
I'm not entirely sure what would be good values for this... |
Here's a basic implementation:
function softmax!(y, x; dim = 1)
    tmp = maximum(x, dim)
    y .= exp.(x .- tmp)
    sum!(tmp, y)
    y ./= tmp
end
softmax(x; dim = 1) = softmax!(similar(x), x, dim = dim)
It's not totally obvious to me what the sizes are supposed to be, though, if that's what you mean. I guess there'll be wrapper code for it somewhere. |
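As a quick sanity check of the above (using the default dim = 1, so each column of the output should sum to one):
x = rand(Float32, 5, 3)
y = softmax(x)
sum(y, 1)   # 1×3 array, each entry ≈ 1.0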
Yeah, I meant those, so they're GPU-specific, I guess... |
I don't remember exactly, but one of the three sizes definitely refers to the size of the dimension being reduced over and the two others might be the size of the (single, implicitly collapsed) dimensions major and minor to the reduction dimension. In that case if the input is |
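Assuming that reading is correct (it mirrors how PyTorch collapses the dimensions around the reduction axis, but it is only a guess here), the three sizes for an input x and reduction dimension dim would be something like:
# hypothetical helper, assuming the outer/dim/inner interpretation above
function softmax_sizes(x::AbstractArray, dim::Int)
    outer_size = prod(size(x)[1:dim-1])     # product of dimensions before `dim`
    dim_size   = size(x, dim)               # the dimension being reduced over
    inner_size = prod(size(x)[dim+1:end])   # product of dimensions after `dim`
    return outer_size, dim_size, inner_size
end

# e.g. for a 64×2048 input reduced along dim = 1:
# softmax_sizes(rand(64, 2048), 1) == (1, 64, 2048)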
Ok, I ran out of time debugging this, but anyone can take a look at it here: #106 |
I was curious whether the generic, broadcast-based implementation could be made as fast as a handwritten softmax, even in principle, so I started by running the following benchmark (times are kernel times from nvprof):
using CuArrays, CUDAdrv
# times are for the last benchmark in this file, on a GT 750M
function softmax!(out::AbstractArray, x::AbstractArray; dim=1)
    tmp = maximum(x, dim)    # mapreducedim_kernel  70.6 ms
    out .= exp.(x .- tmp)    # broadcast_kernel     36.1 ms
    sum!(tmp, out)           # mapreducedim_kernel  72.2 ms
    out ./= tmp              # broadcast_kernel     27.8 ms
end                          # total               207.3 ms
# vs cudnnSoftmaxForward                            19.6 ms
function bench(sizes...)
    x = cu(rand(sizes...)); y = similar(x);
    CUDAdrv.synchronize()
    for _ in 1:3
        softmax!(y, x, dim=1)
        CUDAdrv.synchronize()
        CuArrays.CUDNN.cudnnSoftmaxForward(x, y)
        CUDAdrv.synchronize()
    end
end
# attention use case
bench(64, 64 * 32)
# classification use cases
bench(16384, 32)
bench(16384, 64 * 32) # times are using these sizes
The results are the same for the GPUArrays mapreducedim kernel, which is essentially identical to the CuArrays one. This suggests that there's a lot of performance left on the table with the default schedule used for |
Thanks for those numbers, that's pretty interesting :) I hope to have some time to look into this! |
V100 numbers:
using CuArrays, CUDAdrv
# times are for the last benchmark in this file, on GT 750M and V100
macro gtime(expr)
    quote
        t = @elapsed begin
            x = $(esc(expr))
            CUDAdrv.synchronize(x)
        end
        println("time for (", $(QuoteNode(expr)), "): ", t * 1000, "ms")
    end
end
function softmax!(out::AbstractArray, x::AbstractArray; dim=1)
    # kernel name                                 GT 750M   V100
    @gtime tmp = maximum(x, dim)    # mapreducedim_kernel  70.6 ms  3.81 ms
    @gtime out .= exp.(x .- tmp)    # broadcast_kernel     36.1 ms  0.65 ms
    @gtime sum!(tmp, out)           # mapreducedim_kernel  72.2 ms  3.76 ms
    @gtime out ./= tmp              # broadcast_kernel     27.8 ms  0.45 ms
end                                 # total               207.3 ms  9.35 ms
# vs cudnnSoftmaxForward                                   19.6 ms  0.82 ms
add_dims(x::AbstractArray) = reshape(x, (1, 1, size(x)...))
function bench(sizes...)
    x = cu(rand(sizes...)); y = similar(x);
    for _ in 1:3
        @gtime softmax!(y, x, dim=1)
        xx = add_dims(x); yy = add_dims(y)
        @gtime CuArrays.CUDNN.cudnnSoftmaxForward(xx, yy)
    end
end
# attention use case
bench(64, 64 * 32)
# classification use cases
bench(16384, 32)
bench(16384, 64 * 32) # times are using these sizes |
I have 350 μs with this CUDAnative kernel (so more than twice as fast as cuDNN!) |
(whoops, that was blatantly buggy; after correcting it and updating the gist I'm matching cuDNN's time almost exactly) |
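For readers without access to the gist, here is a rough sketch of what such a handwritten CUDAnative kernel could look like, assuming Float32 inputs, reduction along dim = 1, a power-of-two thread count, and one block per column (this is not the gist's kernel, just an illustration):
using CuArrays, CUDAnative, CUDAdrv

function softmax_kernel!(y, x)
    col = blockIdx().x
    tid = threadIdx().x
    n   = size(x, 1)
    shared = @cuStaticSharedMem(Float32, 256)

    # per-thread max over the column, then a shared-memory tree reduction
    m = -Inf32
    i = tid
    while i <= n
        m = max(m, x[i, col])
        i += blockDim().x
    end
    shared[tid] = m
    sync_threads()
    s = blockDim().x ÷ 2
    while s >= 1
        if tid <= s
            shared[tid] = max(shared[tid], shared[tid + s])
        end
        sync_threads()
        s ÷= 2
    end
    m = shared[1]
    sync_threads()

    # per-thread sum of exp(x - max), reduced the same way
    acc = 0f0
    i = tid
    while i <= n
        acc += CUDAnative.exp(x[i, col] - m)
        i += blockDim().x
    end
    shared[tid] = acc
    sync_threads()
    s = blockDim().x ÷ 2
    while s >= 1
        if tid <= s
            shared[tid] += shared[tid + s]
        end
        sync_threads()
        s ÷= 2
    end
    total = shared[1]

    # write the normalised column
    i = tid
    while i <= n
        y[i, col] = CUDAnative.exp(x[i, col] - m) / total
        i += blockDim().x
    end
    return nothing
end

# launch with one block per column and 256 threads (matching the shared array size):
# @cuda blocks=size(x, 2) threads=256 softmax_kernel!(y, x)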
Aside from adapt, this is probably the biggest blocker to running Flux with CLArrays right now. |