
Generic softmax #96

Open
MikeInnes opened this issue Feb 15, 2018 · 15 comments

@MikeInnes (Contributor)

Aside from `adapt`, this is probably the biggest blocker to running Flux with CLArrays right now.

@SimonDanisch (Member)

Should we create GPUNNlib for that?

@ChrisRackauckas (Member)

Or maybe just make GPUArrayMath.jl a library of generic math functions implemented for GPUArrays and put this in there? That would also be a good place for other things, like stencil implementations.

@jekbradbury

Would the problem be solved if the naive softmax in https://github.com/JuliaGPU/CuArrays.jl/issues/45 were provided in NNlib for AbstractArrays, rather than a CPU-specialized multithreaded implementation? It could then be overridden for ::GPUArray etc., with the multithreaded one used for ::Array. None of this is really going to make sense until Cassette lets us dispatch on "yeah, this is an abstract array of duals, but it's on the GPU!"

@MikeInnes (Contributor, Author)

Yeah, that's probably the way to do it, and it would delay having to write a kernel for it. It's annoying because we can't dispatch on CPU-ness, so we have to opt in every type, like Array, that supports multithreading.
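
A rough sketch of what that opt-in dispatch could look like (the threaded Array method is purely illustrative, not NNlib's actual implementation; it keeps the 0.6-era `maximum(x, dim)` syntax used elsewhere in this thread):

```julia
# Generic fallback: only broadcast and reductions, so it works for any
# AbstractArray, including GPU arrays.
function softmax!(y::AbstractArray, x::AbstractArray; dim = 1)
    tmp = maximum(x, dim)
    y .= exp.(x .- tmp)
    sum!(tmp, y)
    y ./= tmp
end

# Opt-in specialization for plain Arrays, where multithreading is known to be safe.
function softmax!(y::Matrix{T}, x::Matrix{T}; dim = 1) where T
    @assert dim == 1
    Threads.@threads for j in 1:size(x, 2)
        m = maximum(view(x, :, j))
        s = zero(T)
        for i in 1:size(x, 1)
            y[i, j] = exp(x[i, j] - m)
            s += y[i, j]
        end
        for i in 1:size(x, 1)
            y[i, j] /= s
        end
    end
    return y
end
```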

@MikeInnes (Contributor, Author)

@SimonDanisch perhaps PyTorch's version can be ported as a generic kernel?

@SimonDanisch (Member)

Can you help me with some tests?
I have a version that isn't erroring anymore, but the values seem to be wrong.

```julia
HostSoftMaxForward(
        input::AbstractArray{T}, output::AbstractArray{T},
        outer_size::Int, dim_size::Int, inner_size::Int, dim::Int,
        Epilogue = SoftMaxForwardEpilogue
)
```

I'm not entirely sure what would be good values for this...

@MikeInnes (Contributor, Author)

Here's a basic implementation:

```julia
function softmax!(y, x; dim = 1)
    tmp = maximum(x, dim)
    y .= exp.(x .- tmp)
    sum!(tmp, y)
    y ./= tmp
end

softmax(x; dim = 1) = softmax!(similar(x), x, dim = dim)
```

It's not totally obvious to me what the sizes are supposed to be, if that's what you mean. I guess there'll be wrapper code for it somewhere.
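
For the tests: a minimal correctness check could compare against a plain per-column loop, something like this (sizes here are arbitrary):

```julia
using Base.Test  # Julia 0.6-era; `using Test` on later versions

# Reference implementation: an independent softmax over each column.
function softmax_columns(x)
    y = similar(x)
    for j in 1:size(x, 2)
        c = x[:, j] .- maximum(x[:, j])
        y[:, j] = exp.(c) ./ sum(exp.(c))
    end
    return y
end

x = rand(Float32, 10, 7)
@test softmax(x) ≈ softmax_columns(x)
@test all(abs.(sum(softmax(x), 1) .- 1) .< 1e-6)  # every column sums to 1
```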

@SimonDanisch (Member)

Yeah, I meant those, so I guess they're GPU-specific...

@jekbradbury commented Feb 26, 2018

I don't remember exactly, but one of the three sizes definitely refers to the size of the dimension being reduced over, and the other two might be the sizes of the (single, implicitly collapsed) dimensions major and minor to the reduction dimension. In that case, if the input is rand(2, 3, 4, 5, 6) and the softmax dimension is dim=3, these would be 5x6=30, 4, and 2x3=6?
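
If that's the layout, the three sizes just come from collapsing the dimensions before and after `dim`. A small hypothetical helper, only to make the example above concrete (which product ends up as `outer_size` versus `inner_size` in the ported kernel is exactly the row-major/column-major question here):

```julia
# Split an array's shape into the collapsed size before the softmax dimension,
# the size of that dimension, and the collapsed size after it.
function softmax_sizes(x::AbstractArray, dim::Int)
    before = prod(size(x)[1:dim-1])
    after  = prod(size(x)[dim+1:end])
    return before, size(x, dim), after
end

softmax_sizes(rand(2, 3, 4, 5, 6), 3)  # -> (6, 4, 30), i.e. 2x3, 4, and 5x6
```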

@SimonDanisch (Member)

OK, I've run out of time debugging this, but anyone can take a look at it here: #106

@jekbradbury

I was curious whether the generic, broadcast-based implementation could be made as fast as a handwritten softmax, even in principle, so I started by running the following benchmark (times are kernel times from nvprof):

```julia
using CuArrays, CUDAdrv

# times are for the last benchmark in this file, on a GT 750M

function softmax!(out::AbstractArray, x::AbstractArray; dim=1)
  tmp = maximum(x, dim) # mapreducedim_kernel 70.6 ms
  out .= exp.(x .- tmp) # broadcast_kernel    36.1 ms
  sum!(tmp, out)        # mapreducedim_kernel 72.2 ms
  out ./= tmp           # broadcast_kernel    27.8 ms
end                     # total              207.3 ms
#                      vs cudnnSoftmaxForward 19.6 ms

function bench(sizes...)
    x = cu(rand(sizes...)); y = similar(x)
    CUDAdrv.synchronize()
    for _ in 1:3
        softmax!(y, x, dim=1)
        CUDAdrv.synchronize()
        CuArrays.CUDNN.cudnnSoftmaxForward(x, y)
        CUDAdrv.synchronize()
    end
end

# attention use case
bench(64, 64 * 32)
# classification use cases
bench(16384, 32)
bench(16384, 64 * 32) # times are using these sizes
```

The results are the same for the GPUArrays mapreducedim kernel, which is essentially identical to the CuArrays one. This suggests that there's a lot of performance left on the table with the default schedule used for broadcast_kernel and mapreducedim_kernel (at least on this tiny GPU), and my initial question isn't really relevant yet because each of the four sub-kernels takes longer than the overall cuDNN softmax.

@SimonDanisch (Member)

Thanks for those numbers, that's pretty interesting :) I hope to have some time to look into this!

@jekbradbury

V100 numbers:

```julia
using CuArrays, CUDAdrv

# times are for the last benchmark in this file, on GT 750M and V100

# time an expression plus the GPU work it launches, and print the result
macro gtime(expr)
    quote
        t = @elapsed begin
            $(esc(expr))
            CUDAdrv.synchronize()
        end
        println("time for (", $(QuoteNode(expr)), "): ", t * 1000, "ms")
    end
end

function softmax!(out::AbstractArray, x::AbstractArray; dim=1)
                              # kernel name          GT 750M   V100
 @gtime tmp = maximum(x, dim) # mapreducedim_kernel  70.6 ms  3.81 ms
 @gtime out .= exp.(x .- tmp) # broadcast_kernel     36.1 ms  0.65 ms
 @gtime sum!(tmp, out)        # mapreducedim_kernel  72.2 ms  3.76 ms
 @gtime out ./= tmp           # broadcast_kernel     27.8 ms  0.45 ms
end                           # total               207.3 ms  9.35 ms
#                            vs cudnnSoftmaxForward  19.6 ms  0.82 ms

# pad with singleton dimensions, since cudnnSoftmaxForward is called on 4-d tensors here
add_dims(x::AbstractArray) = reshape(x, (1, 1, size(x)...))

function bench(sizes...)
    x = cu(rand(sizes...)); y = similar(x)
    for _ in 1:3
        @gtime softmax!(y, x, dim=1)
        xx = add_dims(x); yy = add_dims(y)
        @gtime CuArrays.CUDNN.cudnnSoftmaxForward(xx, yy)
    end
end

# attention use case
bench(64, 64 * 32)
# classification use cases
bench(16384, 32)
bench(16384, 64 * 32) # times are using these sizes
```

@jekbradbury

I have 350 μs with this CUDAnative kernel (so more than twice as fast as cuDNN!)

@jekbradbury

(whoops, that was blatantly buggy; after correcting it and updating the gist I'm matching cuDNN's time almost exactly)
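
The actual kernel lives in the linked gist; purely as an illustration of the hand-written approach, a naive one-thread-per-column version (no shared-memory or warp-level reduction, and written against the present-day CUDA.jl API rather than the CUDAnative API used at the time) might look like:

```julia
using CUDA

# Naive column-wise softmax: one GPU thread handles one whole column.
# A competitive kernel would use a parallel (warp/block) reduction for the
# max and the sum instead of these serial loops.
function softmax_kernel!(y, x)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= size(x, 2)
        m = x[1, j]
        for i in 2:size(x, 1)          # column maximum, for numerical stability
            m = max(m, x[i, j])
        end
        s = zero(eltype(y))
        for i in 1:size(x, 1)          # exponentiate and accumulate the sum
            y[i, j] = exp(x[i, j] - m)
            s += y[i, j]
        end
        for i in 1:size(x, 1)          # normalize
            y[i, j] /= s
        end
    end
    return nothing
end

x = CUDA.rand(Float32, 16384, 64 * 32)
y = similar(x)
threads = 256
blocks = cld(size(x, 2), threads)
@cuda threads=threads blocks=blocks softmax_kernel!(y, x)
```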
