Improve Performance #11

Closed
chriselrod opened this issue Apr 5, 2020 · 8 comments

@chriselrod
Collaborator

Here I wanted to talk about my experiments in PaddedMatrices.
I'd also suggest that if Gaius's performance were at least as good, I could use it and have PaddedMatrices mostly just define array types. My discussion of matmul here, however, will focus on Julia's base array type rather than anything defined in PaddedMatrices -- which is why it would make just as much sense for that code to live in a separate repo.
That said, the implementation does currently make use of the PtrArray type defined in PaddedMatrices.

Additionally, OpenBLAS's release notes suggested AVX512 and AVX2 should both be seeing much better performance with release 0.3.9 than 0.3.7 (which Julia is currently on), so I checked out this PR, which upgrades Julia's OpenBLAS to 0.3.9:
JuliaLang/julia#35113
Hopefully it'll get merged to master soon (looks like AArch64 is having problems).

I'm seeing a phenomenal performance improvement by OpenBLAS on that release.

PaddedMatrices currently provides two different implementation approaches: jmul! and jmulh!. It also offers jmult!, which is a threaded version of jmul!.
The functions aren't too complicated; it should be straightforward to copy them here if you'd like to change approaches.
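For reference, a minimal usage sketch (the calls mirror the benchmark code below; the sizes are arbitrary):

using PaddedMatrices: jmul!, jmult!
A = rand(200, 200); B = rand(200, 200); C = zeros(200, 200);
jmul!(C, A, B)     # single-threaded
C ≈ A * B          # should be true
jmult!(C, A, B)    # threaded version of jmul!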

Here is a plot of percent theoretical performance vs size:
[Plot: percent of theoretical peak performance vs. matrix size for Gaius, PaddedMatrices, OpenBLAS, and MKL (Float64)]

Interestingly, beyond around 200x200, OpenBLAS is now the fastest by a fairly comfortable margin.
It also looks like my implementation needs work. Aside from being slower than both OpenBLAS and MKL, it slowly declines as size increases beyond 1000x1000, while OpenBLAS improves steadily and MKL sporadically.

Still, it's a marked improvement over Gaius on my computer -- hovering around 75% of theoretical peak, while Gaius declined to barely more than a third of that at the largest tested sizes.

Script to generate the plot:

using Gaius, StructArrays, LinearAlgebra, BenchmarkTools
using PaddedMatrices: jmul!
BLAS.set_num_threads(1); Base.Threads.nthreads()

randa(::Type{T}, dim...) where {T} = rand(T, dim...)
randa(::Type{T}, dim...) where {T <: Signed} = rand(T(-100):T(200), dim...)

const LIBDIRECTCALLJIT = "/home/chriselrod/.julia/dev/LoopVectorization/benchmark/libdcjtest.so"
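# (libdcjtest.so is a locally compiled helper library from LoopVectorization's benchmark
#  directory, used here for the MKL direct-call/JIT gemm; adjust the path for your machine.)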
istransposed(x) = false
istransposed(x::Adjoint) = true
istransposed(x::Transpose) = true
mkl_set_num_threads(N::Integer) = ccall((:set_num_threads, LIBDIRECTCALLJIT), Cvoid, (Ref{UInt32},), Ref(N % UInt32))
function mklmul!(C::AbstractVecOrMat{Float32}, A::AbstractVecOrMat{Float32}, B::AbstractVecOrMat{Float32})
    M, N = size(C); K = size(B, 1)
    ccall(
        (:sgemmjit, LIBDIRECTCALLJIT), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ref{UInt32},Ref{UInt32},Ref{UInt32},Ref{Bool},Ref{Bool}),
        parent(C), parent(A), parent(B),
        Ref(M % UInt32), Ref(K % UInt32), Ref(N % UInt32),
        Ref(istransposed(A)), Ref(istransposed(B))
    )
end
function mklmul!(C::AbstractVecOrMat{Float64}, A::AbstractVecOrMat{Float64}, B::AbstractVecOrMat{Float64})
    M, N = size(C); K = size(B, 1)
    ccall(
        (:dgemmjit, LIBDIRECTCALLJIT), Cvoid,
        (Ptr{Float64},Ptr{Float64},Ptr{Float64},Ref{UInt32},Ref{UInt32},Ref{UInt32},Ref{Bool},Ref{Bool}),
        parent(C), parent(A), parent(B),
        Ref(M % UInt32), Ref(K % UInt32), Ref(N % UInt32),
        Ref(istransposed(A)), Ref(istransposed(B))
    )
end

mkl_set_num_threads(1)

function runbench(::Type{T}) where {T}
    (StructVector ∘ map)([2, 4, 8:8:128..., round.(Int, (10:65) .^2.2)...]) do sz
        n, k, m = sz, sz, sz
        C1 = zeros(T, n, m)
        C2 = zeros(T, n, m)
        C3 = zeros(T, n, m)
        C4 = zeros(T, n, m)
        A  = randa(T, n, k)
        B  = randa(T, k, m)

        opb = @elapsed mul!(C1, A, B)
        if 2opb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            opb = min(opb, @belapsed mul!($C1, $A, $B))         #samples=100
        end
        lvb = @elapsed blocked_mul!(C2, A, B)
        if 2lvb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            lvb = min(lvb, @belapsed blocked_mul!($C2, $A, $B)) #samples=100
        end
        @assert C1 ≈ C2
        pmb = @elapsed jmul!(C3, A, B)
        if 2pmb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            pmb = min(pmb, @belapsed jmul!($C3, $A, $B))         #samples=100
        end
        @assert C1 ≈ C3
        if T <: Integer
            @show (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb, PaddedMatrices = pmb)
        else
            mklb = @elapsed mklmul!(C4, A, B)
            if 2mklb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
                mklb = min(mklb, @belapsed mklmul!($C4, $A, $B))         #samples=100
            end
            @assert C1 ≈ C4
            @show (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb, PaddedMatrices = pmb, MKL = mklb)
        end
    end
end
tf64 = runbench(Float64);
tf32 = runbench(Float32);
ti64 = runbench(Int64);
ti32 = runbench(Int32);

gflops(sz, st) = 2e-9 * sz^3 /st
using VectorizationBase: REGISTER_SIZE, FMA3
# I don't know how to query the clock speed (GHz) programmatically;
# your best bet is to check your BIOS.
# Alternatives are to look up your CPU model or run `watch -n1 "cat /proc/cpuinfo | grep MHz"`.
# Boost clocks and AVX downclocking complicate it.
const GHz = 4.1 
const W64 = REGISTER_SIZE ÷ sizeof(Float64) # vector width
const FMA_RATIO = FMA3 ? 2 : 1
const INSTR_PER_CLOCK = 2 # I don't know how to query this, but true for most recent CPUs
const PEAK_DGFLOPS = GHz * W64 * FMA_RATIO * INSTR_PER_CLOCK
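# e.g., with the values above on an AVX512 CPU: 4.1 * 8 * 2 * 2 = 131.2 GFLOPS (double precision, single core)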

using DataFrames, VegaLite
function create_df(res)
    df = DataFrame(
        Size = res.matrix_size,
        Gaius = res.lvBLAS,
        PaddedMatrices = res.PaddedMatrices,
        OpenBLAS = res.OpenBLAS,
        MKL = res.MKL
    );
    dfs = stack(df, [:Gaius, :PaddedMatrices, :OpenBLAS, :MKL], variable_name = :Library, value_name = :Time);
    dfs.GFLOPS = gflops.(dfs.Size, dfs.Time);
    dfs.Percent_Peak = 100 .* dfs.GFLOPS ./ PEAK_DGFLOPS;
    dfs
end

res = create_df(tf64)
plt = res |> @vlplot(
    :line, color = :Library,
    x = {:Size, scale={type=:log}}, y = {:Percent_Peak},#, scale={type=:log}},
    width = 900, height = 600
)
save(joinpath(PICTURES, "gemmf64.png"), plt)

The threading could use some work. The run eventually hung, then crashed on interrupt, which is a known issue with multithreading on Julia master.

julia> tf64 = runbench(Float64);
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2, lvBLAS = 2.3133534136546185e-8, OpenBLAS = 1.61001001001001e-8, PaddedMatrices = 2.432831325301205e-8, MKL = 9.716261879619851e-8)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4, lvBLAS = 2.709547738693467e-8, OpenBLAS = 1.1704939626783753e-7, PaddedMatrices = 2.841608040201005e-8, MKL = 1.4730528846153845e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 8, lvBLAS = 4.726619433198381e-8, OpenBLAS = 1.4663221153846155e-7, PaddedMatrices = 5.7628687690742625e-8, MKL = 1.6828853754940714e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 16, lvBLAS = 1.292370203160271e-7, OpenBLAS = 2.966718146718147e-7, PaddedMatrices = 1.4987575392038602e-7, MKL = 2.406112469437653e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 24, lvBLAS = 2.877228464419476e-7, OpenBLAS = 5.840718232044199e-7, PaddedMatrices = 3.277236842105263e-7, MKL = 4.06315e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 32, lvBLAS = 6.516871165644171e-7, OpenBLAS = 1.0868e-6, PaddedMatrices = 7.048827586206896e-7, MKL = 8.12183908045977e-7)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 40, lvBLAS = 5.594e-6, OpenBLAS = 1.9357e-6, PaddedMatrices = 1.2344e-6, MKL = 1.3357e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 48, lvBLAS = 6.238e-6, OpenBLAS = 2.9292222222222223e-6, PaddedMatrices = 2.477444444444444e-6, MKL = 2.0276666666666667e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 56, lvBLAS = 5.435e-6, OpenBLAS = 4.584571428571428e-6, PaddedMatrices = 3.721e-6, MKL = 3.086625e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 64, lvBLAS = 5.819e-6, OpenBLAS = 6.3324e-6, PaddedMatrices = 5.145333333333333e-6, MKL = 3.002e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 72, lvBLAS = 1.8404e-5, OpenBLAS = 1.1107e-5, PaddedMatrices = 7.32275e-6, MKL = 3.200625e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 80, lvBLAS = 2.0688e-5, OpenBLAS = 2.1968e-5, PaddedMatrices = 9.406e-6, MKL = 3.353125e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 88, lvBLAS = 1.9789e-5, OpenBLAS = 2.0916e-5, PaddedMatrices = 1.3432e-5, MKL = 3.8824285714285714e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 96, lvBLAS = 1.8742e-5, OpenBLAS = 2.2784e-5, PaddedMatrices = 1.7908e-5, MKL = 4.099428571428571e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 104, lvBLAS = 2.8341e-5, OpenBLAS = 2.3449e-5, PaddedMatrices = 2.0834e-5, MKL = 4.346714285714285e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 112, lvBLAS = 2.9761e-5, OpenBLAS = 2.3784e-5, PaddedMatrices = 2.5557e-5, MKL = 4.823285714285715e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 120, lvBLAS = 3.4951e-5, OpenBLAS = 2.3948e-5, PaddedMatrices = 3.3876e-5, MKL = 4.676857142857143e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 128, lvBLAS = 3.4679e-5, OpenBLAS = 2.4699e-5, PaddedMatrices = 4.0365e-5, MKL = 5.17e-6)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 158, lvBLAS = 4.6798e-5, OpenBLAS = 5.6155e-5, PaddedMatrices = 8.2639e-5, MKL = 1.0184e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 195, lvBLAS = 6.4276e-5, OpenBLAS = 6.7389e-5, PaddedMatrices = 0.000157312, MKL = 1.3662e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 237, lvBLAS = 0.000102349, OpenBLAS = 8.5743e-5, PaddedMatrices = 0.000276445, MKL = 2.4355e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 282, lvBLAS = 0.000129655, OpenBLAS = 9.3392e-5, PaddedMatrices = 0.000390776, MKL = 3.1123e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 332, lvBLAS = 0.000178715, OpenBLAS = 0.000143093, PaddedMatrices = 0.00061251, MKL = 4.6239e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 387, lvBLAS = 0.000251371, OpenBLAS = 0.000233954, PaddedMatrices = 0.000880218, MKL = 8.6194e-5)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 446, lvBLAS = 0.000341801, OpenBLAS = 0.000334483, PaddedMatrices = 0.001141395, MKL = 0.000148388)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 509, lvBLAS = 0.000471455, OpenBLAS = 0.000403684, PaddedMatrices = 0.001459838, MKL = 0.000218345)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 578, lvBLAS = 0.000601629, OpenBLAS = 0.000588513, PaddedMatrices = 0.00173781, MKL = 0.000281904)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 651, lvBLAS = 0.000865018, OpenBLAS = 0.000740441, PaddedMatrices = 0.002665009, MKL = 0.000396309)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 728, lvBLAS = 0.000850863, OpenBLAS = 0.000860389, PaddedMatrices = 0.003238482, MKL = 0.000501772)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 811, lvBLAS = 0.001391089, OpenBLAS = 0.001634212, PaddedMatrices = 0.00392088, MKL = 0.000721424)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 898, lvBLAS = 0.001917808, OpenBLAS = 0.001884708, PaddedMatrices = 0.004728201, MKL = 0.000931751)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 990, lvBLAS = 0.002527852, OpenBLAS = 0.002173313, PaddedMatrices = 0.005980197, MKL = 0.001318157)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1088, lvBLAS = 0.003033522, OpenBLAS = 0.00255263, PaddedMatrices = 0.00694296, MKL = 0.001701482)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1190, lvBLAS = 0.004519446, OpenBLAS = 0.003599941, PaddedMatrices = 0.008287474, MKL = 0.002416439)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1297, lvBLAS = 0.005846859, OpenBLAS = 0.004166108, PaddedMatrices = 0.010642048, MKL = 0.002729079)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1409, lvBLAS = 0.007676825, OpenBLAS = 0.005031821, PaddedMatrices = 0.012124711, MKL = 0.003461087)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1527, lvBLAS = 0.009942827, OpenBLAS = 0.006043474, PaddedMatrices = 0.014811819, MKL = 0.004338835)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1649, lvBLAS = 0.01252906, OpenBLAS = 0.007397781, PaddedMatrices = 0.016704083, MKL = 0.005259515)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1777, lvBLAS = 0.015304498, OpenBLAS = 0.008571394, PaddedMatrices = 0.019045784, MKL = 0.006528608)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 1910, lvBLAS = 0.019378427, OpenBLAS = 0.009448816, PaddedMatrices = 0.022792723, MKL = 0.008063704)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2048, lvBLAS = 0.051137142, OpenBLAS = 0.010611481, PaddedMatrices = 0.025143273, MKL = 0.010806103)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2191, lvBLAS = 0.030902612, OpenBLAS = 0.013496804, PaddedMatrices = 0.030446266, MKL = 0.011869482)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2340, lvBLAS = 0.040938443, OpenBLAS = 0.015844881, PaddedMatrices = 0.033634225, MKL = 0.015386527)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2494, lvBLAS = 0.047911804, OpenBLAS = 0.018747578, PaddedMatrices = 0.040595921, MKL = 0.018101961)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2654, lvBLAS = 0.058224365, OpenBLAS = 0.022381999, PaddedMatrices = 0.044077997, MKL = 0.020562614)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2819, lvBLAS = 0.073470818, OpenBLAS = 0.027041187, PaddedMatrices = 0.050594931, MKL = 0.025797043)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 2989, lvBLAS = 0.090583819, OpenBLAS = 0.030988165, PaddedMatrices = 0.056894353, MKL = 0.030153094)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3165, lvBLAS = 0.10973272, OpenBLAS = 0.037257813, PaddedMatrices = 0.067711555, MKL = 0.03372817)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3346, lvBLAS = 0.13072293, OpenBLAS = 0.042420543, PaddedMatrices = 0.078059161, MKL = 0.037653038)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3533, lvBLAS = 0.171608963, OpenBLAS = 0.053996652, PaddedMatrices = 0.084271564, MKL = 0.047052213)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3725, lvBLAS = 0.212338202, OpenBLAS = 0.063929694, PaddedMatrices = 0.126757958, MKL = 0.054310823)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 3923, lvBLAS = 0.247940048, OpenBLAS = 0.074915892, PaddedMatrices = 0.158934426, MKL = 0.063976922)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4127, lvBLAS = 0.295284462, OpenBLAS = 0.085741933, PaddedMatrices = 0.175556075, MKL = 0.070077551)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4336, lvBLAS = 0.315173005, OpenBLAS = 0.095162078, PaddedMatrices = 0.196636965, MKL = 0.08202249)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4551, lvBLAS = 0.426154178, OpenBLAS = 0.108096275, PaddedMatrices = 0.221490565, MKL = 0.100128825)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4771, lvBLAS = 0.452505485, OpenBLAS = 0.124190155, PaddedMatrices = 0.245447973, MKL = 0.105866294)
(matrix_size = sz, lvBLAS = lvb, OpenBLAS = opb, PaddedMatrices = pmb, MKL = mklb) = (matrix_size = 4997, lvBLAS = 0.504627015, OpenBLAS = 0.146173103, PaddedMatrices = 0.277543906, MKL = 0.125505135)
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_mutex_unlock at /home/chriselrod/Documents/languages/julia-old/src/locks.h:143 [inlined]
jl_task_get_next at /home/chriselrod/Documents/languages/julia-old/src/partr.c:441
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
jl_apply at /home/chriselrod/Documents/languages/julia-old/src/julia.h:1687 [inlined]
jl_finish_task at /home/chriselrod/Documents/languages/julia-old/src/task.c:198
start_task at /home/chriselrod/Documents/languages/julia-old/src/task.c:697
unknown function (ip: (nil))

Results for the largest size:

julia> M = round(Int, 65^2.2)
9737

julia> A = rand(M,M); B = rand(M,M); C1 = Matrix{Float64}(undef,M,M); C2 = similar(C1); C3 = similar(C1); C4 = similar(C1);

julia> @time blocked_mul!(C1, A, B);
  5.515995 seconds (184.73 k allocations: 22.786 MiB, 0.16% gc time)

julia> @time jmult!(C2, A, B);
  1.578601 seconds (172.36 k allocations: 21.697 MiB)

julia> @time mul!(C3, A, B);
  0.890681 seconds

julia> @time mklmul!(C4, A, B);
  0.889272 seconds

julia> C1 ≈ C2 ≈ C3 ≈ C4
true

Associated GFLOPS (peak is min(num_cores, Threads.nthreads()) times the single threaded peak; that was 18x in this case):

julia> gflops(M, 5.515995) |> x -> (x, 100x/PEAK_DGFLOPS) # Gaius
(334.7199838118055, 14.17344104894163)

julia> gflops(M, 1.578601) |> x -> (x, 100x/PEAK_DGFLOPS) # PaddedMatrices
(1169.5886149229605, 49.52526316577576)

julia> gflops(M, 0.890681) |> x -> (x, 100x/PEAK_DGFLOPS) # OpenBLAS
(2072.923703442647, 87.77624083005789)

julia> gflops(M, 0.889272) |> x -> (x, 100x/PEAK_DGFLOPS) # MKL
(2076.208131039772, 87.91531720188738)

So OpenBLAS and MKL were both close to 88% at this point.
PaddedMatrices was at just under half the theoretical peak, and Gaius was at about 14%.
There is a lot of room for improvement in both cases.

PaddedMatrices is very slow to start using threads. In the copy-and-paste from when the benchmarks were running, Gaius was faster until 2191 x 2191 matrices, where the two were about equally fast; by 4997 x 4997, PaddedMatrices was approaching twice as fast, and at 9737 x 9737 it was more than three times as fast -- but still far behind OpenBLAS and MKL.

MKL did very well with multi-threading, even at relatively small sizes.
jmult! could be modified to ramp up thread use more intelligently.

The approach is summarized by this graphic:
[Diagram: the five loops around the microkernel, from the BLIS documentation]
For which you can thank the BLIS team.
LoopVectorization takes care of the microkernel, and 2 loops around it.
So the jmul! and jmult! functions add the other 3 loops (and also handle the remainders).
The function PaddedMatrices.matmul_params returns the mc, kc, and nc.
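Schematically (the loop numbering follows the BLIS diagram; the outer three loops are the ones written out in jmul!):

# loop 5: no -- steps of size nc over the columns of B and C
#   loop 4: ko -- steps of size kc over K; the kc x nc block of B is packed here
#     loop 3: mo -- steps of size mc over the rows of A and C; the mc x kc block of A is packed here
#       loops 2 and 1 + the microkernel -- generated by LoopVectorization's @avx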

I've also compiled BLIS, so I'll add it to future benchmarks as well.

The BLIS loop link above also talks about threading. BLIS can thread any of the loops other than the k loop (I don't think you can multithread that one). PaddedMatrices.jmult! threads both the 3rd and 5th loops. However, rather than threading in a smart way (i.e., in a manner that reflects the total number of threads), it spawns a new task per iteration. When there are a lot of iterations, that's an excessive number of high-overhead tasks.
On the other hand, the outermost loop takes steps of size 600 on my computer -- which also means it takes a long time before it uses many threads at all.
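For illustration, the "task per iteration" pattern described above looks roughly like this (this is not the actual jmult! code, which packs and calls the LoopVectorization kernels; it just shows the spawning structure):

using LinearAlgebra
function task_per_block_mul!(C, A, B, mc, nc)
    M, N = size(C)
    @sync for no in 0:cld(N, nc)-1, mo in 0:cld(M, mc)-1
        ms = mo*mc + 1 : min((mo + 1)*mc, M)
        ns = no*nc + 1 : min((no + 1)*nc, N)
        # one task per (mo, no) block -- excessive when there are many blocks
        Threads.@spawn mul!(view(C, ms, ns), view(A, ms, :), view(B, :, ns))
    end
    C
end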

Would you be in favor of replacing the name-sake recursive implementation with this nested loop-based GEMM implementation?

@MasonProtter
Owner

Thanks for the write-up, Chris; your approach in PaddedMatrices.jl is really promising.

Would you be in favor of replacing the name-sake recursive implementation with this nested loop-based GEMM implementation?

Absolutely. I've started a few times trying to look into your implementation in PaddedMatrices.jl and don't fully grok it yet, but I think I just need to sit down and spend more time on it. That's a nice diagram above; it should be helpful.

How do you foresee this going? Just making Gaius depend on PaddedMatrices and then moving (an optimized version of) the matmul code from PaddedMatrices over to Gaius, or something else?

@chriselrod
Collaborator Author

chriselrod commented Apr 6, 2020

Absolutely. I've started a few times trying to look into your implementation in PaddedMatrices.jl and don't fully grok it yet

Here is the main function:

function jmul!(C::AbstractMatrix{Tc}, A::AbstractMatrix{Ta}, B::AbstractMatrix{Tb}, ::Val{mc}, ::Val{kc}, ::Val{nc}, ::Val{log2kc}) where {Tc, Ta, Tb, mc, kc, nc, log2kc}
    M, K, N = matmul_sizes(C, A, B)
    Niter, Nrem = divrem(N, nc)
    Miter, Mrem = divrem(M, mc)
    (iszero(Miter) && iszero(Niter)) && return loopmul!(C, A, B)
    Km1 = K - 1
    if (1 << log2kc) == kc
        Kiter = Km1 >>> log2kc
        Krem = (Km1 & (kc - 1)) # Krem guaranteed to be > 0
    else
        Kiter, Krem = divrem(Km1, kc)
    end
    Krem += 1
    Aptr = stridedpointer(A)
    Bptr = stridedpointer(B)
    Cptr = stridedpointer(C)
    GC.@preserve C A B LCACHE begin
        for no in 0:Niter-1
            # Krem
            # pack kc x nc block of B
            Bpacked_krem = pack_B_krem(Bptr, Val{kc}(), Val{nc}(), Tb, Krem, no)
            # Bpacked_krem = PtrMatrix{-1,nc}(gesp(Bptr, (Kiter*kc, no*nc)), Krem)
            for mo in 0:Miter-1
                # pack mc x kc block of A
                Apacked_krem = pack_A_krem(Aptr, Val{mc}(), Ta, mo, Krem)
                Cpmat = PtrMatrix{mc,nc}(gesp(Cptr, (mo*mc, no*nc)))
                loopmul!(Cpmat, Apacked_krem, Bpacked_krem)
            end
            # Mrem
            if Mrem > 0
                Apacked_mrem_krem = pack_A_mrem_krem(Aptr, Val{mc}(), Ta, Mrem, Krem, Miter)
                Cpmat_mrem = PtrMatrix{-1,nc}(gesp(Cptr, (Miter*mc, no*nc)), Mrem)
                loopmul!(Cpmat_mrem, Apacked_mrem_krem, Bpacked_krem)
            end
            k = Krem
            for ko in 1:Kiter
                # pack kc x nc block of B
                Bpacked = pack_B(Bptr, Val{kc}(), Val{nc}(), Tb, k, no)
                for mo in 0:Miter-1
                    # pack mc x kc block of A
                    Apacked = pack_A(Aptr, Val{mc}(), Val{kc}(), Ta, mo, k)
                    Cpmat = PtrMatrix{mc,nc}(gesp(Cptr, (mo*mc, no*nc)))
                    loopmuladd!(Cpmat, Apacked, Bpacked)
                end
                # Mrem
                if Mrem > 0
                    Apacked_mrem = pack_A_mrem(Aptr, Val{mc}(), Val{kc}(), Ta, Mrem, k, Miter)
                    Cpmat_mrem = PtrMatrix{-1,nc}(gesp(Cptr, (Miter*mc, no*nc)), Mrem)
                    loopmuladd!(Cpmat_mrem, Apacked_mrem, Bpacked)
                end
                k += kc
            end
        end
        # Nrem
        if Nrem > 0
            # Krem
            # pack kc x nc block of B
            Bpacked_krem_nrem = pack_B_krem_nrem(Bptr, Val{kc}(), Val{nc}(), Tb, Krem, Nrem, Niter)
            for mo in 0:Miter-1
                # pack mc x kc block of A
                Apacked_krem = pack_A_krem(Aptr, Val{mc}(), Ta, mo, Krem)
                Cpmat_nrem = PtrMatrix{mc,-1}(gesp(Cptr, (mo*mc, Niter*nc)), Nrem)
                loopmul!(Cpmat_nrem, Apacked_krem, Bpacked_krem_nrem)
            end
            # Mrem
            if Mrem > 0
                Apacked_mrem_krem = pack_A_mrem_krem(Aptr, Val{mc}(), Ta, Mrem, Krem, Miter)
                Cpmat_mrem_nrem = PtrMatrix{-1,-1}(gesp(Cptr, (Miter*mc, Niter*nc)), Mrem, Nrem)
                loopmul!(Cpmat_mrem_nrem, Apacked_mrem_krem, Bpacked_krem_nrem)
            end
            k = Krem
            for ko in 1:Kiter
                # pack kc x nc block of B
                Bpacked_nrem = pack_B_nrem(Bptr, Val{kc}(), Val{nc}(), Tb, k, Nrem, Niter)
                for mo in 0:Miter-1
                    # pack mc x kc block of A
                    Apacked = pack_A(Aptr, Val{mc}(), Val{kc}(), Ta, mo, k)
                    Cpmat_nrem = PtrMatrix{mc,-1}(gesp(Cptr, (mo*mc, Niter*nc)), Nrem)
                    loopmuladd!(Cpmat_nrem, Apacked, Bpacked_nrem)
                end
                # Mrem
                if Mrem > 0
                    Apacked_mrem = pack_A_mrem(Aptr, Val{mc}(), Val{kc}(), Ta, Mrem, k, Miter)
                    Cpmat_mrem_nrem = PtrMatrix{-1,-1}(gesp(Cptr, (Miter*mc, Niter*nc)), Mrem, Nrem)
                    loopmuladd!(Cpmat_mrem_nrem, Apacked_mrem, Bpacked_nrem)
                end
                k += kc
            end
        end
    end # GC.@preserve
    C
end # function

It's easier to follow if we assume all the _rems are zero. Making that assumption and dropping all the rem sections, we're left with just three nested loops:

    GC.@preserve C A B LCACHE begin
        for no in 0:Niter-1
            k = Krem # pretending all the rems are 0
            for ko in 1:Kiter
                # pack kc x nc block of B
                Bpacked = pack_B(Bptr, Val{kc}(), Val{nc}(), Tb, k, no)
                for mo in 0:Miter-1
                    # pack mc x kc block of A
                    Apacked = pack_A(Aptr, Val{mc}(), Val{kc}(), Ta, mo, k)
                    Cpmat = PtrMatrix{mc,nc}(gesp(Cptr, (mo*mc, no*nc)))
                    loopmuladd!(Cpmat, Apacked, Bpacked)
                end
                k += kc
            end
        end
    end # GC.@preserve

The three loops take steps of size nc, kc, and mc, respectively.
Matching the diagram, we pack B inside the two outermost loops.
Then, in the m loop inside that, we pack A.

LoopVectorization then takes care of the two innermost loops and the microkernel:

function loopmuladd!(
    C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix
)
    M, K, N = matmul_sizes(C, A, B)
    @avx for n ∈ 1:N, m ∈ 1:M
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Cₘₙ += A[m,k] * B[k,n]
            dummy1 = prefetch0(A, m, k + 3)
        end
        C[m,n] += Cₘₙ
    end
    nothing
end

to update a block of C with blocks from A and B.
One of the major differences between this and Gaius is that rather than taking views of A and B, it actually copies the memory ("packing" it) so that it is closer together. Packing B has an overhead of K * N, so its relative cost goes to 0 as M increases. Packing A, however, has an overhead proportional to M * K * (Niter + (Nrem > 0)), because the whole of A gets repacked once per block of columns of B.
Either way, packing itself is cheap relative to the cost of the matmul.
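As an illustration of what "packing" means (a simplified, hypothetical pack_block; the real pack_A/pack_B variants work with fixed block sizes, pointers, and alignment):

function pack_block(B::AbstractMatrix{T}, krange, nrange) where {T}
    buf = Matrix{T}(undef, length(krange), length(nrange))  # contiguous destination
    for (jb, j) in enumerate(nrange), (ib, i) in enumerate(krange)
        buf[ib, jb] = B[i, j]
    end
    buf
end
# e.g. the kc x nc block of B for iteration (ko, no):
# Bpacked = pack_block(B, k+1:k+kc, no*nc+1:(no+1)*nc)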

I still need to learn more about how TLBs work, but the relevant idea is that the TLB caches mappings of virtual memory addresses to physical memory addresses, and it has a limited number of entries. If you access a lot of memory that is far apart -- e.g., one column after another in an array with a large number of rows -- the TLB won't be able to hold all the mappings, and it will have to start recalculating them, stalling computation while the CPU waits for memory.

Packing largely solves this by moving the memory close together. The TLB can store all the mappings of the packed submatrices, and then you reuse them many times.

Packing also lets you do things like align the columns of A. You probably noticed the spikes in performance at multiples of 8. Loads that cross 64-byte boundaries take twice as long; when you pack, you can make sure no loads cross such a boundary (with AVX512, vector loads are 512 bits = 64 bytes, so if the pointer isn't aligned every load crosses such a boundary; with AVX2, only every other load does).
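An illustrative way to check column alignment on your own arrays (this helper is hypothetical, not part of any of these packages):

# does column j of a Float64 matrix start on a 64-byte boundary?
col_is_64byte_aligned(A::Matrix{Float64}, j) = UInt(pointer(A, (j - 1) * size(A, 1) + 1)) % 64 == 0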

Packing would also be when you switch to a struct of arrays representation (assuming the input isn't already a struct of arrays).

The next major point is choosing mc, kc, and nc. The packed block of B is supposed to fit in the L3 cache, and the packed block of A is supposed to fit in the L2.

julia> using PaddedMatrices: matmul_params

julia> matmul_params(Float64, Float64, Float64)
(200, 300, 600, 8)

These aren't necessarily optimal; you may need to experiment with them on your own computer.
They're functions of both the kernel size LoopVectorization chooses (40 x 5 on AVX512; it was once 32 x 4) and the cache sizes.
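As a rough sketch of that relationship (hypothetical helper and cache budgets, chosen here so the output matches the numbers above; this is not the actual matmul_params logic):

# A_packed is mc x kc and should fit in part of L2; B_packed is kc x nc and should fit in
# roughly one core's share of L3. Round mc and nc to multiples of the 40 x 5 kernel.
function rough_matmul_params(::Type{T}; kc = 300, mr = 40, nr = 5,
                             l2_budget = 512 * 2^10, l3_budget = 1408 * 2^10) where {T}
    mc = mr * max(1, (l2_budget ÷ (kc * sizeof(T))) ÷ mr)
    nc = nr * max(1, (l3_budget ÷ (kc * sizeof(T))) ÷ nr)
    (mc, kc, nc)
end
# rough_matmul_params(Float64) == (200, 300, 600) with these budgets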

You can try different values with:

@time jmul!(C1, A, B, Val(200), Val(300), Val(600), Val(0));

Finally, the last consideration is prefetching memory.

There is also a hardware prefetcher that triggers automatically. jmulh! was an experiment in trying to make better use of it, by loading elements from the packed A sequentially (reorganizing the elements as it packs) and by packing A only once.
But it didn't do as well:

julia> M = round(Int, 65^2.2)
9737

julia> A = rand(M,M); B = rand(M,M); C1 = Matrix{Float64}(undef,M,M); C2 = similar(C1); C3 = similar(C1);

julia> @time jmul!(C1, A, B);
 19.174104 seconds

julia> @time jmulh!(C2, A, B);
 20.312452 seconds

julia> @time mul!(C3, A, B);
 15.483092 seconds

julia> 2e-9 * M^3 / 19.174104 # jmul! single core GFLOPS
96.29204875002243

julia> 2e-9 * M^3 / 20.312452 # jmulh! single core GFLOPS
90.89566129711963

julia> 2e-9 * M^3 / 15.483092 # OpenBLAS single core GFLOPS
119.24709593574721

Odds are LoopVectorization will need some modifications to do much better here.


P.S. If you're on Linux, you may want to make sure the OS is using transparent huge pages (so that one TLB entry maps to more addresses):

julia> run(`cat /sys/kernel/mm/transparent_hugepage/enabled`); # path may be different based on distro
[always] madvise never

Always is selected on my computer, but that may not be the default for every distribution.

I think the default size for huge pages is 2 MB (i.e., the size of a 512 x 512 matrix of Float64), but it can be set to 1 GB instead. It would be interesting to see whether increasing the size helps.

@DilumAluthge
Collaborator

Would you be in favor of replacing the name-sake recursive implementation with this nested loop-based GEMM implementation?

I've been thinking a bit about which package we should put this nested loop-based implementation in.

One option is to replace the divide-and-conquer implementation in Gaius.jl with the new implementation. I would prefer not to do this; I think it would be nice to keep the current implementation in Gaius for learning purposes, since it is a relatively straightforward implementation.

A second option is to keep the divide-and-conquer algorithm in Gaius.jl and also add the nested loop algorithm to Gaius. Unfortunately, this would (1) complicate the Gaius codebase, and (2) result in an even longer Gaius test suite. I think, from a learning point of view, it makes more sense to keep the Gaius codebase simple.

A third option is to put this algorithm into PaddedMatrices.jl. But, as Chris points out above, this code isn't specific to PaddedMatrices.jl, so it doesn't necessarily make sense for it to live there.

The fourth option is to put the nested loop algorithm into its own package. (We can certainly borrow code from Gaius.jl, PaddedMatrices.jl, etc.)

I've started work on option 4. I've created a package here: https://github.com/DilumAluthge/Augustus.jl

The name is just a placeholder, and we can change it at any time. Any suggestions for the name?

Now we just need to decide where this package lives. Options include:

  1. I transfer the repo to @MasonProtter. Since Gaius.jl is currently owned by @MasonProtter, it would make sense for Augustus.jl (or whatever we name it) to also be owned by @MasonProtter.
  2. I transfer the repo to @chriselrod. Since LoopVectorization.jl is currently owned by @chriselrod, it would make sense for @chriselrod to also own this new repo.
  3. I am a member of the JuliaLinearAlgebra org. If we want, I can transfer this new repo to the JuliaLinearAlgebra org instead.

What do y'all think?

@MasonProtter
Owner

MasonProtter commented Dec 26, 2020

I don't particularly care where it lives. Another nice name, more in the vein of Gaius, would be Octavian, but I don't have strong preferences.

If you're going to write the package, I don't see why it shouldn't be owned by you. If it becomes useful, it may be a good idea to transfer it to JuliaLinearAlgebra.

@chriselrod
Collaborator Author

What're the goals?
Pedagogical value and maximal performance can sometimes be at odds -- although wherever they are, that's a sign that we're missing some abstractions or tools to make high performance easy.

The block sizes from this comment aren't that good, but it's an example implementation of the loop algorithm that doesn't take too many lines of code:
#19 (comment)

Another library that may be worth looking at, and that doesn't use LoopVectorization for its primary kernels, is:
https://github.com/YingboMa/MaBLAS.jl

A third option is to put this algorithm into PaddedMatrices.jl

PaddedMatrices already has a fairly simple single-threaded implementation, and I'm working on a multithreaded implementation at the moment.
It is simple in the sense that it uses a column-major memory layout. "Tile-major" layout, which can be represented with 3-d arrays, is supposed to perform better, especially on older CPUs; it is the memory layout that libraries like OpenBLAS and BLIS use internally.
That's also of course the case for the code in the comment I linked to above.
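For a concrete (purely illustrative) picture of what "tile-major" means: store an M x K matrix, with M a multiple of the kernel width mr, as an mr x K x (M ÷ mr) array so that each mr-row tile is contiguous:

function to_tile_major(A::AbstractMatrix{T}, mr::Integer) where {T}
    M, K = size(A)
    Mtiles, r = divrem(M, mr)
    @assert r == 0 "this illustration assumes mr divides M evenly"
    At = Array{T}(undef, mr, K, Mtiles)
    for t in 1:Mtiles, k in 1:K, i in 1:mr
        At[i, k, t] = A[(t - 1) * mr + i, k]
    end
    At
end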

@DilumAluthge
Collaborator

What're the goals?
Pedagogical value and maximal performance can sometimes be at odds -- although wherever they are, that's a sign that we're missing some abstractions or tools to make high performance easy.

I would say that for Octavian, it would be nice for the primary goal to be maximal performance.

@DilumAluthge
Collaborator

Moving to JuliaLinearAlgebra/Octavian.jl#19

@chriselrod
Collaborator Author

What're the goals?
Pedagogical value and maximal performance can sometimes be at odds -- although wherever they are, that's a sign that we're missing some abstractions or tools to make high performance easy.

I would say that for Octavian, it would be nice for the primary goal to be maximal performance.

That's what I like to hear!
