Description
TL;DR: `isless` for 2-tuples of integers is significantly faster when implemented by packing both values into a single wider integer.

While experimenting with ways to improve the performance of `sortperm`, I noticed that sorting tuples of `UInt64`s is much slower than sorting `UInt128`s. But a tuple of two `UInt64`s can be packed into a single `UInt128`:
```julia
pack(x::Tuple{UInt64, UInt64}) = UInt128(x[1]) << 64 | x[2]
```
Sorting 10^7 tuples with `by=pack` is almost twice as fast on my machine compared to the default.
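As a quick sanity check (a sketch, not part of the benchmark), one can verify that comparing packed values agrees with the default lexicographic tuple comparison in the unsigned case:

```julia
# Packing puts the first element in the high bits, so an unsigned
# comparison of the packed values is exactly lexicographic order.
pack(x::Tuple{UInt64, UInt64}) = UInt128(x[1]) << 64 | x[2]

for _ in 1:10_000
    a = (rand(UInt64), rand(UInt64))
    b = (rand(UInt64), rand(UInt64))
    @assert isless(a, b) == isless(pack(a), pack(b))
end
```

`sort!(v; by=pack)` then sorts by this packed key.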
This can also be seen in the assembly code of the default isless and a modified version with packed integers. Not only is the packed version shorter, it is also branchless:
Assembly for default `isless`:
```asm
# %bb.0: # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
; │ @ tuple.jl:541 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
movq (%rcx), %r8
movq (%rdx), %r9
movb $1, %al
cmpq %r9, %r8
; │└
; │ @ tuple.jl:541 within `isless`
jb .LBB0_4
# %bb.1: # %L6
jne .LBB0_2
# %bb.3: # %L8
; │ @ tuple.jl:541 within `isless` @ tuple.jl:541 @ operators.jl:421
; │┌ @ int.jl:487 within `<`
movq 8(%rcx), %rax
cmpq 8(%rdx), %rax
setb %al
.LBB0_4: # %common.ret
; │└
; │ @ tuple.jl:541 within `isless`
# kill: def $al killed $al killed $eax
popq %rbp
retq
.LBB0_2:
xorl %eax, %eax
# kill: def $al killed $al killed $eax
popq %rbp
retq
.Lfunc_end0:
.size julia_isless_1264, .Lfunc_end0-julia_isless_1264
.cfi_endproc
; └
# -- End function
```
Assembly for packed `isless`:
```asm
# %bb.0: # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:17 within `pack`
; ││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `shiftwiden`
; │││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `macro expansion`
; ││││┌ @ operators.jl:882 within `widen`
; │││││┌ @ number.jl:7 within `convert`
; ││││││┌ @ boot.jl:790 within `UInt128`
; │││││││┌ @ boot.jl:775 within `toUInt128`
movq (%rcx), %rax
; ││└└└└└└
; ││┌ @ int.jl:1040 within `|`
; │││┌ @ int.jl:525 within `rem`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ boot.jl:790 within `UInt128`
; ││││││┌ @ boot.jl:775 within `toUInt128`
movq 8(%rcx), %rcx
; │└└└└└└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
cmpq 8(%rdx), %rcx
sbbq (%rdx), %rax
setb %al
; │└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless`
popq %rbp
retq
.Lfunc_end0:
.size julia_isless_1694, .Lfunc_end0-julia_isless_1694
.cfi_endproc
; └
# -- End function
```
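Listings like the ones above can be reproduced with `InteractiveUtils` (the exact output depends on the CPU and Julia version; the tuple type here is just an example):

```julia
using InteractiveUtils

# Dump the native code for the default 2-tuple isless into a buffer.
# The printed assembly will differ between CPUs and Julia versions.
buf = IOBuffer()
code_native(buf, isless, Tuple{Tuple{UInt64, UInt64}, Tuple{UInt64, UInt64}})
print(String(take!(buf)))
```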
So I generalized this to all integer types up to 64 bits and benchmarked it with this script:
```julia
using BenchmarkTools
using NamedArrays

n = 10^7;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_ -> (rand(T1), rand(T2)), v, v)
packabletypes = [UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64]
N = length(packabletypes)

# benchmark base
times1 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    times1[i, j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
import Base.isless

# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    shiftwiden(t...) | maybe_flipsignbit(t[2])
isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    times2[i, j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1 ./ times2, sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
```
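The `maybe_flipsignbit` step is needed because the packed comparison is unsigned: XOR-ing a signed value with `typemin` flips its sign bit, which maps signed order onto unsigned order. A small check of that property (a sketch, not part of the benchmark):

```julia
# Flipping the sign bit via ⊻ typemin(T) is an order-preserving map
# from signed values into the corresponding unsigned type.
flip(x::T) where {T<:Signed} = reinterpret(unsigned(T), x ⊻ typemin(T))

for _ in 1:10_000
    a, b = rand(Int16), rand(Int16)
    @assert (a < b) == (flip(a) < flip(b))
end
```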
I get the following speedups for sorting 10^7 tuples, depending on the element types:
```
speedup:
8×8 Named Matrix{Float64}
A ╲ B  │  UInt8   Int8 UInt16  Int16 UInt32  Int32 UInt64  Int64
───────┼────────────────────────────────────────────────────────
UInt8  │   2.39    2.2   2.12   1.98    1.7   1.67    1.0  0.995
Int8   │   2.15   1.91   1.88   1.73   1.55   1.58  0.968  0.966
UInt16 │   2.32   2.24   2.35   2.21   2.21   2.02   1.67   1.65
Int16  │   2.09   1.96   2.16   2.02   1.95   1.91   1.67   1.52
UInt32 │   1.82   1.82    2.1   2.03   2.22   2.05   1.96   1.75
Int32  │   1.79   1.65   1.86   1.78    1.9   1.81    1.7    1.6
UInt64 │   1.08   1.09   1.62   1.62   1.94   1.83   1.96   1.82
Int64  │  0.969  0.983   1.56   1.57   1.77   1.67   1.72   1.63
```
(on this system)
```
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
```
Using a sorting function with branchless comparisons, the speedup is even bigger:
```
speedup:
8×8 Named Matrix{Float64}
A ╲ B  │  UInt8   Int8 UInt16  Int16 UInt32  Int32 UInt64  Int64
───────┼────────────────────────────────────────────────────────
UInt8  │   2.78   2.37    1.9    1.8   1.73   1.64   1.47   1.47
Int8   │   2.52   2.13   1.69   1.58   1.56   1.52   1.38   1.35
UInt16 │   2.43   2.29   2.51   2.37   2.44   2.29   2.16   2.13
Int16  │   2.22   2.21   2.35    2.2   2.18   2.04   2.01   1.96
UInt32 │   2.35   2.26   2.53   2.38   2.67   2.46   2.42   2.26
Int32  │   2.22   2.13   2.27   2.12   2.38    2.2   2.18    2.2
UInt64 │   1.76   1.79   2.19   2.15   2.32   2.17   2.26   2.28
Int64  │   1.72   1.72   2.09    2.0   2.15   2.07    2.2   2.15
```
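The branchless sorting function itself is not shown here; the key property is that a comparison computed with flag arithmetic (`setb`/`sbb`, no jumps) composes well with `ifelse`, which lowers to a conditional move. A hypothetical sketch of a branchless compare-exchange step:

```julia
# Hypothetical sketch: order a pair without branches.
# ifelse evaluates both arguments and selects one, typically via cmov.
function sort2(a, b)
    c = isless(b, a)
    ifelse(c, b, a), ifelse(c, a, b)
end
```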
This optimization does not seem to be useful for longer tuples, because the values are stored in reverse order in memory.
Here is a script to test the correctness:
```julia
using Test

packabletypes = [UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64]

@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    shiftwiden(t...) | maybe_flipsignbit(t[2])
newisless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    isless(pack(a), pack(b))

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), typemax(T) ⊻ one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, -one(T), zero(T), one(T), typemax(T)-1, typemax(T)]

for T1 in packabletypes, T2 in packabletypes
    vals1 = getTestvals(T1)
    vals2 = getTestvals(T2)
    for val_1_1 in vals1, val_2_1 in vals1
        for val_1_2 in vals2, val_2_2 in vals2
            @test isless((val_1_1, val_1_2), (val_2_1, val_2_2)) == newisless((val_1_1, val_1_2), (val_2_1, val_2_2))
        end
    end
end
```
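For illustration (an assumed mixed-width case, not from the benchmarks above): `shiftwiden` widens the first field until it can also hold the second, then shifts it above it, so a `UInt8` paired with a `UInt32` ends up in the upper half of a `UInt64`:

```julia
# Same generated function as in the scripts above: widen `a` until it is
# at least as wide as `b`, then widen once more and shift past `b`'s bits.
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

# UInt8 next to UInt32: widened to UInt32, then to UInt64, shifted by 32.
shiftwiden(0x01, UInt32(0))  # == UInt64(1) << 32
```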
Would it be worth including this optimization in base Julia?
Edit:
Code that is amenable to SIMD vectorization benefits even more:
```julia
using BenchmarkTools
using NamedArrays

versioninfo()
n = 10^3;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_ -> (rand(T1), rand(T2)), v, v)
packabletypes = [UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64]
N = length(packabletypes)

function benchmarkfun!(v, f)
    b = 0
    for i in eachindex(v)
        for j in eachindex(v)
            b += f(v[i], v[j])
        end
    end
    b
end

# benchmark base
times1 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    b = zeros(Int, n)
    times1[i, j] = @belapsed benchmarkfun!($v, isless) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    shiftwiden(t...) | maybe_flipsignbit(t[2])
newisless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    b = zeros(Int, n)
    times2[i, j] = @belapsed benchmarkfun!($v, newisless) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1 ./ times2, sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
```
```
speedup:
8×8 Named Matrix{Float64}
A ╲ B  │  UInt8   Int8 UInt16  Int16 UInt32  Int32 UInt64  Int64
───────┼────────────────────────────────────────────────────────
UInt8  │   7.98   5.76   7.65   6.05   3.92    3.8   3.43   2.47
Int8   │    9.7   6.72   6.19   5.12    3.9   3.81   2.47   1.96
UInt16 │   5.86    5.0    8.0   5.72   4.23   4.12   3.41   2.63
Int16  │   6.53    5.7   9.36   6.59   4.18    4.2   2.51   2.08
UInt32 │    4.0   4.01   4.24   4.05   4.35   4.55   2.59    1.8
Int32  │   4.15    4.1   4.38    4.2   5.14   4.73   2.04   1.61
UInt64 │   3.41   2.51   3.49   2.79   2.42   2.33   2.61   2.63
Int64  │   2.48   2.35   2.57   2.46   1.83   1.83   2.01   2.06
```
Extra benchmark of the regular `sort!` on different hardware:
```
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores
```
```
speedup:
8×8 Named Matrix{Float64}
A ╲ B  │  UInt8   Int8 UInt16  Int16 UInt32  Int32 UInt64  Int64
───────┼────────────────────────────────────────────────────────
UInt8  │   2.43   1.97   2.11   1.64   1.65   1.62   1.14   1.07
Int8   │   1.97    2.0   1.88   1.89   1.61   1.59    1.1   1.12
UInt16 │   2.25   2.07   2.35   2.12   2.04   2.14   1.77   1.68
Int16  │   1.81   1.92   2.29   2.06   1.89   1.95   1.76   1.69
UInt32 │   1.72   1.62   2.07   1.84   2.17   1.87   1.95   1.77
Int32  │   1.77   1.75   2.02   1.91   1.94   1.92   1.82   1.65
UInt64 │   1.17   1.16   1.55   1.61   1.96   1.73   1.91   1.79
Int64  │   1.15   1.12   1.62   1.55   1.63   1.56    1.9   1.74
```