
2x faster tuple sorting (for 2-Tuples of Integers) #48724

Closed
@LSchwerdt

Description

TL;DR: `isless` for 2-tuples of integers is significantly faster when implemented by packing both values into a single wider integer.

While experimenting with ways to improve the performance of sortperm, I noticed that sorting tuples of UInt64 is much slower than sorting UInt128s. But tuples of UInt64s can be packed into a single UInt128:

pack(x::Tuple{UInt64, UInt64}) = UInt128(x[1])<<64 | x[2]

Sorting 10^7 tuples with `by=pack` is almost twice as fast on my machine as the default sort.
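As a quick sanity check (a sketch, not from the issue itself): the packed key induces exactly the same ordering as the default lexicographic tuple order, so sorting by it is safe:

```julia
# Packing two UInt64s into a UInt128 preserves lexicographic tuple order:
# the first element lands in the high bits, the second breaks ties.
pack(x::Tuple{UInt64, UInt64}) = UInt128(x[1]) << 64 | x[2]

v = [(rand(UInt64), rand(UInt64)) for _ in 1:10_000]
@assert sort(v) == sort(v; by = pack)  # packed key gives the identical order
```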

This can also be seen in the assembly code of the default isless and a modified version with packed integers. Not only is the packed version shorter, it is also branchless:

Assembly for default `isless`
# %bb.0:                                # %top
      pushq   %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset %rbp, -16
      movq    %rsp, %rbp
      .cfi_def_cfa_register %rbp
; │ @ tuple.jl:541 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
      movq    (%rcx), %r8
      movq    (%rdx), %r9
      movb    $1, %al
      cmpq    %r9, %r8
; │└
; │ @ tuple.jl:541 within `isless`
      jb      .LBB0_4
# %bb.1:                                # %L6
      jne     .LBB0_2
# %bb.3:                                # %L8
; │ @ tuple.jl:541 within `isless` @ tuple.jl:541 @ operators.jl:421
; │┌ @ int.jl:487 within `<`
      movq    8(%rcx), %rax
      cmpq    8(%rdx), %rax
      setb    %al
.LBB0_4:                                # %common.ret
; │└
; │ @ tuple.jl:541 within `isless`
                                      # kill: def $al killed $al killed $eax       
      popq    %rbp
      retq
.LBB0_2:
      xorl    %eax, %eax
                                      # kill: def $al killed $al killed $eax       
      popq    %rbp
      retq
.Lfunc_end0:
      .size   julia_isless_1264, .Lfunc_end0-julia_isless_1264
      .cfi_endproc
; └
                                      # -- End function
Assembly for packed `isless`
# %bb.0:                                # %top
      pushq   %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset %rbp, -16
      movq    %rsp, %rbp
      .cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:17 within `pack`
; ││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `shiftwiden`      
; │││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `macro expansion`
; ││││┌ @ operators.jl:882 within `widen`
; │││││┌ @ number.jl:7 within `convert`
; ││││││┌ @ boot.jl:790 within `UInt128`
; │││││││┌ @ boot.jl:775 within `toUInt128`
      movq    (%rcx), %rax
; ││└└└└└└
; ││┌ @ int.jl:1040 within `|`
; │││┌ @ int.jl:525 within `rem`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ boot.jl:790 within `UInt128`
; ││││││┌ @ boot.jl:775 within `toUInt128`
      movq    8(%rcx), %rcx
; │└└└└└└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
      cmpq    8(%rdx), %rcx
      sbbq    (%rdx), %rax
      setb    %al
; │└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless`
      popq    %rbp
      retq
.Lfunc_end0:
      .size   julia_isless_1694, .Lfunc_end0-julia_isless_1694
      .cfi_endproc
; └
                                      # -- End function

So I generalized this to all integer types up to 64 bits, and using

this benchmarking script
using BenchmarkTools
using NamedArrays

n = 10^7;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64]
N = length(packabletypes)

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
import Base.isless
# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a,b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8*$sizea))    
end
# map a signed value to an unsigned one while preserving order (flip the sign bit)
maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} = 
shiftwiden(t...) | maybe_flipsignbit(t[2])

isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
 isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
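To make the generated packing concrete, here is what `shiftwiden` and `pack` produce for two type combinations. The definitions are restated from the script above (with the `Union{packabletypes...}` constraint dropped for brevity) so the snippet runs on its own:

```julia
# Restated from the benchmarking script so this snippet is self-contained.
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end
maybe_flipsignbit(x::Unsigned) = x
maybe_flipsignbit(x::T) where {T<:Signed} = reinterpret(unsigned(T), x ⊻ typemin(T))
pack(t::Tuple) = shiftwiden(t...) | maybe_flipsignbit(t[2])

# Two UInt8s pack into a UInt16, with the first element in the high byte:
@assert pack((0x01, 0x02)) === 0x0102
# A UInt8 paired with a UInt32 widens until both fit in a UInt64,
# placing the first element above the second's 32 bits:
@assert pack((0x01, UInt32(2))) === 0x0000000100000002
```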

I get the following speedups for sorting 10^7 tuples, depending on the element types:

speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.39     2.2    2.12    1.98     1.7    1.67     1.0   0.995
Int8   │   2.15    1.91    1.88    1.73    1.55    1.58   0.968   0.966
UInt16 │   2.32    2.24    2.35    2.21    2.21    2.02    1.67    1.65
Int16  │   2.09    1.96    2.16    2.02    1.95    1.91    1.67    1.52
UInt32 │   1.82    1.82     2.1    2.03    2.22    2.05    1.96    1.75
Int32  │   1.79    1.65    1.86    1.78     1.9    1.81     1.7     1.6
UInt64 │   1.08    1.09    1.62    1.62    1.94    1.83    1.96    1.82
Int64  │  0.969   0.983    1.56    1.57    1.77    1.67    1.72    1.63
(on this system)
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12

With a sorting function that uses branchless comparisons, the speedup is even larger:

speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.78    2.37     1.9     1.8    1.73    1.64    1.47    1.47
Int8   │   2.52    2.13    1.69    1.58    1.56    1.52    1.38    1.35
UInt16 │   2.43    2.29    2.51    2.37    2.44    2.29    2.16    2.13
Int16  │   2.22    2.21    2.35     2.2    2.18    2.04    2.01    1.96
UInt32 │   2.35    2.26    2.53    2.38    2.67    2.46    2.42    2.26
Int32  │   2.22    2.13    2.27    2.12    2.38     2.2    2.18     2.2
UInt64 │   1.76    1.79    2.19    2.15    2.32    2.17    2.26    2.28
Int64  │   1.72    1.72    2.09     2.0    2.15    2.07     2.2    2.15

This optimization does not seem useful for longer tuples, because the packed value stores the elements in reverse order relative to their memory layout.

Here is a script to test correctness.
using Test

packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64]

@generated function shiftwiden(a,b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8*$sizea))    
end
# map a signed value to an unsigned one while preserving order (flip the sign bit)
maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} = 
shiftwiden(t...) | maybe_flipsignbit(t[2])

newisless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
 isless(pack(a), pack(b))

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), typemax(T) - one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, -one(T), zero(T), one(T), typemax(T)-1, typemax(T)]

for T1 in packabletypes, T2 in packabletypes
    vals1 = getTestvals(T1)
    vals2 = getTestvals(T2)
    for val_1_1 in vals1, val_2_1 in vals1
        for val_1_2 in vals2, val_2_2 in vals2
            @test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == newisless((val_1_1,val_1_2),(val_2_1,val_2_2))
        end
    end
end
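The correctness of the signed cases rests on the standard order-preserving map from signed to unsigned integers used by `maybe_flipsignbit`: xor with `typemin`, i.e. flipping the sign bit. A minimal illustration:

```julia
maybe_flipsignbit(x::Unsigned) = x
maybe_flipsignbit(x::T) where {T<:Signed} = reinterpret(unsigned(T), x ⊻ typemin(T))

# Flipping the sign bit maps the signed range monotonically onto the
# unsigned range, so unsigned comparison reproduces signed ordering:
@assert maybe_flipsignbit(Int8(-128)) === 0x00  # smallest signed -> smallest unsigned
@assert maybe_flipsignbit(Int8(0))    === 0x80
@assert maybe_flipsignbit(Int8(127))  === 0xff  # largest signed -> largest unsigned
@assert issorted(maybe_flipsignbit.(Int8(-128):Int8(127)))
```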

Is it worthwhile to include this optimization in Base Julia?

Edit:

Code amenable to SIMD vectorization benefits even more:
using BenchmarkTools
using NamedArrays

versioninfo()
n = 10^3;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64]
N = length(packabletypes)

function benchmarkfun!(v,f)
    b = 0
    for i in eachindex(v)
        for j in eachindex(v)
            b += f(v[i],v[j])
        end
    end
    b
end

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    b = zeros(Int,n)
    times1[i,j] = @belapsed benchmarkfun!($v,isless) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a,b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8*$sizea))    
end
# map a signed value to an unsigned one while preserving order (flip the sign bit)
maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} = 
shiftwiden(t...) | maybe_flipsignbit(t[2])

newisless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
 isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    b = zeros(Int,n)
    times2[i,j] = @belapsed benchmarkfun!($v,newisless) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   7.98    5.76    7.65    6.05    3.92     3.8    3.43    2.47
Int8   │    9.7    6.72    6.19    5.12     3.9    3.81    2.47    1.96
UInt16 │   5.86     5.0     8.0    5.72    4.23    4.12    3.41    2.63
Int16  │   6.53     5.7    9.36    6.59    4.18     4.2    2.51    2.08
UInt32 │    4.0    4.01    4.24    4.05    4.35    4.55    2.59     1.8
Int32  │   4.15     4.1    4.38     4.2    5.14    4.73    2.04    1.61
UInt64 │   3.41    2.51    3.49    2.79    2.42    2.33    2.61    2.63
Int64  │   2.48    2.35    2.57    2.46    1.83    1.83    2.01    2.06
Extra benchmark of the regular sort! on different hardware
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores

speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.43    1.97    2.11    1.64    1.65    1.62    1.14    1.07
Int8   │   1.97     2.0    1.88    1.89    1.61    1.59     1.1    1.12
UInt16 │   2.25    2.07    2.35    2.12    2.04    2.14    1.77    1.68
Int16  │   1.81    1.92    2.29    2.06    1.89    1.95    1.76    1.69
UInt32 │   1.72    1.62    2.07    1.84    2.17    1.87    1.95    1.77
Int32  │   1.77    1.75    2.02    1.91    1.94    1.92    1.82    1.65
UInt64 │   1.17    1.16    1.55    1.61    1.96    1.73    1.91    1.79
Int64  │   1.15    1.12    1.62    1.55    1.63    1.56     1.9    1.74
