Hello there, thank you for the great job you are doing.
I recently implemented sort! in Metal and KA, after teaching Bitonic Sort for over 25 years in advanced parallel programming classes.
The two implementations are identical except in two respects:
The semantics of ndrange differ from the "number of blocks/grid dimension or size" I was expecting, as is typical in CUDA and Metal kernel invocations.
In KA, I need to call get_backend within the main bitonic sorting function.
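To make the first difference concrete, here is a minimal sketch of the launch arithmetic. The names `n`, `groupsize`, and `kernel!` are hypothetical, and the launch lines appear only as comments because they need a GPU and the respective packages. This is an illustration of the two conventions, not the code discussed in this issue.

```julia
# Hypothetical sizes, chosen only for illustration.
n = 1_000_000          # number of work-items the problem needs
groupsize = 256        # threads per threadgroup (Metal) / workgroup (KA)

# CUDA.jl / Metal.jl convention: the launch takes the NUMBER OF GROUPS.
groups = cld(n, groupsize)       # round up so every element is covered
launched = groups * groupsize    # threads actually started (may exceed n)

# KernelAbstractions convention: the launch takes the TOTAL number of
# work-items; the group count is derived from ndrange and the group size.
ndrange = n

# Launch forms (comments only; `kernel!` and `a` are placeholders):
#   Metal.jl: @metal threads=groupsize groups=groups kernel!(a)
#   KA:       kernel!(get_backend(a), groupsize)(a; ndrange=ndrange)

println("groups = ", groups, ", launched = ", launched, ", ndrange = ", ndrange)
```

In other words, a value that was passed as `groups` in a Metal launch corresponds to `groups * groupsize` (or simply the problem size) when passed as `ndrange`.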
Benchmarking shows that KA is about 10% faster than Metal for large enough problems (n ≥ 4194304 in the runs below), even though it allocates about 40% more memory!
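As a quick check of those two percentages, the ratios can be computed directly from one pair of rows in the output below. This sketch uses the Float32, n = 67108864 figures. The variable names are made up for the example.

```julia
# Figures copied from the Float32, n = 67108864 rows of the benchmark output.
mtl_ms, ka_ms = 199.784, 181.099     # reported run times in ms
mtl_kib, ka_kib = 370.39, 501.83     # reported allocation sizes in KiB

time_ratio = mtl_ms / ka_ms          # > 1 means KA finished sooner
mem_ratio = ka_kib / mtl_kib         # > 1 means KA allocated more

println("Metal time / KA time:     ", round(time_ratio; digits = 2))
println("KA allocs / Metal allocs: ", round(mem_ratio; digits = 2))
```

For this row the time ratio comes out close to 1.10 and the allocation ratio close to 1.35. Other rows give slightly different values.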
I have not released the code yet (I am trying to use local storage, but so far it has not improved the time). If you think it can be of value as it is, I will do so.
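Since the code is not posted here, the following is a plain-Julia sketch of the textbook bitonic sorting network for readers who want to see the access pattern. It is a sequential CPU reference under stated assumptions (power-of-two length, ascending order, the hypothetical name `bitonic_sort!`). It is not the Metal or KA implementation benchmarked in this issue.

```julia
# Textbook bitonic sorting network, run sequentially on the CPU.
# Assumption: length(a) is a power of two.
function bitonic_sort!(a::AbstractVector)
    n = length(a)
    ispow2(n) || throw(ArgumentError("length must be a power of two"))
    k = 2                            # size of the bitonic sequences being merged
    while k <= n
        j = k >> 1                   # distance between compared elements
        while j > 0
            # On a GPU, each value of i would be one work-item, and each
            # (k, j) pair would be one kernel launch or one barrier-separated step.
            for i in 0:n-1
                l = i ⊻ j            # partner index (0-based)
                if l > i             # handle each pair once
                    ascending = (i & k) == 0
                    x, y = a[i+1], a[l+1]
                    if (ascending && x > y) || (!ascending && x < y)
                        a[i+1], a[l+1] = y, x
                    end
                end
            end
            j >>= 1
        end
        k <<= 1
    end
    return a
end

v = Float32[3, 7, 4, 8, 6, 2, 1, 5]
println(bitonic_sort!(copy(v)) == sort(v))
```

The inner `for` loop is the part that maps onto GPU threads. The steps with small `j` touch elements that are close together, which is where local (threadgroup) storage is usually tried.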
The results are from an Apple M2 Max (Apple MacBook Pro Laptop) running julia -t auto.
julia> include("benchmark/benchtest.jl")
T = Float32
n =262144
sort! 1.047 ms (6 allocations:1.01 MiB)
ThreadsX.sort! 748.041 μs (3706 allocations:2.68 MiB)
Mtl! 948.833 μs (3228 allocations:83.59 KiB)
KA! 1.166 ms (4189 allocations:114.28 KiB)
n =524288
sort! 2.135 ms (6 allocations:2.01 MiB)
ThreadsX.sort! 1.533 ms (7856 allocations:5.53 MiB)
Mtl! 1.138 ms (4050 allocations:104.84 KiB)
KA! 1.489 ms (5252 allocations:143.21 KiB)
n =1048576
sort! 4.482 ms (6 allocations:4.01 MiB)
ThreadsX.sort! 3.272 ms (16685 allocations:11.14 MiB)
Mtl! 1.360 ms (5075 allocations:130.20 KiB)
KA! 1.907 ms (6545 allocations:177.12 KiB)
n =2097152
sort! 9.298 ms (6 allocations:8.01 MiB)
ThreadsX.sort! 7.179 ms (36344 allocations:22.44 MiB)
Mtl! 3.092 ms (6203 allocations:158.67 KiB)
KA! 4.220 ms (7980 allocations:215.42 KiB)
n =4194304
sort! 20.118 ms (6 allocations:16.01 MiB)
ThreadsX.sort! 15.731 ms (77714 allocations:45.06 MiB)
Mtl! 7.263 ms (7550 allocations:194.66 KiB)
KA! 6.314 ms (9721 allocations:264.20 KiB)
n =8388608
sort! 40.891 ms (6 allocations:32.01 MiB)
ThreadsX.sort! 33.566 ms (161908 allocations:90.30 MiB)
Mtl! 16.692 ms (9018 allocations:233.92 KiB)
KA! 14.296 ms (11616 allocations:317.34 KiB)
n =16777216
sort! 84.110 ms (6 allocations:64.01 MiB)
ThreadsX.sort! 72.496 ms (341686 allocations:181.50 MiB)
Mtl! 38.724 ms (10601 allocations:276.27 KiB)
KA! 34.344 ms (13659 allocations:374.62 KiB)
n =33554432
sort! 175.927 ms (6 allocations:128.01 MiB)
ThreadsX.sort! 156.675 ms (704410 allocations:363.12 MiB)
Mtl! 88.976 ms (12299 allocations:321.69 KiB)
KA! 80.159 ms (15850 allocations:436.05 KiB)
n =67108864
sort! 353.998 ms (6 allocations:256.01 MiB)
ThreadsX.sort! 338.924 ms (1479407 allocations:667.98 MiB)
Mtl! 199.784 ms (14118 allocations:370.39 KiB)
KA! 181.099 ms (18195 allocations:501.83 KiB)
T = Int32
n =262144
sort! 1.189 ms (6 allocations:1.00 MiB)
ThreadsX.sort! 586.000 μs (3760 allocations:2.78 MiB)
Mtl! 936.959 μs (3228 allocations:83.59 KiB)
KA! 1.200 ms (4189 allocations:114.28 KiB)
n =524288
sort! 2.399 ms (6 allocations:2.00 MiB)
ThreadsX.sort! 1.173 ms (7730 allocations:5.50 MiB)
Mtl! 1.207 ms (4050 allocations:104.84 KiB)
KA! 1.557 ms (5252 allocations:143.21 KiB)
n =1048576
sort! 4.906 ms (6 allocations:4.00 MiB)
ThreadsX.sort! 2.501 ms (16596 allocations:11.12 MiB)
Mtl! 1.356 ms (5075 allocations:130.20 KiB)
KA! 1.943 ms (6545 allocations:177.12 KiB)
n =2097152
sort! 10.555 ms (6 allocations:8.00 MiB)
ThreadsX.sort! 5.247 ms (36175 allocations:22.18 MiB)
Mtl! 3.082 ms (6203 allocations:158.67 KiB)
KA! 2.799 ms (7980 allocations:215.42 KiB)
n =4194304
sort! 22.473 ms (6 allocations:16.00 MiB)
ThreadsX.sort! 11.464 ms (76687 allocations:44.64 MiB)
Mtl! 7.245 ms (7550 allocations:194.66 KiB)
KA! 6.487 ms (9721 allocations:264.20 KiB)
n =8388608
sort! 47.892 ms (6 allocations:32.00 MiB)
ThreadsX.sort! 25.140 ms (164254 allocations:90.53 MiB)
Mtl! 16.715 ms (9018 allocations:233.92 KiB)
KA! 14.329 ms (11616 allocations:317.34 KiB)
n =16777216
sort! 100.438 ms (6 allocations:64.00 MiB)
ThreadsX.sort! 52.938 ms (339660 allocations:180.83 MiB)
Mtl! 38.846 ms (10601 allocations:276.27 KiB)
KA! 34.601 ms (13659 allocations:374.62 KiB)
n =33554432
sort! 237.417 ms (6 allocations:128.00 MiB)
ThreadsX.sort! 115.156 ms (718296 allocations:365.63 MiB)
Mtl! 88.776 ms (12300 allocations:321.70 KiB)
KA! 80.378 ms (15852 allocations:436.08 KiB)
n =67108864
sort! 643.892 ms (6 allocations:256.00 MiB)
ThreadsX.sort! 243.369 ms (1517325 allocations:772.68 MiB)
Mtl! 199.257 ms (14124 allocations:370.48 KiB)
KA! 180.021 ms (18199 allocations:501.89 KiB)
T = Int64
n =262144
sort! 2.312 ms (6 allocations:2.01 MiB)
ThreadsX.sort! 852.792 μs (3424 allocations:4.70 MiB)
Mtl! 927.041 μs (3228 allocations:83.59 KiB)
KA! 1.200 ms (4189 allocations:114.28 KiB)
n =524288
sort! 4.728 ms (6 allocations:4.01 MiB)
ThreadsX.sort! 1.696 ms (7640 allocations:9.57 MiB)
Mtl! 1.106 ms (4050 allocations:104.84 KiB)
KA! 1.441 ms (5252 allocations:143.21 KiB)
n =1048576
sort! 9.878 ms (6 allocations:8.01 MiB)
ThreadsX.sort! 3.495 ms (16651 allocations:19.51 MiB)
Mtl! 1.394 ms (5075 allocations:130.20 KiB)
KA! 1.708 ms (6545 allocations:177.12 KiB)
n =2097152
sort! 22.117 ms (6 allocations:16.01 MiB)
ThreadsX.sort! 7.686 ms (36340 allocations:39.68 MiB)
Mtl! 3.447 ms (6203 allocations:158.67 KiB)
KA! 3.436 ms (7981 allocations:215.44 KiB)
n =4194304
sort! 49.599 ms (6 allocations:32.01 MiB)
ThreadsX.sort! 16.572 ms (76427 allocations:79.96 MiB)
Mtl! 8.649 ms (7552 allocations:194.69 KiB)
KA! 7.721 ms (9721 allocations:264.20 KiB)
n =8388608
sort! 100.277 ms (6 allocations:64.01 MiB)
ThreadsX.sort! 35.100 ms (161905 allocations:158.08 MiB)
Mtl! 23.233 ms (9018 allocations:233.92 KiB)
KA! 21.447 ms (11616 allocations:317.34 KiB)
n =16777216
sort! 212.546 ms (6 allocations:128.01 MiB)
ThreadsX.sort! 76.809 ms (351153 allocations:317.39 MiB)
Mtl! 57.760 ms (10601 allocations:276.27 KiB)
KA! 53.755 ms (13659 allocations:374.62 KiB)
n =33554432
sort! 440.911 ms (6 allocations:256.01 MiB)
ThreadsX.sort! 161.302 ms (720982 allocations:637.16 MiB)
Mtl! 138.131 ms (12299 allocations:321.69 KiB)
KA! 129.067 ms (15850 allocations:436.05 KiB)
n =67108864
sort! 951.172 ms (6 allocations:512.01 MiB)
ThreadsX.sort! 372.092 ms (1524955 allocations:1.25 GiB)
Mtl! 320.788 ms (14118 allocations:370.39 KiB)
KA! 299.398 ms (18195 allocations:501.83 KiB)
julia> Sys.CPU_THREADS
8
julia> Base.Sys.MACHINE
"arm64-apple-darwin24.0.0"
julia> Base.Sys.cpu_info
cpu_info (generic function with 1 method)
julia> Base.Sys.cpu_info()
12-element Vector{Base.Sys.CPUinfo}:
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000001077cd0, 0x0000000000000000, 0x00000000009e1b0a, 0x00000000029a5c48, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000001016f2a, 0x0000000000000000, 0x00000000008c9d1c, 0x0000000002b29c86, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000000f39e18, 0x0000000000000000, 0x0000000000799636, 0x0000000002d47cf2, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000000e5a2d6, 0x0000000000000000, 0x000000000069ff1e, 0x0000000002f2e2d2, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x00000000002df0f0, 0x0000000000000000, 0x00000000000f1f86, 0x0000000004089428, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000000216ca4, 0x0000000000000000, 0x000000000007cb28, 0x00000000041cc25e, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000000192c7e, 0x0000000000000000, 0x0000000000049be2, 0x00000000042862f8, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x00000000001436e2, 0x0000000000000000, 0x0000000000035d22, 0x00000000042ebc84, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x00000000002e7b92, 0x0000000000000000, 0x00000000000f07da, 0x0000000004081df4, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x0000000000220cc2, 0x0000000000000000, 0x000000000007c6a0, 0x00000000041c2614, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x000000000018b7ee, 0x0000000000000000, 0x0000000000049192, 0x000000000428e2fa, 0x0000000000000000)
Base.Sys.CPUinfo("Apple M2 Max", 2400, 0x00000000001437f0, 0x0000000000000000, 0x0000000000035aa2, 0x00000000042ebfcc, 0x0000000000000000)
julia> Base.VERSION
v"1.11.4"