Description
When calling unsafe_load/unsafe_store! with an alignment argument Val(size) where size is not a power of 2 (e.g., sizeof(Float3) = 12 bytes) at an index >= 2, the Julia REPL crashes.
A workaround is to specify an alignment that is a power of two and divides the size of the struct, but this leads to suboptimal loads/stores.
For example, for Float3, it works in either of these two cases:
- unsafe_load/unsafe_store! at index 1 with alignment 12
- unsafe_load/unsafe_store! at index > 1 with alignment 4 (a power of 2 that divides 12)
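The "power of 2 that divides the struct size" used in the workaround can be computed rather than hard-coded. The snippet below is a minimal CPU-only sketch; `largest_pow2_divisor` is a hypothetical helper name, not part of CUDA.jl or LLVM.jl. For context, LLVM IR requires the `align` value on loads and stores to be a power of two, although whether that is the cause of this crash has not been confirmed.

```julia
# Hypothetical helper (not a CUDA.jl API): largest power of two that divides n.
# For a positive integer, n & -n isolates the lowest set bit.
largest_pow2_divisor(n::Integer) = n & -n

struct Float3
    x::Float32
    y::Float32
    z::Float32
end

# sizeof(Float3) is 12, so the largest power-of-two alignment that divides it is 4.
safe_align = largest_pow2_divisor(sizeof(Float3))
println("sizeof = ", sizeof(Float3), ", safe alignment = ", safe_align)
```

With this value, `Val(safe_align)` would be the alignment argument used by the workaround described above.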
Minimal Reproducible Example
using CUDA
struct Float3
    x::Float32
    y::Float32
    z::Float32
end
function kernel(a, b, i)
    Nitem = 1
    ptr_a = reinterpret(Core.LLVMPtr{Float3, AS.Global}, pointer(a))
    ptr_b = reinterpret(Core.LLVMPtr{Float3, AS.Global}, pointer(b))
    sz = Nitem * sizeof(Float3) # sz = 12
    values = unsafe_load(ptr_a, i, Val(sz))
    unsafe_store!(ptr_b, values, i, Val(sz))
    return
end
a = cu([Float3(1, 1, 1) for _ in 1:8])
b = cu([Float3(0, 0, 0) for _ in 1:8])
# Works and optimizes with one v2.f32 load and one normal .f32 load (see below)
@cuda kernel(a, b, 1)
# CRASHES
#@cuda kernel(a, b, 2)
# Optimized v2 + normal load for index 1
buf = IOBuffer()
a = cu([Float3(1, 1, 1) for _ in (1:8)])
b = cu([Float3(0, 0, 0) for _ in (1:8)])
@cuda kernel(a, b, 1) # It crashes if we put kernel(a, b, 2) instead
@device_code_ptx io = buf @cuda kernel(a, b, 1)
asm = String(take!(copy(buf)))
occursin("ld.global.v2", asm) #true
occursin("ld.global.f32 ", asm) #true
occursin("st.global.v2", asm) #true
Expected Behavior
The kernel should work correctly for any valid index (1 to 8 in this example) with alignment 12, using one v2 load and one normal load.
Actual Behavior
Index i = 1: works and leads to optimal vectorized loads/stores (one v2 and one .f32 in each case).
Index i = 2 with alignment 12: crashes.
Suboptimal Workaround
Setting sz = Nitem * 4 (a power of 2) instead of Nitem * sizeof(Float3) prevents the crash, but it produces three normal .f32 loads instead of one v2.f32 load plus one .f32 load.
Related
This works correctly for Float4 (16 bytes, power of 2), which generates optimal ld.global.v4.f32 instructions. The issue appears specific to non-power-of-2 struct sizes.
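One possible reading of the Float3 vs. Float4 difference is plain address arithmetic: in a packed array, element i starts at byte offset (i - 1) * sizeof(T), and only some of those offsets satisfy the 8-byte alignment that a v2.f32 access needs. The CPU-only sketch below tabulates the guaranteed power-of-two alignment per index. `element_alignment` is a hypothetical helper, and the 256-byte base alignment is an assumption about the device allocation, not something measured in this report.

```julia
# Hypothetical helper: largest power-of-two alignment guaranteed for element `i`
# (1-based) of a packed array whose base pointer is aligned to `base_align` bytes.
# The default base_align = 256 is an assumption about the allocator.
function element_alignment(i::Integer, elsize::Integer; base_align::Integer = 256)
    offset = (i - 1) * elsize           # byte offset of element i
    offset == 0 && return base_align    # first element inherits the base alignment
    return min(base_align, offset & -offset)
end

float3_align = [element_alignment(i, 12) for i in 1:8]  # 12-byte elements (Float3)
float4_align = [element_alignment(i, 16) for i in 1:8]  # 16-byte elements (Float4)

println("Float3: ", float3_align)
println("Float4: ", float4_align)
```

Under these assumptions, every Float4 element is at least 16-byte aligned, while the Float3 element at index 2 is only 4-byte aligned, which is consistent with index 1 working and index 2 failing. This does not establish the root cause of the crash itself.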
Version info
Details on Julia:
versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 22 × Intel(R) Core(TM) Ultra 7 165H
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 22 default, 0 interactive, 11 GC (on 22 virtual cores)
Environment:
JULIA_EDITOR = code
JULIA_VSCODE_REPL = 1
JULIA_NUM_THREADS = 22
Details on CUDA:
CUDA.versioninfo()
CUDA toolchain:
- runtime 13.0, artifact installation
- driver 550.163.1 for 13.0
- compiler 13.0
CUDA libraries:
- CUBLAS: 13.1.0
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 12.0.0+550.163.1
Julia packages:
- CUDA: 5.9.2
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0
Toolchain:
- Julia: 1.11.7
- LLVM: 16.0.6
1 device:
0: NVIDIA RTX 1000 Ada Generation Laptop GPU (sm_89, 5.472 GiB / 5.997 GiB available)