Alignment crash for unsafe loads/stores #2963

@epilliat

Description

When unsafe_load/unsafe_store! is called with an alignment argument Val(size) where size is not a power of 2 (e.g., 12 bytes for Float3) at index >= 2, the Julia REPL crashes.
A workaround is to specify an alignment that is a power of two and divides the size of the struct, but this leads to suboptimal loads/stores.

For example, with Float3 it works in either of these two cases:

  • unsafe_load/unsafe_store! at index 1 with alignment 12
  • unsafe_load/unsafe_store! at index > 1 with alignment 4 (a power of 2 that divides 12)
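
To make the two cases above concrete, here is a small host-side sketch of the byte-offset arithmetic for a Float3 array. It assumes the base pointer itself is at least 16-byte aligned (an assumption, not something verified in this report), and `offset_alignment` is a hypothetical helper introduced only for illustration. It is not a confirmed diagnosis of the crash.

```julia
struct Float3
    x::Float32
    y::Float32
    z::Float32
end

# Hypothetical helper: largest power-of-two alignment (in bytes) that a byte
# offset from an aligned base satisfies, capped at `cap`.
# Offset 0 satisfies any alignment up to the cap.
offset_alignment(offset::Integer; cap::Integer = 16) =
    offset == 0 ? cap : min(cap, 1 << trailing_zeros(offset))

for i in 1:8
    offset = (i - 1) * sizeof(Float3)   # 1-based index -> byte offset
    println("index ", i, ": byte offset ", offset,
            ", alignment ", offset_alignment(offset))
end
# index 1: byte offset 0, alignment 16
# index 2: byte offset 12, alignment 4
```

Under that assumption, the element at index 2 starts at byte offset 12, which is only 4-byte aligned, whereas index 1 sits at offset 0. With a 16-byte struct every offset is a multiple of 16, which is at least consistent with the Float4 observation in this report.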

Minimal Reproducible Example

using CUDA

struct Float3
    x::Float32
    y::Float32
    z::Float32
end

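function kernel(a, b, i)
    Nitem = 1
    # View both arrays as raw global-memory pointers to Float3
    ptr_a = reinterpret(Core.LLVMPtr{Float3, AS.Global}, pointer(a))
    ptr_b = reinterpret(Core.LLVMPtr{Float3, AS.Global}, pointer(b))
    sz = Nitem * sizeof(Float3)  # sz = 12, passed below as the alignment
    # Copy element i from a to b, using Val(12) as the alignment argument
    values = unsafe_load(ptr_a, i, Val(sz))
    unsafe_store!(ptr_b, values, i, Val(sz))
    return
end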

a = cu([Float3(1, 1, 1) for _ in 1:8])
b = cu([Float3(0, 0, 0) for _ in 1:8])

# Works and optimizes with one v2.f32 load and one normal .f32 load (see below)
@cuda kernel(a, b, 1)

# Crashes the REPL when uncommented
#@cuda kernel(a, b, 2)

# Inspect the PTX generated for index 1: one v2 access plus one scalar access

buf = IOBuffer()

a = cu([Float3(1, 1, 1) for _ in 1:8])
b = cu([Float3(0, 0, 0) for _ in 1:8])
@cuda kernel(a, b, 1)  # crashes if called as kernel(a, b, 2) instead
@device_code_ptx io = buf @cuda kernel(a, b, 1)

asm = String(take!(copy(buf)))
occursin("ld.global.v2", asm)    # true
occursin("ld.global.f32 ", asm)  # true
occursin("st.global.v2", asm)    # true

Expected Behavior

The kernel should work for any valid index (1 to 8 in this example) with alignment 12, producing one v2 load and one scalar load.

Actual Behavior

Index i = 1: works and produces optimal vectorized loads/stores (one v2 and one .f32 in each case).
Index i = 2 with alignment 12: crash.

Suboptimal Workaround

Setting sz = Nitem * 4 (a power of 2) instead of sizeof(Float3) prevents the crash, but it produces three scalar .f32 loads instead of one v2.f32 plus one .f32.
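
As a minimal sketch of how the workaround alignment could be derived rather than hard-coded, the largest power of two dividing the struct size can be computed on the host. `pow2_divisor` and the `Float4` definition below are hypothetical names introduced here for illustration; they are not part of CUDA.jl.

```julia
struct Float3
    x::Float32
    y::Float32
    z::Float32
end

# Assumed 16-byte counterpart, defined here only for comparison
struct Float4
    x::Float32
    y::Float32
    z::Float32
    w::Float32
end

# Hypothetical helper: largest power of two that divides n (for n > 0).
# In two's complement, n & -n isolates the lowest set bit of n.
pow2_divisor(n::Integer) = n & -n

println(pow2_divisor(sizeof(Float3)))  # 4
println(pow2_divisor(sizeof(Float4)))  # 16
```

For Float3 this gives 4, the same value used in the workaround above, so it avoids the crash but does not recover the v2 load. The value would still need to reach unsafe_load/unsafe_store! as a compile-time constant, for example through a `const` that is wrapped in `Val`.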

Related

This works correctly for Float4 (16 bytes, power of 2), which generates optimal ld.global.v4.f32 instructions. The issue appears specific to non-power-of-2 struct sizes.

Version info

Details on Julia:

versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 22 × Intel(R) Core(TM) Ultra 7 165H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 22 default, 0 interactive, 11 GC (on 22 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1
  JULIA_NUM_THREADS = 22

Details on CUDA:

CUDA.versioninfo()
CUDA toolchain: 
- runtime 13.0, artifact installation
- driver 550.163.1 for 13.0
- compiler 13.0

CUDA libraries: 
- CUBLAS: 13.1.0
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 12.0.0+550.163.1

Julia packages: 
- CUDA: 5.9.2
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.11.7
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX 1000 Ada Generation Laptop GPU (sm_89, 5.472 GiB / 5.997 GiB available)

Metadata

    Labels

    bug: Something isn't working
