Alignment crash for unsafe loads/stores




When `unsafe_load`/`unsafe_store!` are called with an alignment `Val(sz)` where `sz` is not a power of 2 (e.g., `sizeof(Float3)` = 12 bytes) at an index >= 2, the Julia REPL crashes.
A workaround is to specify an alignment that is a power of two and divides the size of the struct, but this leads to suboptimal loads/stores.

For example, for `Float3`, it **works** in either of these two cases:
- `unsafe_load`/`unsafe_store!` at **index 1** with alignment 12
- `unsafe_load`/`unsafe_store!` at index > 1 with **alignment 4** (a power of 2 that divides 12)
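
One way to see why index 1 behaves differently: with a 12-byte element, element `i` starts at byte offset `(i - 1) * 12` from the base pointer. The sketch below is plain CPU Julia and needs no GPU. It computes the largest power-of-two alignment each offset can guarantee, assuming the base pointer is at least 16-byte aligned. That base alignment is an assumption, and `offset_alignment` is a hypothetical helper, not part of CUDA.jl.

```julia
# Hypothetical helper: largest power of two (capped at `base_align`) that
# divides the byte offset of element `i` for elements of `elsize` bytes.
# Assumes the base pointer itself is aligned to `base_align` bytes.
function offset_alignment(i::Integer, elsize::Integer; base_align::Integer = 16)
    offset = (i - 1) * elsize
    offset == 0 && return base_align  # element 1 inherits the base alignment
    return min(base_align, 1 << trailing_zeros(offset))
end

for i in 1:4
    println("index ", i, ": offset ", (i - 1) * 12,
            ", alignment ", offset_alignment(i, 12))
end
```

Under that assumption, index 1 is 16-byte aligned, index 2 is only 4-byte aligned, index 3 is 8-byte aligned, and index 4 is again 4-byte aligned. This may be relevant to the crash, since an 8-byte `v2.f32` access at index 2 would not be naturally aligned.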


### Minimal Reproducible Example
```julia
using CUDA

struct Float3
    x::Float32
    y::Float32
    z::Float32
end

function kernel(a, b, i)
    Nitem = 1
    ptr_a = reinterpret(Core.LLVMPtr{Float3, AS.Global}, pointer(a))
    ptr_b = reinterpret(Core.LLVMPtr{Float3, AS.Global}, pointer(b))
    sz = Nitem * sizeof(Float3)  # sz = 12
    values = unsafe_load(ptr_a, i, Val(sz))
    unsafe_store!(ptr_b, values, i, Val(sz))
    return
end

a = cu([Float3(1, 1, 1) for _ in 1:8])
b = cu([Float3(0, 0, 0) for _ in 1:8])

# Works and optimizes with one v2.f32 load and one normal .f32 load (see below)
@cuda kernel(a, b, 1)

# CRASHES
#@cuda kernel(a, b, 2)

# Optimized v2 + normal load for index 1

buf = IOBuffer()

a = cu([Float3(1, 1, 1) for _ in (1:8)])
b = cu([Float3(0, 0, 0) for _ in (1:8)])
@cuda kernel(a, b, 1) # It crashes if we put kernel(a, b, 2) instead
@device_code_ptx io = buf @cuda kernel(a, b, 1)

asm = String(take!(buf))
occursin("ld.global.v2", asm)    # true
occursin("ld.global.f32 ", asm)  # true
occursin("st.global.v2", asm)    # true
```

### Expected Behavior

It should work correctly for any valid index (1-8 in this example) with alignment 12, producing one v2 load and one scalar `.f32` load.

### Actual Behavior

- **Index `i = 1`:** works and leads to optimal vectorized loads/stores (one v2 and one `.f32` in each case).
- **Index 2 with alignment 12:** crashes.

### Suboptimal Workaround

Setting `sz = Nitem * 4` (a power of 2) instead of `Nitem * sizeof(Float3)` prevents the crash, but it produces three scalar `.f32` loads instead of one `v2.f32` plus one `.f32`.
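
The power-of-two alignment used in this workaround can be derived from the struct size instead of being hard-coded. A minimal sketch follows, where `pow2_alignment` is a hypothetical helper name (not a CUDA.jl function) and `Float4` is written out by analogy with `Float3`:

```julia
struct Float3
    x::Float32
    y::Float32
    z::Float32
end

struct Float4
    x::Float32
    y::Float32
    z::Float32
    w::Float32
end

# Hypothetical helper: largest power of two that divides `nbytes` (nbytes > 0).
pow2_alignment(nbytes::Integer) = 1 << trailing_zeros(nbytes)

pow2_alignment(sizeof(Float3))  # 4
pow2_alignment(sizeof(Float4))  # 16
```

For `Float3` this gives the same value, 4, as the workaround with `Nitem = 1`, and for `Float4` it keeps the full 16-byte alignment. Whether the call constant-folds when used inside `Val(...)` in a kernel has not been verified here.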

### Related

This works correctly for `Float4` (16 bytes, power of 2), which generates optimal `ld.global.v4.f32` instructions. The issue appears specific to non-power-of-2 struct sizes.


**Version info**

Details on Julia:

```
versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 22 × Intel(R) Core(TM) Ultra 7 165H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 22 default, 0 interactive, 11 GC (on 22 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1
  JULIA_NUM_THREADS = 22
```

Details on CUDA:

```
CUDA.versioninfo()
CUDA toolchain: 
- runtime 13.0, artifact installation
- driver 550.163.1 for 13.0
- compiler 13.0

CUDA libraries: 
- CUBLAS: 13.1.0
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 12.0.0+550.163.1

Julia packages: 
- CUDA: 5.9.2
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.11.7
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX 1000 Ada Generation Laptop GPU (sm_89, 5.472 GiB / 5.997 GiB available)
```


Alignment crash for unsafe loads/stores #2963
