Releases · JuliaGPU/CUDA.jl
v5.6.0
CUDA v5.6.0
CUDA.jl v5.6 is a relatively minor release, with the most important change happening behind the scenes: GPUArrays.jl v11 has switched to KernelAbstractions.jl (#2524).
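To illustrate what the KernelAbstractions.jl-based stack looks like from user code, here is a minimal sketch (not taken from the release notes; kernel name and arguments are illustrative) of a hand-written kernel launched on a CuArray through the CUDA backend:

```julia
using CUDA, KernelAbstractions

# A simple element-wise kernel written against the KernelAbstractions.jl API.
@kernel function saxpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] += a * x[i]
end

x = CUDA.rand(Float32, 1024)
y = CUDA.zeros(Float32, 1024)

backend = KernelAbstractions.get_backend(y)   # CUDABackend() for CuArrays
saxpy_kernel!(backend)(y, 2f0, x; ndrange = length(y))
KernelAbstractions.synchronize(backend)
```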
Features
- Update to CUDA 12.6.2 (#2512)
- CUSOLVER: support for `Xgeev!` (#2513), `XsyevBatched` (#2577), `gesv!` and `gels!` (#2406)
- CUBLAS: added multiplication of transpose / adjoint matrices by diagonal matrices (#2518, #2538); see the sketch after this list
- Improve handle cache performance in the presence of many short-lived tasks (#2583)
- CUFFT: Pre-allocate the buffer required for complex-to-real FFTs only once (#2578)
- Improved batched pointer conversion for very large batches (#2608)
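As an example of the new CUBLAS-backed diagonal products (#2518, #2538), a minimal sketch, assuming Float32 inputs; the variable names are illustrative and not taken from the changelog:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 4, 4)
d = CUDA.rand(Float32, 4)

# Products of transposed/adjoint CuMatrices with Diagonal wrappers
# are now handled by dedicated CUBLAS paths instead of generic fallbacks.
B = transpose(A) * Diagonal(d)
C = Diagonal(d) * A'
```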
Bug fixes
- Fix `findall` with an empty CuArray (#2554); see the sketch after this list
- CUBLAS: Fix use of level 1 methods with strided arrays (#2528)
- CUSOLVER: Fix `Xgesvdr!` (#2556)
- Preserve the array buffer type with more linear algebra operations (#2534)
- Work around LinearAlgebra.jl breakage in Julia 1.11.2 concerning generic triangular `(l/r)mul!` (#2585)
- Fix ambiguity of `LinearAlgebra.dot` (#2569)
- Native RNG: Fixes when working with very large arrays (#2561)
- Avoid a deadlock due to union splitting in the `mapreduce` kernel (#2595)
- Fix pinning of resized CPU memory by automatically re-pinning (#2599)
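For instance, the `findall` fix (#2554) concerns the corner case sketched below (a hypothetical snippet, not taken from the test suite):

```julia
using CUDA

mask = CuArray(Bool[])   # zero-length boolean array on the GPU
idx  = findall(mask)     # previously errored; now returns an empty index array
@assert isempty(idx)
```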
Merged pull requests:
- [CUSOLVER] Interface gesv! and gels! (#2406) (@amontoison)
- Update wrappers for CUDA v12.6.2 (#2512) (@amontoison)
- [CUSOLVER] Interface Xgeev! (#2513) (@amontoison)
- Added multiplication of transpose / adjoint matrices by diagonal matrices (#2518) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 1, (keep existing compat) (#2521) (@github-actions[bot])
- Adapt to GPUArrays.jl transition to KernelAbstractions.jl. (#2524) (@maleadt)
- Switch CI to 1.11. (#2525) (@maleadt)
- CUTENSOR: Reduce amount of broadcasts compiled during tests. (#2527) (@maleadt)
- CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. (#2528) (@maleadt)
- Clarify the synchronize(ctx)/device_synchronize() docstrings (#2532) (@JamesWrigley)
- Issue #2533: Preserving the buffer type in linear algebra (#2534) (@kmp5VT)
- Clarify description of how `LocalPreferences.toml` is generated in the docs (#2535) (@glwagner)
- Adapt to JuliaGPU/GPUArrays.jl#567. (#2537) (@maleadt)
- Removed allocations for transpose/adjoint - diagonal multiplications (#2538) (@RedRussianBear)
- Consistent use of Nsight Compute (#2541) (@huiyuxie)
- Fix formatting in profiling docs page (#2543) (@efaulhaber)
- Fix typo in EnzymeCoreExt.jl (#2550) (@wsmoses)
- Enhance warning under a profiler (#2552) (@huiyuxie)
- Fix findall with an empty CuArray of Bool (#2554) (@amontoison)
- [CUSOLVER] Fix Xgesvdr! (#2556) (@amontoison)
- Test restore Enzyme.jl (#2557) (@wsmoses)
- Native RNG fixes for very large arrays (#2561) (@maleadt)
- [Enzyme] Mark launch_configuration as inactive (#2563) (@wsmoses)
- Update EnzymeCoreExt.jl (#2565) (@simenhu)
- Fix ambiguity of LinearAlgebra.dot (#2569) (@amontoison)
- [CUSOLVER] Add more tests for the dense SVD (#2574) (@amontoison)
- [CUSOLVER] Interface XsyevBatched (#2577) (@amontoison)
- [CUFFT] Preallocate a buffer for complex-to-real FFT (#2578) (@amontoison)
- Run the GC when failing to find a handle, but lots are active. (#2583) (@maleadt)
- Work around LinearAlgebra.jl breakage in 1.11.2. (#2585) (@maleadt)
- mapreduce: avoid deadlock by forcing the accumulator type. (#2596) (@maleadt)
- Switch to GitHub Actions-based benchmarks. (#2597) (@maleadt)
- Re-pin variable sized memory (#2599) (@jipolanco)
- Enzyme: add make_zero of cuarrays (#2600) (@wsmoses)
- Update cache.jl (#2604) (@jarbus)
- Enzyme: mark device_sync as non-differentiable [only downstream] (#2605) (@wsmoses)
- Move strided batch pointer conversion to GPU (#2608) (@THargreaves)
- Split linalg tests into multiple files (#2609) (@kshyatt)
Closed issues:
- Inference failure with sort(::CuMatrix) after loading MLDatasets (#2258)
- Kron Support for CuSparseMatrixCSC (#2370)
- Broadcasting a function returning an anonymous function with a constructor over CUDA arrays fails to compile, "not isbits" (#2514)
- CuArray view has a different type outside vs. inside the CUDA kernel (#2516)
- Can't build cuDNN on centos7.8 (#2517)
- Precompile errors (#2519)
- Precompile errors (#2520)
- Error returned from CUDA function in CUDA-aware MPI multi-GPU test (#2522)
- Broadcasting over random static array errors on Julia 1.11 (#2523)
- `gemm_strided_batched` only using strided CUDA kernel when first matrix is transposed (#2529)
- CUDA runtime libraries are loaded from a system path due to LD_LIBRARY_PATH being set (#2530)
- [Bug] `UnifiedMemory` buffer changes during LinearAlgebra operations (#2533)
- Improve system library warning when running under profiler (#2540)
- Local CUDA settings not propagated to Pkg.test (#2545)
- Out of Memory when working with Distributed for Small Matrices (#2548)
- findall is not working with an empty vector of bool (#2553)
- CUDA code does not return when running under VSC Debugging mode (#2558)
- dot is quite slow in multinest Arrays (#2559)
- UndefVarError: `backend` not defined in `GPUArrays` (#2564)
- view() returns CuArray instead of view for 1-D CuArrays (#2566)
- dot ambiguity (#2568)
- InvalidIRError thrown only if critical function is not previously compiled (#2573)
- circular dependency during precompilation (#2579)
- Sparse MatVec Is Nondeterministic? (#2582)
- CUDA triggers long Circular dependency list (#2586)
- Release v5.5.3 for GPUArray v11? (#2587)
- 'dot' gives different answers when viewing rather than slicing multidimensional arrays (#2589)
- Scalar indexing when performing `kron` on two `CuVector`s (#2591)
- Faster strided-batched to batched wrapper (#2592)
- Error when copying data to pinned and resized CPU array (#2594)
- mapreducedim! size-dependent fail when narrowing float element types (#2595)
- Missing `Enzyme.make_zero` in Enzyme extension leads to incorrect behaviour (#2598)
- 'ArgumentError: array must be non-empty' when attempting to pop idle handles from HandleCache (#2603)
- Do a release as current one doesn't support `GPUArrays` v11 (#2606)
v5.5.2
CUDA v5.5.2
Merged pull requests:
v5.5.1
What's Changed
- Update wrappers for CUDA v12.6.1 by @amontoison in #2499
- Enzyme: adapt to pending version breaking update by @wsmoses in #2490
Full Changelog: v5.5.0...v5.5.1
v5.5.0
CUDA v5.5.0
Merged pull requests:
- Add support for arbitrary group sizes in `gemm_grouped_batched!` (#2334) (@lpawela)
- Add kernel compilation requirements to docs (#2416) (@termi-official)
- Enzyme: reverse mode kernels (#2422) (@wsmoses)
- CUFFT: Support Float16 (#2430) (@eschnett)
- Updated compute-sanitizer documentation (#2440) (@alexp616)
- Add troubleshooting section for NSight Compute (#2442) (@efaulhaber)
- Correct typo in documentation (#2445) (@eschnett)
- Bump minimal Julia requirement to v1.10. (#2447) (@maleadt)
- fix compute-sanitizer typo (#2448) (@alexp616)
- Address a corner case when establishing p2p access (#2457) (@findmyway)
- Implementation of spdiagm for CUSPARSE (#2458) (@walexaindre)
- Update to CUDA 12.6. (#2461) (@maleadt)
- CompatHelper: bump compat for GPUCompiler to 0.27, (keep existing compat) (#2462) (@github-actions[bot])
- Bump CUDA driver JLL. (#2463) (@maleadt)
- CUSOLVER (dense): cache workspace in fat handle (#2465) (@bjarthur)
- Revert "Run full GC when under very high memory pressure." (#2469) (@maleadt)
- Fix a method deprecation. (#2470) (@maleadt)
- Add Enzyme sum derivatives (#2471) (@wsmoses)
- Re-use pre-converted kernel arguments when launching kernels. (#2472) (@maleadt)
- Bump LLVM compat (#2473) (@maleadt)
- Bump subpackage compat. (#2475) (@maleadt)
- Enzyme: Reversemode cudaconvert (#2476) (@wsmoses)
- Ignore Enzyme.jl CI failures (#2479) (@maleadt)
- Re-enable enzyme testing (#2480) (@wsmoses)
- Add missing GC.@preserves. (#2487) (@maleadt)
- [CUSPARSE] Implement a sparse GEMV for CuSparseMatrixCSC * CuSparseVector (#2488) (@amontoison); see the sketch after this list
- [CUSPARSE] Add conversions between CuSparseVector and CuSparseMatrices (#2489) (@amontoison)
- Update to LLVM 9.1. (#2491) (@maleadt)
- Use at-consistent_overlay for 1.11 compatibility. (#2492) (@maleadt)
- Rework NNlib CI. (#2493) (@maleadt)
- CUSPARSE: Fix sparse constructor with duplicate elements. (#2495) (@maleadt)
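As a usage sketch for the new sparse GEMV path (#2488), assuming the standard CUSPARSE container types; the sizes and densities are arbitrary, and the exact return container may vary:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

# Upload CPU sparse data to the corresponding CUSPARSE containers.
A = CuSparseMatrixCSC(sprand(Float32, 1_000, 1_000, 0.01))
x = CuSparseVector(sprand(Float32, 1_000, 0.05))

y = A * x   # sparse matrix-vector product, now dispatched to the CUSPARSE wrapper
```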
Closed issues:
- `LinearAlgebra.norm(x)` falls back to generic implementation for `x::Transpose` and `x::Adjoint` (#1782)
- dlclose'ing the compatibility driver can fail (#1848)
- Creating a sparse diagonal matrix of CuArray(u) (#1857)
- Support for Julia 1.11 (#2241)
- CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328)
- Adding CUDA to a PackageCompiler sysimage causes segfault (#2428)
- Error using CUDA on Julia 1.10: `Number of threads per block exceeds kernel limit` (#2438)
- Error when I load my model (#2439)
- Driver JLL improvements (#2446)
- Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread (#2449)
- CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452)
- Segmentation Fault on Loading CUDA (#2453)
- `Invalid instruction` error when `using CUDA` (#2454)
- Missing `adapt` for sparse and `CUDABackend` (#2459)
- CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466)
- Request: Option to disable the "full GC when under very high memory pressure". (#2467)
- copyto! ambiguous (#2477)
- NeuralODE training failed on GPU with Enzyme (#2478)
- issue with atomic - when running standard test, @atomic modify expression missing field access (#2483)
- Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484)
- Issue with compiling CUDA and cuTENSOR using local libraries (#2486)
- Memory Access error in sparse array constructor (#2494)
- Forwards-compatible driver breaks CURAND (#2496)
- CUDA 12.6 Update 1 (#2497)
v5.4.3
CUDA v5.4.3
Merged pull requests:
- add cublasgetrsBatched (#2385) (@bjarthur)
- add two quirks for rationals (#2403) (@lanceXwq); see the sketch after this list
- Bump cuDNN (#2404) (@maleadt)
- Add convert method for ScaledPlan (#2409) (@david-macmahon)
- Conditionalize a quirk. (#2411) (@maleadt)
- Relax signature of generic matvecmul! (#2414) (@dkarrasch)
- Fix kron launch configuration. (#2418) (@maleadt)
- Run full GC when under very high memory pressure. (#2421) (@maleadt)
- Enzyme: Fix cuarray return type (#2425) (@wsmoses)
- CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
- pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
- Profiler tweaks. (#2432) (@maleadt)
- Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
- Correct workspace handling (#2437) (@maleadt)
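The rational-number quirks from #2403 target broadcasts like the following (a minimal sketch related to issue #1926 below; not taken from the test suite):

```julia
using CUDA

x = CUDA.rand(Float32, 8)

# Broadcasting a Rational scalar over a CuArray previously failed to compile
# on the GPU (see issue #1926); the quirks added in #2403 address this.
y = x .* (1//2)
```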
Closed issues:
- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in `kron` (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
v5.4.2
CUDA v5.4.2
Merged pull requests:
v5.4.1
CUDA v5.4.1
Merged pull requests:
v5.4.0
CUDA v5.4.0
Merged pull requests:
- Support CUDA 12.5 (#2392) (@maleadt)
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)
Closed issues:
v5.3.5
CUDA v5.3.5
Merged pull requests:
- Avoid constructing `MulAddMul`s on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)
Closed issues:
v5.3.4
CUDA v5.3.4
Merged pull requests:
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)
Closed issues:
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1` function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for `sm_80` `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)