Releases: JuliaGPU/CUDA.jl

v5.6.0

08 Jan 10:31
fc952a3

CUDA v5.6.0

Diff since v5.5.2

CUDA.jl v5.6 is a relatively minor release, with the most important change happening behind the scenes: GPUArrays.jl v11 has switched to KernelAbstractions.jl (#2524).

Features

  • Update to CUDA 12.6.2 (#2512)
  • CUSOLVER: support for Xgeev! (#2513), XsyevBatched (#2577), gesv! and gels! (#2406)
  • CUBLAS: added multiplication of transpose / adjoint matrices by diagonal matrices (#2518, #2538)
  • Improve handle cache performance in the presence of many short-lived tasks (#2583)
  • CUFFT: Pre-allocate the buffer required for complex-to-real FFTs only once (#2578)
  • Improved batched pointer conversion for very large batches (#2608)

Bug fixes

  • Fix findall with an empty CuArray (#2554)
  • CUBLAS: Fix use of level 1 methods with strided arrays (#2528)
  • CUSOLVER: Fix Xgesvdr! (#2556)
  • Preserve the array buffer type with more linear algebra operations (#2534)
  • Work around LinearAlgebra.jl breakage in Julia 1.11.2 concerning generic triangular (l/r)mul! (#2585)
  • Fix ambiguity of LinearAlgebra.dot (#2569)
  • Native RNG: Fixes when working with very large arrays (#2561)
  • Avoid a deadlock due to union splitting in the mapreduce kernel (#2595)
  • Fix pinning of resized CPU memory by automatically re-pinning (#2599)

Closed issues:

  • Inference failure with sort(::CuMatrix) after loading MLDatasets (#2258)
  • Kron Support for CuSparseMatrixCSC (#2370)
  • Broadcasting a function returning an anonymous function with a constructor over CUDA arrays fails to compile, "not isbits" (#2514)
  • CuArray view has different variable type outside x inside the cuda kernel (#2516)
  • Can't build cuDNN on centos7.8 (#2517)
  • Precompile errors (#2519)
  • Precompile errors (#2520)
  • Error returned from CUDA function in CUDA-aware MPI multi-GPU test (#2522)
  • Broadcasting over random static array errors on Julia 1.11 (#2523)
  • gemm_strided_batched only using strided CUDA kernel when first matrix is transposed (#2529)
  • CUDA runtime libraries are loaded from a system path due to LD_LIBRARY_PATH being set (#2530)
  • [Bug] UnifiedMemory buffer changes during LinearAlgebra operations (#2533)
  • Improve system library warning when running under profiler (#2540)
  • Local CUDA settings not propagated to Pkg.test (#2545)
  • Out of Memory when working with Distributed for Small Matrices (#2548)
  • findall is not working with an empty vector of bool (#2553)
  • CUDA code does not return when running under VSC Debugging mode (#2558)
  • dot is quite slow in multinest Arrays (#2559)
  • UndefVarError: backend not defined in GPUArrays (#2564)
  • view() returns CuArray instead of view for 1-D CuArrays (#2566)
  • dot ambiguity (#2568)
  • InvalidIRError thrown only if critical function is not previously compiled (#2573)
  • circular dependency during precompilation (#2579)
  • Sparse MatVec Is Nondeterministic? (#2582)
  • CUDA triggers long Circular dependency list (#2586)
  • Release v5.5.3 for GPUArray v11? (#2587)
  • 'dot' gives different answers when viewing rather than slicing multidimensional arrays (#2589)
  • Scalar indexing when performing kron on two CuVectors (#2591)
  • Faster strided-batched to batched wrapper (#2592)
  • Error when copying data to pinned and resized CPU array (#2594)
  • mapreducedim! size-dependent fail when narrowing float element types (#2595)
  • Missing Enzyme.make_zero in Enzyme extension leads to incorrect behaviour (#2598)
  • 'ArgumentError: array must be non-empty' when attempting to pop idle handles from HandleCache (#2603)
  • Do a release as current one doesn't support GPUArrays v11 (#2606)

v5.5.2

26 Sep 05:51
a1db081

CUDA v5.5.2

Diff since v5.5.1

v5.5.1

23 Sep 10:24
3b05baf

Full Changelog: v5.5.0...v5.5.1

v5.5.0

18 Sep 14:28
1fe8838

CUDA v5.5.0

Blog post

Diff since v5.4.3

Closed issues:

  • LinearAlgebra.norm(x) falls back to generic implementation for x::Transpose and x::Adjoint (#1782)
  • dlclose'ing the compatibility driver can fail (#1848)
  • Creating a sparse diagonal matrix of CuArray(u) (#1857)
  • Support for Julia 1.11 (#2241)
  • CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328)
  • Adding CUDA to a PackageCompiler sysimage causes segfault (#2428)
  • Error using CUDA on Julia 1.10: Number of threads per block exceeds kernel limit (#2438)
  • Error when I load my model (#2439)
  • Driver JLL improvements (#2446)
  • Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread (#2449)
  • CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452)
  • Segmentation Fault on Loading CUDA (#2453)
  • Invalid instruction error when using CUDA (#2454)
  • Missing adapt for sparse and CUDABackend (#2459)
  • CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466)
  • Request: Option to disable the "full GC when under very high memory pressure". (#2467)
  • copyto! ambiguous (#2477)
  • NeuralODE training failed on GPU with Enzyme (#2478)
  • issue with atomic - when running standard test, @atomic modify expression missing field access (#2483)
  • Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484)
  • Issue with compiling CUDA and cuTENSOR using local libraries (#2486)
  • Memory Access error in sparse array constructor (#2494)
  • Forwards-compatible driver breaks CURAND (#2496)
  • CUDA 12.6 Update 1 (#2497)

v5.4.3

09 Jul 08:09
71311af

CUDA v5.4.3

Diff since v5.4.2

Closed issues:

  • Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
  • Broadcasted multiplication with a rational doesn't work (#1926)
  • Incorrect grid size in kron (#2410)
  • GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
  • Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
  • CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
  • Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
  • CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
  • CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
  • CUDA.jl won't install/run on Jetson Orin NX (#2435)

v5.4.2

29 May 07:35
7e6a57a

CUDA v5.4.2

Diff since v5.4.1

v5.4.1

28 May 18:53
5bbd9a7

CUDA v5.4.1

Diff since v5.4.0

v5.4.0

28 May 06:45

CUDA v5.4.0

Blog post

Diff since v5.3.5

Closed issues:

  • CUTENSOR breaks after device_reset! (#2319)
  • cuBLASXt's xt_gemm! incompatible with stream-ordered allocated memory (#2320)
  • Add helper function to recompile CUDA stack (#2364)

v5.3.5

24 May 13:29
7232f85

CUDA v5.3.5

Diff since v5.3.4

Merged pull requests:

  • Avoid constructing MulAddMuls on Julia v1.12+ (#2277) (@dkarrasch)
  • CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
  • Enzyme: allocation functions (#2386) (@wsmoses)
  • Tweaks to prevent context construction on some operations (#2387) (@maleadt)
  • Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
  • CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
  • Backport: Enzyme allocation fns (#2393) (@wsmoses)

Closed issues:

  • Indexing a view uses scalar indexing (#1472)
  • EnzymeCore is an unconditional dependency. (#2380)
  • cuBLASLt wrappers ccall into cuBLAS (#2388)
  • generic_trimatmul! error (#2389)

v5.3.4

15 May 19:28
c373258

CUDA v5.3.4

Diff since v5.3.3

Closed issues:

  • Native Softmax (#175)
  • CUSOLVER: support eigendecomposition (#173)
  • backslash with gpu matrices crashes julia (#161)
  • at-benchmark captures GPU arrays (#156)
  • Support kernels returning Union{} (#62)
  • mul! falls back to generic implementation (#148)
  • \ on qr factorization objects gives a method error (#138)
  • Compiler failure if dependent module only contains a japi1 function (#49)
  • copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
  • Calling Flux.gpu on a view dumps core (#125)
  • Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
  • Guard against exceeding maximum kernel parameter size (#32)
  • Detect common API misuse in error handlers (#31)
  • rand and friends default to Float64 (#108)
  • \ does not work for least squares (#104)
  • ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
  • CuIterator assumes batches to consist of multiple arrays (#86)
  • Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
  • Document (un)supported language features for kernel programming (#13)
  • Missing dispatch for indexing of reshaped arrays (#556)
  • Track array ownership to avoid illegal memory accesses (#763)
  • NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
  • Support for sm_80 cp.async: asynchronous on-device copies (#850)
  • Profiling Julia with Nsight Systems on Windows results in blank window (#862)
  • sort! and partialsort! are considerably slower than CPU versions (#937)
  • mul! does not dispatch on Adjoint (#1363)
  • Cross-device copy of wrapped arrays fails (#1377)
  • Memory allocation becomes very slow when reserved bytes is large (#1540)
  • Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
  • Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
  • device_reset! does not seem to work anymore (#1579)
  • device-side rand() are not random between successive kernel launches (#1633)
  • Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
  • cusparseSetStream_v2 not defined (#1820)
  • Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
  • KernelAbstractions.jl-related issues (#1838)
  • lock failing in multithreaded plan_fft() (#1921)
  • CUSolver finalizer tries to take ReentrantLock (#1923)
  • Testsuite could be more careful about parallel testing (#2192)
  • Opportunistic GC collection (#2303)
  • Unable to use local CUDA runtime toolkit (#2367)
  • Enzyme prevents testing on 1.11 (#2376)