Releases · JuliaGPU/CUDA.jl
v5.6.0
CUDA v5.6.0
CUDA.jl v5.6 is a relatively minor release, with the most important change happening behind the scenes: GPUArrays.jl v11 has switched to KernelAbstractions.jl (#2524).
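To illustrate what the KernelAbstractions.jl-based stack looks like from user code, here is a minimal sketch (not taken from the release notes; kernel name and arguments are illustrative) of a hand-written kernel launched on a CuArray through the CUDA backend:

```julia
using CUDA, KernelAbstractions

# A simple element-wise kernel written against the KernelAbstractions.jl API.
@kernel function saxpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] += a * x[i]
end

x = CUDA.rand(Float32, 1024)
y = CUDA.zeros(Float32, 1024)

backend = KernelAbstractions.get_backend(y)   # CUDABackend() for CuArrays
saxpy_kernel!(backend)(y, 2f0, x; ndrange = length(y))
KernelAbstractions.synchronize(backend)
```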
Features
- Update to CUDA 12.6.2 (#2512)
- CUSOLVER: support for `Xgeev!` (#2513), `XsyevBatched` (#2577), `gesv!` and `gels!` (#2406)
- CUBLAS: added multiplication of transpose / adjoint matrices by diagonal matrices (#2518, #2538); see the sketch after this list
- Improve handle cache performance in the presence of many short-lived tasks (#2583)
- CUFFT: Pre-allocate the buffer required for complex-to-real FFTs only once (#2578)
- Improved batched pointer conversion for very large batches (#2608)
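As an example of the new CUBLAS-backed diagonal products (#2518, #2538), a minimal sketch, assuming Float32 inputs; the variable names are illustrative and not taken from the changelog:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 4, 4)
d = CUDA.rand(Float32, 4)

# Products of transposed/adjoint CuMatrices with Diagonal wrappers
# are now handled by dedicated CUBLAS paths instead of generic fallbacks.
B = transpose(A) * Diagonal(d)
C = Diagonal(d) * A'
```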
Bug fixes
- Fix `findall` with an empty CuArray (#2554); see the sketch after this list
- CUBLAS: Fix use of level 1 methods with strided arrays (#2528)
- CUSOLVER: Fix `Xgesvdr!` (#2556)
- Preserve the array buffer type with more linear algebra operations (#2534)
- Work around LinearAlgebra.jl breakage in Julia 1.11.2 concerning generic triangular `(l/r)mul!` (#2585)
- Fix ambiguity of `LinearAlgebra.dot` (#2569)
- Native RNG: Fixes when working with very large arrays (#2561)
- Avoid a deadlock due to union splitting in the `mapreduce` kernel (#2595)
- Fix pinning of resized CPU memory by automatically re-pinning (#2599)
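For instance, the `findall` fix (#2554) concerns the corner case sketched below (a hypothetical snippet, not taken from the test suite):

```julia
using CUDA

mask = CuArray(Bool[])   # zero-length boolean array on the GPU
idx  = findall(mask)     # previously errored; now returns an empty index array
@assert isempty(idx)
```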
Merged pull requests:
- [CUSOLVER] Interface gesv! and gels! (#2406) (@amontoison)
- Update wrappers for CUDA v12.6.2 (#2512) (@amontoison)
- [CUSOLVER] Interface Xgeev! (#2513) (@amontoison)
- Added multiplication of transpose / adjoint matrices by diagonal matrices (#2518) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 1, (keep existing compat) (#2521) (@github-actions[bot])
- Adapt to GPUArrays.jl transition to KernelAbstractions.jl. (#2524) (@maleadt)
- Switch CI to 1.11. (#2525) (@maleadt)
- CUTENSOR: Reduce amount of broadcasts compiled during tests. (#2527) (@maleadt)
- CUBLAS: Don't use BLAS1 wrappers for strided arrays, only vectors. (#2528) (@maleadt)
- Clarify the synchronize(ctx)/device_synchronize() docstrings (#2532) (@JamesWrigley)
- Issue #2533: Preserving the buffer type in linear algebra (#2534) (@kmp5VT)
- Clarify description of how `LocalPreferences.toml` is generated in the docs (#2535) (@glwagner)
- Adapt to JuliaGPU/GPUArrays.jl#567. (#2537) (@maleadt)
- Removed allocations for transpose/adjoint - diagonal multiplications (#2538) (@RedRussianBear)
- Consistent use of Nsight Compute (#2541) (@huiyuxie)
- Fix formatting in profiling docs page (#2543) (@efaulhaber)
- Fix typo in EnzymeCoreExt.jl (#2550) (@wsmoses)
- Enhance warning under a profiler (#2552) (@huiyuxie)
- Fix findall with an empty CuArray of Bool (#2554) (@amontoison)
- [CUSOLVER] Fix Xgesvdr! (#2556) (@amontoison)
- Test restore Enzyme.jl (#2557) (@wsmoses)
- Native RNG fixes for very large arrays (#2561) (@maleadt)
- [Enzyme] Mark launch_configuration as inactive (#2563) (@wsmoses)
- Update EnzymeCoreExt.jl (#2565) (@simenhu)
- Fix ambiguity of LinearAlgebra.dot (#2569) (@amontoison)
- [CUSOLVER] Add more tests for the dense SVD (#2574) (@amontoison)
- [CUSOLVER] Interface XsyevBatched (#2577) (@amontoison)
- [CUFFT] Preallocate a buffer for complex-to-real FFT (#2578) (@amontoison)
- Run the GC when failing to find a handle, but lots are active. (#2583) (@maleadt)
- Work around LinearAlgebra.jl breakage in 1.11.2. (#2585) (@maleadt)
- mapreduce: avoid deadlock by forcing the accumulator type. (#2596) (@maleadt)
- Switch to GitHub Actions-based benchmarks. (#2597) (@maleadt)
- Re-pin variable sized memory (#2599) (@jipolanco)
- Enzyme: add make_zero of cuarrays (#2600) (@wsmoses)
- Update cache.jl (#2604) (@jarbus)
- Enzyme: mark device_sync as non-differentiable [only downstream] (#2605) (@wsmoses)
- Move strided batch pointer conversion to GPU (#2608) (@THargreaves)
- Split linalg tests into multiple files (#2609) (@kshyatt)
Closed issues:
- Inference failure with sort(::CuMatrix) after loading MLDatasets (#2258)
- Kron Support for CuSparseMatrixCSC (#2370)
- Broadcasting a function returning an anonymous function with a constructor over CUDA arrays fails to compile, "not isbits" (#2514)
- CuArray view has a different type outside vs. inside the CUDA kernel (#2516)
- Can't build cuDNN on centos7.8 (#2517)
- Precompile errors (#2519)
- Precompile errors (#2520)
- Error returned from CUDA function in CUDA-aware MPI multi-GPU test (#2522)
- Broadcasting over random static array errors on Julia 1.11 (#2523)
- `gemm_strided_batched` only using strided CUDA kernel when first matrix is transposed (#2529)
- CUDA runtime libraries are loaded from a system path due to LD_LIBRARY_PATH being set (#2530)
- [Bug] `UnifiedMemory` buffer changes during LinearAlgebra operations (#2533)
- Improve system library warning when running under profiler (#2540)
- Local CUDA settings not propagated to Pkg.test (#2545)
- Out of Memory when working with Distributed for Small Matrices (#2548)
- findall is not working with an empty vector of bool (#2553)
- CUDA code does not return when running under VSC Debugging mode (#2558)
- dot is quite slow in multinest Arrays (#2559)
- UndefVarError: `backend` not defined in `GPUArrays` (#2564)
- view() returns CuArray instead of view for 1-D CuArrays (#2566)
- dot ambiguity (#2568)
- InvalidIRError thrown only if critical function is not previously compiled (#2573)
- circular dependency during precompilation (#2579)
- Sparse MatVec Is Nondeterministic? (#2582)
- CUDA triggers long Circular dependency list (#2586)
- Release v5.5.3 for GPUArray v11? (#2587)
- 'dot' gives different answers when viewing rather than slicing multidimensional arrays (#2589)
- Scalar indexing when performing `kron` on two `CuVector`s (#2591)
- Faster strided-batched to batched wrapper (#2592)
- Error when copying data to pinned and resized CPU array (#2594)
- mapreducedim! size-dependent fail when narrowing float element types (#2595)
- Missing `Enzyme.make_zero` in Enzyme extension leads to incorrect behaviour (#2598)
- 'ArgumentError: array must be non-empty' when attempting to pop idle handles from HandleCache (#2603)
- Do a release as current one doesn't support `GPUArrays` v11 (#2606)
v5.5.2
CUDA v5.5.2
Merged pull requests:
v5.5.1
What's Changed
- Update wrappers for CUDA v12.6.1 by @amontoison in #2499
- Enzyme: adapt to pending version breaking update by @wsmoses in #2490
Full Changelog: v5.5.0...v5.5.1
v5.5.0
CUDA v5.5.0
Merged pull requests:
- Add support for arbitrary group sizes in `gemm_grouped_batched!` (#2334) (@lpawela)
- Add kernel compilation requirements to docs (#2416) (@termi-official)
- Enzyme: reverse mode kernels (#2422) (@wsmoses)
- CUFFT: Support Float16 (#2430) (@eschnett)
- Updated compute-sanitizer documentation (#2440) (@alexp616)
- Add troubleshooting section for NSight Compute (#2442) (@efaulhaber)
- Correct typo in documentation (#2445) (@eschnett)
- Bump minimal Julia requirement to v1.10. (#2447) (@maleadt)
- fix compute-sanitizer typo (#2448) (@alexp616)
- Address a corner case when establishing p2p access (#2457) (@findmyway)
- Implementation of spdiagm for CUSPARSE (#2458) (@walexaindre)
- Update to CUDA 12.6. (#2461) (@maleadt)
- CompatHelper: bump compat for GPUCompiler to 0.27, (keep existing compat) (#2462) (@github-actions[bot])
- Bump CUDA driver JLL. (#2463) (@maleadt)
- CUSOLVER (dense): cache workspace in fat handle (#2465) (@bjarthur)
- Revert "Run full GC when under very high memory pressure." (#2469) (@maleadt)
- Fix a method deprecation. (#2470) (@maleadt)
- Add Enzyme sum derivatives (#2471) (@wsmoses)
- Re-use pre-converted kernel arguments when launching kernels. (#2472) (@maleadt)
- Bump LLVM compat (#2473) (@maleadt)
- Bump subpackage compat. (#2475) (@maleadt)
- Enzyme: Reversemode cudaconvert (#2476) (@wsmoses)
- Ignore Enzyme.jl CI failures (#2479) (@maleadt)
- Re-enable enzyme testing (#2480) (@wsmoses)
- Add missing GC.@preserves. (#2487) (@maleadt)
- [CUSPARSE] Implement a sparse GEMV for CuSparseMatrixCSC * CuSparseVector (#2488) (@amontoison); see the sketch after this list
- [CUSPARSE] Add conversions between CuSparseVector and CuSparseMatrices (#2489) (@amontoison)
- Update to LLVM 9.1. (#2491) (@maleadt)
- Use at-consistent_overlay for 1.11 compatibility. (#2492) (@maleadt)
- Rework NNlib CI. (#2493) (@maleadt)
- CUSPARSE: Fix sparse constructor with duplicate elements. (#2495) (@maleadt)
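As a usage sketch for the new sparse GEMV path (#2488), assuming the standard CUSPARSE container types; the sizes and densities are arbitrary, and the exact return container may vary:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

# Upload CPU sparse data to the corresponding CUSPARSE containers.
A = CuSparseMatrixCSC(sprand(Float32, 1_000, 1_000, 0.01))
x = CuSparseVector(sprand(Float32, 1_000, 0.05))

y = A * x   # sparse matrix-vector product, now dispatched to the CUSPARSE wrapper
```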
Closed issues:
- `LinearAlgebra.norm(x)` falls back to generic implementation for `x::Transpose` and `x::Adjoint` (#1782)
- dlclose'ing the compatibility driver can fail (#1848)
- Creating a sparse diagonal matrix of CuArray(u) (#1857)
- Support for Julia 1.11 (#2241)
- CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328)
- Adding CUDA to a PackageCompiler sysimage causes segfault (#2428)
- Error using CUDA on Julia 1.10: `Number of threads per block exceeds kernel limit` (#2438)
- Error when I load my model (#2439)
- Driver JLL improvements (#2446)
- Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread (#2449)
- CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452)
- Segmentation Fault on Loading CUDA (#2453)
- `Invalid instruction` error when `using CUDA` (#2454)
- Missing `adapt` for sparse and `CUDABackend` (#2459)
- CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466)
- Request: Option to disable the "full GC when under very high memory pressure". (#2467)
- copyto! ambiguous (#2477)
- NeuralODE training failed on GPU with Enzyme (#2478)
- issue with atomic - when running standard test, @atomic modify expression missing field access (#2483)
- Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484)
- Issue with compiling CUDA and cuTENSOR using local libraries (#2486)
- Memory Access error in sparse array constructor (#2494)
- Forwards-compatible driver breaks CURAND (#2496)
- CUDA 12.6 Update 1 (#2497)
v5.4.3
CUDA v5.4.3
Merged pull requests:
- add cublasgetrsBatched (#2385) (@bjarthur)
- add two quirks for rationals (#2403) (@lanceXwq); see the sketch after this list
- Bump cuDNN (#2404) (@maleadt)
- Add convert method for ScaledPlan (#2409) (@david-macmahon)
- Conditionalize a quirk. (#2411) (@maleadt)
- Relax signature of generic matvecmul! (#2414) (@dkarrasch)
- Fix kron launch configuration. (#2418) (@maleadt)
- Run full GC when under very high memory pressure. (#2421) (@maleadt)
- Enzyme: Fix cuarray return type (#2425) (@wsmoses)
- CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
- pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
- Profiler tweaks. (#2432) (@maleadt)
- Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
- Correct workspace handling (#2437) (@maleadt)
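The rational-number quirks from #2403 target broadcasts like the following (a minimal sketch related to issue #1926 below; not taken from the test suite):

```julia
using CUDA

x = CUDA.rand(Float32, 8)

# Broadcasting a Rational scalar over a CuArray previously failed to compile
# on the GPU (see issue #1926); the quirks added in #2403 address this.
y = x .* (1//2)
```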
Closed issues:
- Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
- Broadcasted multiplication with a rational doesn't work (#1926)
- Incorrect grid size in `kron` (#2410)
- GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
- Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
- CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
- Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
- CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
- CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
- CUDA.jl won't install/run on Jetson Orin NX (#2435)
v5.4.2
CUDA v5.4.2
Merged pull requests:
v5.4.1
CUDA v5.4.1
Merged pull requests:
v5.4.0
CUDA v5.4.0
Merged pull requests:
- Support CUDA 12.5 (#2392) (@maleadt)
- Mark cuarray as noalias (#2395) (@wsmoses)
- Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
- Enable correct pool access for cublasXt. (#2398) (@maleadt)
- More fine-grained CUPTI version checks. (#2399) (@maleadt)
Closed issues:
v5.3.5
CUDA v5.3.5
Merged pull requests:
- Avoid constructing `MulAddMul`s on Julia v1.12+ (#2277) (@dkarrasch)
- CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
- Enzyme: allocation functions (#2386) (@wsmoses)
- Tweaks to prevent context construction on some operations (#2387) (@maleadt)
- Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
- CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
- Backport: Enzyme allocation fns (#2393) (@wsmoses)
Closed issues:
v5.3.4
CUDA v5.3.4
Merged pull requests:
- Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
- Handle cache improvements (#2352) (@maleadt)
- Fix cuTensorNet compat (#2354) (@maleadt)
- Optimize array allocation. (#2355) (@maleadt)
- Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
- Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
- Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
- Make generic_trimatmul more specific (#2359) (@tgymnich)
- Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
- Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
- Enzyme: Forward mode sync (#2369) (@wsmoses)
- Enzyme: support fill (#2371) (@wsmoses)
- unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
- Remove external_gvars. (#2373) (@maleadt)
- Tegra support with artifacts (#2374) (@maleadt)
- Backport Enzyme extension (#2375) (@wsmoses)
- Add note about --check-bounds=yes (#2378) (@Zinoex)
- Test Enzyme in a separate CI job. (#2379) (@maleadt)
- Fix tests for Tegra. (#2381) (@maleadt)
- Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)
Closed issues:
- Native Softmax (#175)
- CUSOLVER: support eigendecomposition (#173)
- backslash with gpu matrices crashes julia (#161)
- at-benchmark captures GPU arrays (#156)
- Support kernels returning Union{} (#62)
- mul! falls back to generic implementation (#148)
- \ on qr factorization objects gives a method error (#138)
- Compiler failure if dependent module only contains a `japi1` function (#49)
- copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
- Calling Flux.gpu on a view dumps core (#125)
- Creating `CuArray{Tracker.TrackedReal{Float64},1}` a few times causes segfaults (#121)
- Guard against exceeding maximum kernel parameter size (#32)
- Detect common API misuse in error handlers (#31)
- `rand` and friends default to `Float64` (#108)
- \ does not work for least squares (#104)
- ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
- CuIterator assumes batches to consist of multiple arrays (#86)
- Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
- Document (un)supported language features for kernel programming (#13)
- Missing dispatch for indexing of reshaped arrays (#556)
- Track array ownership to avoid illegal memory accesses (#763)
- NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
- Support for `sm_80` `cp.async`: asynchronous on-device copies (#850)
- Profiling Julia with Nsight Systems on Windows results in blank window (#862)
- sort! and partialsort! are considerably slower than CPU versions (#937)
- mul! does not dispatch on Adjoint (#1363)
- Cross-device copy of wrapped arrays fails (#1377)
- Memory allocation becomes very slow when reserved bytes is large (#1540)
- Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
- Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
- device_reset! does not seem to work anymore (#1579)
- device-side rand() are not random between successive kernel launches (#1633)
- Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
- `cusparseSetStream_v2` not defined (#1820)
- Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
- KernelAbstractions.jl-related issues (#1838)
- lock failing in multithreaded plan_fft() (#1921)
- CUSolver finalizer tries to take ReentrantLock (#1923)
- Testsuite could be more careful about parallel testing (#2192)
- Opportunistic GC collection (#2303)
- Unable to use local CUDA runtime toolkit (#2367)
- Enzyme prevents testing on 1.11 (#2376)