
Add cuDLA bindings#2034

Merged
rwgk merged 9 commits into NVIDIA:main from nikshayshrivastava:cudla-bindings on May 13, 2026

Conversation

@nikshayshrivastava
Contributor

Description

Generated from cudla.h using cybind.

Files added:

  • cycudla.pxd/pyx: Cython layer exposing C header types and functions
  • cudla.pxd/pyx: lowpp Python layer with POD classes, enums, and wrappers
  • _internal/cudla.pxd, cudla_linux.pyx, cudla_windows.pyx: dynamic library loading
  • docs/source/module/cudla.rst: API documentation
  • tests/cudla/: pytest unit tests for enums, POD types, error handling, API surface, and hardware-gated function tests (verified on L4T/Orin)

Build/CI updates:

  • pyproject.toml: added cudla to cuda-toolkit optional dependencies
  • .github/actions/fetch_ctk/action.yml: added libcudla to CTK components
  • docs/source/api.rst: added cudla to toctree

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions Bot added the CI/CD and cuda.bindings labels May 6, 2026
@leofang leofang added this to the cuda.bindings 13.3.0 & 12.9.7 milestone May 6, 2026
@leofang leofang added the P0 and feature labels and removed the CI/CD label May 6, 2026
@github-actions Bot added the CI/CD label May 6, 2026
@nikshayshrivastava nikshayshrivastava marked this pull request as ready for review May 6, 2026 18:44
@leofang leofang requested review from leofang, mdboom and rwgk and removed request for leofang May 11, 2026 21:03
@rwgk
Contributor

rwgk commented May 13, 2026

Note: I created the experimental-only PR #2075 to

  1. see if the .github/actions/fetch_ctk/action.yml changes here work, and
  2. check whether the Cursor review findings hold water (added tests).

@rwgk
Contributor

rwgk commented May 13, 2026

Cursor GPT-5.4 Extra High Fast


Findings

  • High: Non-empty fence arrays are broken in the public wrapper. WaitEvents.pre_fences and SignalEvents.eof_fences return Fence.from_ptr(..., numEvents) in cuda_bindings/cuda/bindings/cudla.pyx:1144 and cuda_bindings/cuda/bindings/cudla.pyx:1310, but Fence.from_ptr() wraps a single fence, and the setters then call len() on that scalar wrapper in cuda_bindings/cuda/bindings/cudla.pyx:1152 and cuda_bindings/cuda/bindings/cudla.pyx:1318. I verified this in TestVenv: len(cudla.Fence()) and we.pre_fences = cudla.Fence() both raise TypeError.
  • High: The module exposes Mode.STANDALONE in cuda_bindings/cuda/bindings/cudla.pyx:1607, but create_device() unconditionally converts that flag into ErrorUnsupportedOperation in cuda_bindings/cuda/bindings/cudla.pyx:1705, and the new tests lock that behavior in at cuda_bindings/tests/cudla/test_cudla_bindings.py:277. The official cuDLA API docs describe CUDLA_STANDALONE as a valid cudlaCreateDevice() mode, so this ships a documented mode that can never succeed.
  • Medium: The new reference page is describing symbols the module does not export. cuda_bindings/docs/source/module/cudla.rst:32, cuda_bindings/docs/source/module/cudla.rst:33, cuda_bindings/docs/source/module/cudla.rst:34, and cuda_bindings/docs/source/module/cudla.rst:55 publish import_external_memory, import_external_semaphore, get_nv_sci_sync_attributes, and Status.SUCCESS, but the public wrapper list stops at module_get_attributes in cuda_bindings/cuda/bindings/cudla.pxd:39. I confirmed at runtime that those three functions are absent and cudla.Status.SUCCESS is also missing.
  • Medium: Task.wait_events and Task.signal_events store raw pointers without retaining the Python wrapper objects they point at. Compare cuda_bindings/cuda/bindings/cudla.pyx:1516 and cuda_bindings/cuda/bindings/cudla.pyx:1527 with the _refs retention used for input_tensor and output_tensor at cuda_bindings/cuda/bindings/cudla.pyx:1484 and cuda_bindings/cuda/bindings/cudla.pyx:1509. A natural call like task.wait_events = cudla.WaitEvents() can leave Task holding a dangling pointer after GC.

Assumption

  • I assumed the intended contract is NVIDIA's published cuDLA API. If this PR is intentionally hybrid-only for now, the enum/docs/tests need to say that explicitly instead of advertising unsupported standalone behavior and standalone-only symbols.

Checks

  • TestVenv/bin/python -m pytest cuda_bindings/tests/cudla/test_cudla_bindings.py -q reports 40 passed, 6 skipped.
  • Targeted runtime probes in TestVenv reproduced the fence-array bug and confirmed the documented cuDLA symbols are missing.

@rwgk
Contributor

rwgk commented May 13, 2026

Main point: look for "Failing tests already in PR 2034" below.

Four tests in this PR (2034) are failing on Linux arm64. To demonstrate that here, we had to transfer the .github/actions/fetch_ctk/action.yml fixes from PR #2075; without those fixes, all builds here fail.


Cursor GPT-5.4 Extra High Fast


PR #2075 CI failure sets overview

Run analyzed: https://github.com/NVIDIA/cuda-python/actions/runs/25775844114

Linux x86_64

Succeeded jobs:

  • Build jobs: Build linux-64, CUDA 13.2.1 / py3.10, Build linux-64, CUDA 13.2.1 / py3.11, Build linux-64, CUDA 13.2.1 / py3.12, Build linux-64, CUDA 13.2.1 / py3.13, Build linux-64, CUDA 13.2.1 / py3.14, Build linux-64, CUDA 13.2.1 / py3.14t
  • Passing test lanes: Test linux-64 / py3.10, 12.9.1, local, v100, Test linux-64 / py3.10, 13.0.2, wheels, l4, Test linux-64 / py3.11, 12.9.1, wheels, rtxpro6000, Test linux-64 / py3.11, 12.9.1, wheels, t4, wsl, Test linux-64 / py3.11, 13.0.2, local, l4, Test linux-64 / py3.12, 12.9.1, local, l4, Test linux-64 / py3.12, 13.0.2, wheels, l4, Test linux-64 / py3.13, 12.9.1, wheels, v100, Test linux-64 / py3.13, 13.0.2, local, h100, Test linux-64 / py3.13, 13.0.2, local, rtxpro6000, Test linux-64 / py3.14, 12.9.1, wheels, t4, Test linux-64 / py3.14, 13.0.2, local, l4, Test linux-64 / py3.14t, 12.9.1, local, t4, Test linux-64 / py3.14t, 13.0.2, local, l4

Failed jobs:

  • Test linux-64 / py3.10, 13.2.1, wheels, l4
  • Test linux-64 / py3.11, 13.2.1, local, l4
  • Test linux-64 / py3.12, 13.2.1, wheels, l4
  • Test linux-64 / py3.12, 13.2.1, wheels, rtx4090, wsl
  • Test linux-64 / py3.13, 13.2.1, local, h100
  • Test linux-64 / py3.13, 13.2.1, local, rtxpro6000
  • Test linux-64 / py3.14, 13.2.1, local, l4
  • Test linux-64 / py3.14, 13.2.1, local, t4 (x2)
  • Test linux-64 / py3.14t, 13.2.1, local, h100 (x2)
  • Test linux-64 / py3.14t, 13.2.1, local, l4

One failure cluster:

  • 10 jobs, all of the 13.2.1 x86_64 lanes above.
  • Shared failure set in sampled wheels and local jobs: 7 failing cudla tests.
  • The 7 failures are TestDocumentedApiSurface::test_documented_functions_exist, TestDocumentedApiSurface::test_documented_status_success_member_exists, TestFenceArraySemantics::test_wait_events_pre_fences_round_trip, TestFenceArraySemantics::test_signal_events_eof_fences_round_trip, TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[wait_events-WaitEvents], TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[signal_events-SignalEvents], and TestStandaloneMode::test_create_device_accepts_standalone_mode_when_backend_supports_it.
  • Failing tests already in PR 2034: none

Linux arm64

Succeeded jobs:

  • Build jobs: Build linux-aarch64, CUDA 13.2.1 / py3.10, Build linux-aarch64, CUDA 13.2.1 / py3.11, Build linux-aarch64, CUDA 13.2.1 / py3.12, Build linux-aarch64, CUDA 13.2.1 / py3.13, Build linux-aarch64, CUDA 13.2.1 / py3.14, Build linux-aarch64, CUDA 13.2.1 / py3.14t
  • Passing test lanes: Test linux-aarch64 / py3.10, 12.9.1, local, a100, Test linux-aarch64 / py3.10, 13.0.2, wheels, l4, Test linux-aarch64 / py3.11, 12.9.1, wheels, l4, Test linux-aarch64 / py3.11, 13.0.2, local, a100, Test linux-aarch64 / py3.12, 12.9.1, local, a100, Test linux-aarch64 / py3.12, 13.0.2, wheels, l4, Test linux-aarch64 / py3.13, 12.9.1, wheels, l4, Test linux-aarch64 / py3.13, 13.0.2, local, a100, Test linux-aarch64 / py3.14, 12.9.1, wheels, a100, Test linux-aarch64 / py3.14, 13.0.2, local, l4, Test linux-aarch64 / py3.14t, 12.9.1, local, l4, Test linux-aarch64 / py3.14t, 13.0.2, wheels, a100

Failed jobs:

  • Test linux-aarch64 / py3.10, 13.2.1, wheels, a100
  • Test linux-aarch64 / py3.11, 13.2.1, local, l4
  • Test linux-aarch64 / py3.12, 13.2.1, wheels, a100
  • Test linux-aarch64 / py3.13, 13.2.1, local, l4
  • Test linux-aarch64 / py3.14, 13.2.1, local, a100
  • Test linux-aarch64 / py3.14t, 13.2.1, local, l4

One failure cluster:

  • 6 jobs, all of the 13.2.1 arm64 lanes above.
  • Shared failure set in sampled wheels and local jobs: 10 failing cudla tests.
  • Group 1 within that set is the same 6 shared regressions seen on the other platforms: TestDocumentedApiSurface::test_documented_functions_exist, TestDocumentedApiSurface::test_documented_status_success_member_exists, TestFenceArraySemantics::test_wait_events_pre_fences_round_trip, TestFenceArraySemantics::test_signal_events_eof_fences_round_trip, TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[wait_events-WaitEvents], and TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[signal_events-SignalEvents].
  • Group 2 is arm64-specific additional runtime fallout: TestFunctions::test_get_version, TestFunctions::test_device_get_count, TestFunctions::test_create_destroy_device, and TestFunctions::test_mem_register_unregister, all ending in ValueError: 15 is not a valid Status.
  • TestStandaloneMode::test_create_device_accepts_standalone_mode_when_backend_supports_it is skipped on sampled arm64 jobs because the host already has a cuDLA runtime preloaded, so it does not contribute to the arm64 failure count.
  • Failing tests already in PR 2034:
    • TestFunctions::test_get_version
    • TestFunctions::test_device_get_count
    • TestFunctions::test_create_destroy_device
    • TestFunctions::test_mem_register_unregister

Windows x86_64

Succeeded jobs:

  • Build jobs: Build win-64, CUDA 13.2.1 / py3.10, Build win-64, CUDA 13.2.1 / py3.11, Build win-64, CUDA 13.2.1 / py3.12, Build win-64, CUDA 13.2.1 / py3.13, Build win-64, CUDA 13.2.1 / py3.14, Build win-64, CUDA 13.2.1 / py3.14t
  • Passing test lanes: Test win-64 / py3.10, 12.9.1, wheels, rtx2080 (WDDM), Test win-64 / py3.10, 13.0.2, local, rtxpro6000 (TCC), Test win-64 / py3.11, 12.9.1, local, v100 (MCDM), Test win-64 / py3.11, 13.0.2, wheels, rtx4090 (WDDM), Test win-64 / py3.12, 12.9.1, wheels, l4 (MCDM), Test win-64 / py3.12, 13.0.2, local, a100 (TCC), Test win-64 / py3.13, 12.9.1, local, l4 (TCC), Test win-64 / py3.13, 13.0.2, wheels, rtxpro6000 (MCDM), Test win-64 / py3.14, 12.9.1, wheels, v100 (TCC), Test win-64 / py3.14, 13.0.2, local, l4 (MCDM), Test win-64 / py3.14t, 12.9.1, local, l4 (TCC), Test win-64 / py3.14t, 13.0.2, wheels, a100 (MCDM)

Failed jobs:

  • Test win-64 / py3.10, 13.2.1, local, rtxpro6000 (TCC)
  • Test win-64 / py3.11, 13.2.1, wheels, rtx4090 (WDDM) (rerun job 75722332870)
  • Test win-64 / py3.12, 13.2.1, local, a100 (TCC)
  • Test win-64 / py3.13, 13.2.1, wheels, rtxpro6000 (MCDM)
  • Test win-64 / py3.14, 13.2.1, local, l4 (MCDM)
  • Test win-64 / py3.14t, 13.2.1, wheels, a100 (MCDM)

One failure cluster:

  • 6 jobs, all of the 13.2.1 Windows lanes above.
  • Shared failure set in sampled local, wheels, and rerun wheels jobs: 6 failing cudla tests.
  • The 6 failures are TestDocumentedApiSurface::test_documented_functions_exist, TestDocumentedApiSurface::test_documented_status_success_member_exists, TestFenceArraySemantics::test_wait_events_pre_fences_round_trip, TestFenceArraySemantics::test_signal_events_eof_fences_round_trip, TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[wait_events-WaitEvents], and TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[signal_events-SignalEvents].
  • TestStandaloneMode::test_create_device_accepts_standalone_mode_when_backend_supports_it is skipped on Windows because the fake backend test is Linux-only.
  • Failing tests already in PR 2034: none

rwgk and others added 4 commits May 13, 2026 09:51
Use redistrib metadata to skip unsupported mini-CTK components and resolve archive paths through a tested helper, including container-safe workspace paths for runtime jobs.
@nikshayshrivastava
Contributor Author

/ok to test

@nikshayshrivastava
Contributor Author

/ok to test

@rwgk
Contributor

rwgk commented May 13, 2026

/ok to test 8cdc5ae

@rwgk
Contributor

rwgk commented May 13, 2026

For easy future reference, archiving the test additions under PR #2075, which go with the findings posted in an earlier comment here.

0001-Add-cuDLA-regression-tests-for-review-findings.patch

@rwgk rwgk merged commit e047570 into NVIDIA:main May 13, 2026
94 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

