
Add cuDLA bindings#2034

Merged
rwgk merged 9 commits into NVIDIA:main from nikshayshrivastava:cudla-bindings on May 13, 2026

Conversation

@nikshayshrivastava
Contributor

Description

Generated from cudla.h using cybind.

Files added:

  • cycudla.pxd/pyx: Cython layer exposing C header types and functions
  • cudla.pxd/pyx: lowpp Python layer with POD classes, enums, and wrappers
  • _internal/cudla.pxd, cudla_linux.pyx, cudla_windows.pyx: dynamic library loading
  • docs/source/module/cudla.rst: API documentation
  • tests/cudla/: pytest unit tests for enums, POD types, error handling, API surface, and hardware-gated function tests (verified on L4T/Orin)

Build/CI updates:

  • pyproject.toml: added cudla to cuda-toolkit optional dependencies
  • .github/actions/fetch_ctk/action.yml: added libcudla to CTK components
  • docs/source/api.rst: added cudla to toctree

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions Bot added the CI/CD and cuda.bindings labels May 6, 2026
@leofang leofang added this to the cuda.bindings 13.3.0 & 12.9.7 milestone May 6, 2026
@leofang leofang added the P0 and feature labels and removed the CI/CD label May 6, 2026
@github-actions Bot added the CI/CD label May 6, 2026
@nikshayshrivastava nikshayshrivastava marked this pull request as ready for review May 6, 2026 18:44
@leofang leofang requested review from leofang, mdboom and rwgk and removed request for leofang May 11, 2026 21:03
@rwgk
Contributor

rwgk commented May 13, 2026

Note: I created the experimental-only PR #2075 to

  1. see if the .github/actions/fetch_ctk/action.yml changes here work, and
  2. check whether the Cursor review findings hold water (added tests).

@rwgk
Contributor

rwgk commented May 13, 2026

Cursor GPT-5.4 Extra High Fast


Findings

  • High: Non-empty fence arrays are broken in the public wrapper. WaitEvents.pre_fences and SignalEvents.eof_fences return Fence.from_ptr(..., numEvents) in cuda_bindings/cuda/bindings/cudla.pyx:1144 and cuda_bindings/cuda/bindings/cudla.pyx:1310, but Fence.from_ptr() wraps a single fence, and the setters then call len() on that scalar wrapper in cuda_bindings/cuda/bindings/cudla.pyx:1152 and cuda_bindings/cuda/bindings/cudla.pyx:1318. I verified this in TestVenv: len(cudla.Fence()) and we.pre_fences = cudla.Fence() both raise TypeError.
  • High: The module exposes Mode.STANDALONE in cuda_bindings/cuda/bindings/cudla.pyx:1607, but create_device() unconditionally converts that flag into ErrorUnsupportedOperation in cuda_bindings/cuda/bindings/cudla.pyx:1705, and the new tests lock that behavior in at cuda_bindings/tests/cudla/test_cudla_bindings.py:277. The official cuDLA API docs describe CUDLA_STANDALONE as a valid cudlaCreateDevice() mode, so this ships a documented mode that can never succeed.
  • Medium: The new reference page is describing symbols the module does not export. cuda_bindings/docs/source/module/cudla.rst:32, cuda_bindings/docs/source/module/cudla.rst:33, cuda_bindings/docs/source/module/cudla.rst:34, and cuda_bindings/docs/source/module/cudla.rst:55 publish import_external_memory, import_external_semaphore, get_nv_sci_sync_attributes, and Status.SUCCESS, but the public wrapper list stops at module_get_attributes in cuda_bindings/cuda/bindings/cudla.pxd:39. I confirmed at runtime that those three functions are absent and cudla.Status.SUCCESS is also missing.
  • Medium: Task.wait_events and Task.signal_events store raw pointers without retaining the Python wrapper objects they point at. Compare cuda_bindings/cuda/bindings/cudla.pyx:1516 and cuda_bindings/cuda/bindings/cudla.pyx:1527 with the _refs retention used for input_tensor and output_tensor at cuda_bindings/cuda/bindings/cudla.pyx:1484 and cuda_bindings/cuda/bindings/cudla.pyx:1509. A natural call like task.wait_events = cudla.WaitEvents() can leave Task holding a dangling pointer after GC.

Assumption

  • I assumed the intended contract is NVIDIA's published cuDLA API. If this PR is intentionally hybrid-only for now, the enum/docs/tests need to say that explicitly instead of advertising unsupported standalone behavior and standalone-only symbols.

Checks

  • TestVenv/bin/python -m pytest cuda_bindings/tests/cudla/test_cudla_bindings.py -q reports 40 passed, 6 skipped.
  • Targeted runtime probes in TestVenv reproduced the fence-array bug and confirmed the documented cuDLA symbols are missing.

@rwgk
Contributor

rwgk commented May 13, 2026

Main point: look for "Failing tests already in PR 2034" below.

Four tests in this PR (2034) are failing on Linux arm64. To demonstrate that here, we had to transfer the .github/actions/fetch_ctk/action.yml fixes from PR #2075; without those fixes, all builds here fail.


Cursor GPT-5.4 Extra High Fast


PR #2075 CI failure sets overview

Run analyzed: https://github.com/NVIDIA/cuda-python/actions/runs/25775844114

Linux x86_64

Succeeded jobs:

  • Build jobs: Build linux-64, CUDA 13.2.1 / py3.10, Build linux-64, CUDA 13.2.1 / py3.11, Build linux-64, CUDA 13.2.1 / py3.12, Build linux-64, CUDA 13.2.1 / py3.13, Build linux-64, CUDA 13.2.1 / py3.14, Build linux-64, CUDA 13.2.1 / py3.14t
  • Passing test lanes: Test linux-64 / py3.10, 12.9.1, local, v100, Test linux-64 / py3.10, 13.0.2, wheels, l4, Test linux-64 / py3.11, 12.9.1, wheels, rtxpro6000, Test linux-64 / py3.11, 12.9.1, wheels, t4, wsl, Test linux-64 / py3.11, 13.0.2, local, l4, Test linux-64 / py3.12, 12.9.1, local, l4, Test linux-64 / py3.12, 13.0.2, wheels, l4, Test linux-64 / py3.13, 12.9.1, wheels, v100, Test linux-64 / py3.13, 13.0.2, local, h100, Test linux-64 / py3.13, 13.0.2, local, rtxpro6000, Test linux-64 / py3.14, 12.9.1, wheels, t4, Test linux-64 / py3.14, 13.0.2, local, l4, Test linux-64 / py3.14t, 12.9.1, local, t4, Test linux-64 / py3.14t, 13.0.2, local, l4

Failed jobs:

  • Test linux-64 / py3.10, 13.2.1, wheels, l4
  • Test linux-64 / py3.11, 13.2.1, local, l4
  • Test linux-64 / py3.12, 13.2.1, wheels, l4
  • Test linux-64 / py3.12, 13.2.1, wheels, rtx4090, wsl
  • Test linux-64 / py3.13, 13.2.1, local, h100
  • Test linux-64 / py3.13, 13.2.1, local, rtxpro6000
  • Test linux-64 / py3.14, 13.2.1, local, l4
  • Test linux-64 / py3.14, 13.2.1, local, t4 (x2)
  • Test linux-64 / py3.14t, 13.2.1, local, h100 (x2)
  • Test linux-64 / py3.14t, 13.2.1, local, l4

One failure cluster:

  • 10 jobs, all of the 13.2.1 x86_64 lanes above.
  • Shared failure set in sampled wheels and local jobs: 7 failing cudla tests.
  • The 7 failures are TestDocumentedApiSurface::test_documented_functions_exist, TestDocumentedApiSurface::test_documented_status_success_member_exists, TestFenceArraySemantics::test_wait_events_pre_fences_round_trip, TestFenceArraySemantics::test_signal_events_eof_fences_round_trip, TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[wait_events-WaitEvents], TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[signal_events-SignalEvents], and TestStandaloneMode::test_create_device_accepts_standalone_mode_when_backend_supports_it.
  • Failing tests already in PR 2034: none

Linux arm64

Succeeded jobs:

  • Build jobs: Build linux-aarch64, CUDA 13.2.1 / py3.10, Build linux-aarch64, CUDA 13.2.1 / py3.11, Build linux-aarch64, CUDA 13.2.1 / py3.12, Build linux-aarch64, CUDA 13.2.1 / py3.13, Build linux-aarch64, CUDA 13.2.1 / py3.14, Build linux-aarch64, CUDA 13.2.1 / py3.14t
  • Passing test lanes: Test linux-aarch64 / py3.10, 12.9.1, local, a100, Test linux-aarch64 / py3.10, 13.0.2, wheels, l4, Test linux-aarch64 / py3.11, 12.9.1, wheels, l4, Test linux-aarch64 / py3.11, 13.0.2, local, a100, Test linux-aarch64 / py3.12, 12.9.1, local, a100, Test linux-aarch64 / py3.12, 13.0.2, wheels, l4, Test linux-aarch64 / py3.13, 12.9.1, wheels, l4, Test linux-aarch64 / py3.13, 13.0.2, local, a100, Test linux-aarch64 / py3.14, 12.9.1, wheels, a100, Test linux-aarch64 / py3.14, 13.0.2, local, l4, Test linux-aarch64 / py3.14t, 12.9.1, local, l4, Test linux-aarch64 / py3.14t, 13.0.2, wheels, a100

Failed jobs:

  • Test linux-aarch64 / py3.10, 13.2.1, wheels, a100
  • Test linux-aarch64 / py3.11, 13.2.1, local, l4
  • Test linux-aarch64 / py3.12, 13.2.1, wheels, a100
  • Test linux-aarch64 / py3.13, 13.2.1, local, l4
  • Test linux-aarch64 / py3.14, 13.2.1, local, a100
  • Test linux-aarch64 / py3.14t, 13.2.1, local, l4

One failure cluster:

  • 6 jobs, all of the 13.2.1 arm64 lanes above.
  • Shared failure set in sampled wheels and local jobs: 10 failing cudla tests.
  • Group 1 within that set is the same 6 shared regressions seen on the other platforms: TestDocumentedApiSurface::test_documented_functions_exist, TestDocumentedApiSurface::test_documented_status_success_member_exists, TestFenceArraySemantics::test_wait_events_pre_fences_round_trip, TestFenceArraySemantics::test_signal_events_eof_fences_round_trip, TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[wait_events-WaitEvents], and TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[signal_events-SignalEvents].
  • Group 2 is arm64-specific additional runtime fallout: TestFunctions::test_get_version, TestFunctions::test_device_get_count, TestFunctions::test_create_destroy_device, and TestFunctions::test_mem_register_unregister, all ending in ValueError: 15 is not a valid Status.
  • TestStandaloneMode::test_create_device_accepts_standalone_mode_when_backend_supports_it is skipped on sampled arm64 jobs because the host already has a cuDLA runtime preloaded, so it does not contribute to the arm64 failure count.
  • Failing tests already in PR 2034:
    • TestFunctions::test_get_version
    • TestFunctions::test_device_get_count
    • TestFunctions::test_create_destroy_device
    • TestFunctions::test_mem_register_unregister

Windows x86_64

Succeeded jobs:

  • Build jobs: Build win-64, CUDA 13.2.1 / py3.10, Build win-64, CUDA 13.2.1 / py3.11, Build win-64, CUDA 13.2.1 / py3.12, Build win-64, CUDA 13.2.1 / py3.13, Build win-64, CUDA 13.2.1 / py3.14, Build win-64, CUDA 13.2.1 / py3.14t
  • Passing test lanes: Test win-64 / py3.10, 12.9.1, wheels, rtx2080 (WDDM), Test win-64 / py3.10, 13.0.2, local, rtxpro6000 (TCC), Test win-64 / py3.11, 12.9.1, local, v100 (MCDM), Test win-64 / py3.11, 13.0.2, wheels, rtx4090 (WDDM), Test win-64 / py3.12, 12.9.1, wheels, l4 (MCDM), Test win-64 / py3.12, 13.0.2, local, a100 (TCC), Test win-64 / py3.13, 12.9.1, local, l4 (TCC), Test win-64 / py3.13, 13.0.2, wheels, rtxpro6000 (MCDM), Test win-64 / py3.14, 12.9.1, wheels, v100 (TCC), Test win-64 / py3.14, 13.0.2, local, l4 (MCDM), Test win-64 / py3.14t, 12.9.1, local, l4 (TCC), Test win-64 / py3.14t, 13.0.2, wheels, a100 (MCDM)

Failed jobs:

  • Test win-64 / py3.10, 13.2.1, local, rtxpro6000 (TCC)
  • Test win-64 / py3.11, 13.2.1, wheels, rtx4090 (WDDM) (rerun job 75722332870)
  • Test win-64 / py3.12, 13.2.1, local, a100 (TCC)
  • Test win-64 / py3.13, 13.2.1, wheels, rtxpro6000 (MCDM)
  • Test win-64 / py3.14, 13.2.1, local, l4 (MCDM)
  • Test win-64 / py3.14t, 13.2.1, wheels, a100 (MCDM)

One failure cluster:

  • 6 jobs, all of the 13.2.1 Windows lanes above.
  • Shared failure set in sampled local, wheels, and rerun wheels jobs: 6 failing cudla tests.
  • The 6 failures are TestDocumentedApiSurface::test_documented_functions_exist, TestDocumentedApiSurface::test_documented_status_success_member_exists, TestFenceArraySemantics::test_wait_events_pre_fences_round_trip, TestFenceArraySemantics::test_signal_events_eof_fences_round_trip, TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[wait_events-WaitEvents], and TestTaskReferenceRetention::test_task_retains_assigned_event_wrappers[signal_events-SignalEvents].
  • TestStandaloneMode::test_create_device_accepts_standalone_mode_when_backend_supports_it is skipped on Windows because the fake backend test is Linux-only.
  • Failing tests already in PR 2034: none

rwgk and others added 4 commits May 13, 2026 09:51
Use redistrib metadata to skip unsupported mini-CTK components and resolve archive paths through a tested helper, including container-safe workspace paths for runtime jobs.
@nikshayshrivastava
Contributor Author

/ok to test

@nikshayshrivastava
Contributor Author

/ok to test

@rwgk
Contributor

rwgk commented May 13, 2026

/ok to test 8cdc5ae

@rwgk
Contributor

rwgk commented May 13, 2026

For easy future reference, archiving the test additions under PR #2075, which go with the findings posted in an earlier comment here.

0001-Add-cuDLA-regression-tests-for-review-findings.patch

@rwgk rwgk merged commit e047570 into NVIDIA:main May 13, 2026
94 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

