
test fix: Skip NVLink version checks for inactive links #2154

Draft
rwgk wants to merge 1 commit into NVIDIA:main from rwgk:test_system_device_test_nvlink_fix


@rwgk rwgk commented May 29, 2026

Context

This PR fixes a cuda.core system-test failure first observed in the original CI attempt for PR #2130.

PR #2130 added coverage-oriented tests in other areas and did not modify tests/system/test_system_device.py, so the failing test reflects an existing system-test fragility rather than a regression introduced by that PR.

CI log with full failure details:

What Failed

The failing traceback showed that test_nvlink queried nvlink_info.version for link 0 and received NvlinkVersion.VERSION_INVALID from NVML:

tests\system\test_system_device.py:774:
>   version = nvlink_info.version

cuda\core\system\_nvlink.pxi:46:
>   raise RuntimeError("Invalid NvLink version returned for device")
E   RuntimeError: Invalid NvLink version returned for device

The relevant local values in the failure were:

link       = 0
max_links  = 18

The old test iterated over every index in range(NvlinkInfo.max_links) and queried the version before checking whether the link was active. On the failing H100 PCIe/MCDM runner, NVML reported an invalid version for at least one link slot. That is consistent with an inactive or unavailable NVLink slot, and the test should not assume that every slot up to max_links has a valid version.

Fix

This PR changes test_nvlink to query nvlink_info.state before querying nvlink_info.version.

The updated test now:

  • Retrieves the NvlinkInfo object for each possible link index.
  • Queries and validates nvlink_info.state.
  • Skips version validation for inactive links.
  • Keeps the existing version validation for active links.

This preserves the useful test invariant: if a link is active, its version should be available and well-formed. It avoids treating inactive link slots as failures.
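
The state-before-version ordering described above can be sketched as follows. This is a minimal illustration, not the test code from this PR: `LinkState`, `FakeNvlinkInfo`, and `check_links` are hypothetical stand-ins, and the `(major, minor)` tuple used for the version is an assumption made only so the example runs without a GPU.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class LinkState(Enum):
    # Hypothetical stand-in for the link-state values reported by NVML.
    INACTIVE = 0
    ACTIVE = 1


@dataclass
class FakeNvlinkInfo:
    """Hypothetical stand-in for NvlinkInfo; not the cuda.core class."""

    state: LinkState
    raw_version: Optional[Tuple[int, int]]  # None models VERSION_INVALID

    @property
    def version(self):
        # Mirrors the behavior seen in the traceback: an invalid version raises.
        if self.raw_version is None:
            raise RuntimeError("Invalid NvLink version returned for device")
        return self.raw_version


def check_links(links):
    """Validate state for every slot, but version only for active slots."""
    version_checked = []
    for index, info in enumerate(links):
        assert isinstance(info.state, LinkState)
        if info.state is not LinkState.ACTIVE:
            continue  # inactive slot: never touch .version
        version = info.version
        assert isinstance(version, tuple) and len(version) == 2
        version_checked.append(index)
    return version_checked


# Slot 0 is inactive with no valid version; slot 1 is active and well-formed.
links = [
    FakeNvlinkInfo(LinkState.INACTIVE, None),
    FakeNvlinkInfo(LinkState.ACTIVE, (4, 0)),
]
checked = check_links(links)
```

With this ordering, the inactive slot is skipped without raising, while an active slot with an invalid version would still fail the check.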


copy-pr-bot Bot commented May 29, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions Bot added the cuda.core label (Everything related to the cuda.core module) May 29, 2026

rwgk commented May 29, 2026

PR 2130 CI Flake Report: test_system_device.py::test_nvlink

TL;DR: Look for "The strongest signal is:" below.

Workflow run: https://github.com/NVIDIA/cuda-python/actions/runs/26611106556?pr=2130

PR: #2130

2026-05-29T01:12:35.9053712Z [command]"C:\Program Files\Git\bin\git.exe" -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +c549988f4215b2bbb703d76ffb47b66d82c28e63:refs/remotes/origin/pull-request/2130
commit c549988f4215b2bbb703d76ffb47b66d82c28e63 (HEAD -> rluo8→main, upstream/pull-request/2130, rluo8/main)
Merge: d865e33e1d 88363f8f17
Author: Rui Luo <ruluo@nvidia.com>
Date:   Thu May 28 17:46:43 2026 -0700

    Merge branch 'main' into main

Summary

PR 2130 added coverage-oriented tests under cuda_core/tests, but the observed CI failure was not in any of the newly added tests.

The original failed job was:

  • Job: Test win-64 / Python 3.14, CUDA 13.3.0 (wheels), GPU h100 (x2) (MCDM)
  • Job ID: 78418596734
  • Step: Run cuda.core tests
  • Failing test: tests/system/test_system_device.py::test_nvlink
  • Error: RuntimeError: Invalid NvLink version returned for device

Two reruns were observed:

  • Job ID 78435298937: cancelled by the 60-minute job timeout.
  • Job ID 78445744407: completed successfully.

pytest-randomly was active in all three cuda.core attempts.

Original Failure

In the original attempt, test_nvlink failed at 14% progress through the cuda.core test suite:

2026-05-29T01:20:11Z tests/system/test_system_device.py::test_nvlink FAILED [ 14%]

The failure traceback showed:

tests\system\test_system_device.py:774:
>   version = nvlink_info.version

cuda\core\system\_nvlink.pxi:46:
>   raise RuntimeError("Invalid NvLink version returned for device")
E   RuntimeError: Invalid NvLink version returned for device

The local variables shown in the traceback were:

link       = 0
max_links  = 18
nvlink_info = <cuda.core.system._device.NvlinkInfo object at ...>

The implementation in cuda/core/system/_nvlink.pxi calls nvml.device_get_nvlink_version(...) and raises if NVML returns NvlinkVersion.VERSION_INVALID.

The test in cuda_core/tests/system/test_system_device.py iterates all links from 0 to NvlinkInfo.max_links - 1 and asks for both version and state. The failure indicates that, on this runner, link 0 returned VERSION_INVALID rather than a usable NVLink version or a handled unsupported condition.
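
To make the failure mode concrete, here is a small sketch of a version getter that raises on an invalid value, combined with a loop that queries every slot unconditionally. All names (`FakeNvlinkVersion`, `get_version`, `query_all_versions`) and the enum values are hypothetical; only the error message and `max_links = 18` are taken from the traceback above.

```python
from enum import IntEnum


class FakeNvlinkVersion(IntEnum):
    # Hypothetical values; the real NVML enum values are not shown in the logs.
    VERSION_INVALID = 0
    VERSION_4_0 = 1


def get_version(raw_query, link):
    """Return the version for one link, raising if the raw value is invalid."""
    raw = raw_query(link)
    if raw is FakeNvlinkVersion.VERSION_INVALID:
        raise RuntimeError("Invalid NvLink version returned for device")
    return raw


def query_all_versions(raw_query, max_links):
    """Query the version of every slot, with no state check beforehand."""
    return [get_version(raw_query, link) for link in range(max_links)]


def always_invalid(link):
    # Models a device whose link slots all report an invalid version.
    return FakeNvlinkVersion.VERSION_INVALID


try:
    query_all_versions(always_invalid, 18)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)  # raised on the very first slot, link 0
```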

Rerun Behavior

The first rerun timed out at the job level, but test_nvlink had already run before cancellation:

2026-05-29T05:08:28Z tests/system/test_system_device.py::test_nvlink SKIPPED (Unsupported...) [ 14%]
2026-05-29T05:11:36Z ##[error]The operation was canceled.

So the timed-out rerun did not hang before reaching test_nvlink; it reached that test and skipped it successfully.

The second rerun completed successfully. In that run, test_nvlink also skipped:

2026-05-29T06:01:00Z tests/system/test_system_device.py::test_nvlink SKIPPED (Unsupported...) [ 14%]
2026-05-29T06:01:34Z ========= 3247 passed, 332 skipped, 3 xfailed, 89 warnings in 50.73s ==========

pytest-randomly State

pytest-randomly was active in all three cuda.core attempts:

Original failed attempt:

pytest-randomly      4.1.0
Using --randomly-seed=3632140741
plugins: benchmark-5.2.3, mock-3.15.1, randomly-4.1.0, repeat-0.9.4, rerunfailures-16.3, timeout-2.4.0

First rerun, cancelled by timeout:

Using --randomly-seed=4141722146
plugins: benchmark-5.2.3, mock-3.15.1, randomly-4.1.0, repeat-0.9.4, rerunfailures-16.3, timeout-2.4.0

Second rerun, successful:

Using --randomly-seed=295967675
plugins: benchmark-5.2.3, mock-3.15.1, randomly-4.1.0, repeat-0.9.4, rerunfailures-16.3, timeout-2.4.0

Observation: the differing random order may influence where test_nvlink appears in the run, but it does not by itself explain the hardware/NVML return value difference. In all three attempts, test_nvlink appeared around 14% progress.
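
As a rough illustration of why the seed only affects ordering, the sketch below shuffles a list of placeholder test IDs with a seeded generator. It is not pytest-randomly's implementation; `shuffled_order` and the test IDs are hypothetical, and only the seed value comes from the original failed attempt.

```python
import random


def shuffled_order(test_ids, seed):
    """Return a reproducible shuffle of test_ids for the given seed."""
    order = list(test_ids)
    random.Random(seed).shuffle(order)  # seeded generator, no global state
    return order


# Placeholder test IDs, not the real cuda.core test names.
tests = [f"test_{i}" for i in range(10)]

original = shuffled_order(tests, 3632140741)
replay = shuffled_order(tests, 3632140741)
```

Replaying the same seed reproduces the same order, so the seed from a failed run can be reused to rule ordering in or out. It cannot change what NVML returns for a given link.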

Runner Comparison

The exact runner instance differed across all three attempts:

| Attempt | Job ID | Result | Runner name | Reported GPU |
| --- | --- | --- | --- | --- |
| Original | 78418596734 | Failed | 24ba-w-amd-g-h100-l-2-hm6cl-runner-v7tp5 | NVIDIA H100 PCIe |
| Rerun 1 | 78435298937 | Cancelled by timeout | 24ba-w-amd-g-h100-l-2-hm6cl-runner-j7l6c | NVIDIA H100 NVL |
| Rerun 2 | 78445744407 | Passed | 24ba-w-amd-g-h100-l-2-hm6cl-runner-9mx8s | NVIDIA H100 NVL |

All three jobs used:

Current runner version: '2.334.0'
Runner group name: 'nv-gpu-amd64-h100-2gpu'
Machine name: 'NV_RUNNER'
Driver Version: 581.15
CUDA Version: 13.0
Driver mode: MCDM

The important difference is that the failing original attempt landed on an NVIDIA H100 PCIe runner, while both reruns landed on NVIDIA H100 NVL runners.

Interpretation

The failure appears to be an existing cuda.core.system test fragility or platform-specific NVML behavior, not a regression caused by PR 2130.

The strongest signal is:

  • Original run on H100 PCIe: test_nvlink failed because link 0 returned VERSION_INVALID.
  • First rerun (cancelled by timeout) on H100 NVL: test_nvlink skipped as unsupported.
  • Second rerun (successful) on H100 NVL: test_nvlink skipped as unsupported.

This suggests the H100 PCIe MCDM runner exposed a different NVML response path than the H100 NVL MCDM runners. The test assumes that every index in range(NvlinkInfo.max_links) has a valid version. On the failing H100 PCIe runner, at least link 0 did not.
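
The correlation between GPU model and test outcome can be checked mechanically. The sketch below groups the three attempts by reported GPU; the job IDs, GPU names, and outcomes are copied from the logs above, while `outcomes_by_gpu` is a hypothetical helper written for this report.

```python
from collections import defaultdict

# One row per cuda.core attempt, taken from the runner comparison above.
attempts = [
    {"job_id": 78418596734, "gpu": "NVIDIA H100 PCIe", "test_nvlink": "FAILED"},
    {"job_id": 78435298937, "gpu": "NVIDIA H100 NVL", "test_nvlink": "SKIPPED"},
    {"job_id": 78445744407, "gpu": "NVIDIA H100 NVL", "test_nvlink": "SKIPPED"},
]


def outcomes_by_gpu(rows):
    """Map each reported GPU model to the set of test_nvlink outcomes seen on it."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["gpu"]].add(row["test_nvlink"])
    return dict(grouped)


summary = outcomes_by_gpu(attempts)
```

Each GPU model maps to exactly one outcome in this small sample, which is what points at the hardware rather than the test order. Three attempts are too few to treat this as conclusive.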

Possible follow-up for the owning code:

  • Adjust test_nvlink to treat VERSION_INVALID similarly to an unsupported/inactive link, or only query version after verifying the link state or availability.
  • Add diagnostic logging around device name, link index, device_get_nvlink_state, and raw device_get_nvlink_version when the version is invalid.
  • If H100 PCIe should never expose valid NVLink links in this configuration, skip test_nvlink earlier based on device/platform capability.
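
The diagnostic-logging follow-up might look roughly like the sketch below. It is an assumption-laden illustration: `version_or_none` and `StubLink` are hypothetical, the stub only mimics the `state` and `version` attributes discussed in this report, and catching `RuntimeError` broadly is a simplification.

```python
import logging

logger = logging.getLogger("nvlink_diagnostics")


class StubLink:
    """Hypothetical stand-in exposing .state and .version like the report describes."""

    def __init__(self, state, version):
        self.state = state
        self._version = version  # None models an invalid version

    @property
    def version(self):
        if self._version is None:
            raise RuntimeError("Invalid NvLink version returned for device")
        return self._version


def version_or_none(device_name, link, info):
    """Return info.version, or None after logging context when it is invalid."""
    try:
        return info.version
    except RuntimeError as exc:
        # Record enough context to compare runners when the version is invalid.
        logger.warning(
            "device=%r link=%d state=%r: %s", device_name, link, info.state, exc
        )
        return None


invalid = version_or_none("NVIDIA H100 PCIe", 0, StubLink("inactive", None))
valid = version_or_none("NVIDIA H100 NVL", 0, StubLink("active", (4, 0)))
```

Returning `None` keeps the decision about skipping or failing with the caller, so the helper only adds context and does not hide the condition.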

PR 2130 Relevance

PR 2130 only adds tests for coverage in areas such as memory, launcher, linker, program, graph memory resource, and utilities.

The failing test was:

tests/system/test_system_device.py::test_nvlink

That file was not modified by PR 2130. Based on the logs, this should be treated as an unrelated CI/platform flake rather than evidence against the PR's added tests.

rwgk self-assigned this May 29, 2026
rwgk requested a review from mdboom May 29, 2026 16:45
rwgk added the P1 label (Medium priority - Should do) May 29, 2026
rwgk added this to the cuda.core next milestone May 29, 2026
rwgk added the test label (Improvements or additions to tests) May 29, 2026

rwgk commented May 29, 2026

@mdboom I'll defer running the tests until you're back.
