Skip to content

Fix MIG-related tests in cuda.core.system#2065

Merged
mdboom merged 4 commits into
NVIDIA:mainfrom
mdboom:nvml-qa
May 12, 2026
Merged

Fix MIG-related tests in cuda.core.system#2065
mdboom merged 4 commits into
NVIDIA:mainfrom
mdboom:nvml-qa

Conversation

@mdboom
Copy link
Copy Markdown
Contributor

@mdboom mdboom commented May 12, 2026

Fix bugs in new MIG-related functionality in cuda.core.system.

@mdboom mdboom self-assigned this May 12, 2026
@github-actions github-actions Bot added cuda.bindings Everything related to the cuda.bindings module cuda.core Everything related to the cuda.core module labels May 12, 2026
@mdboom mdboom added bug Something isn't working test Improvements or additions to tests labels May 12, 2026
@mdboom mdboom requested a review from rwgk May 12, 2026 12:16
def test_num_devices():
num_devices = system.get_num_devices()
expected_num_devices = handle_return(runtime.cudaGetDeviceCount())
assert num_devices == expected_num_devices, "Number of devices does not match expected value"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just removed this test altogether since it isn't universally true and finding a way to write all of the exceptions seemed a little pointless.

@mdboom mdboom added this to the cuda.core next milestone May 12, 2026
@github-actions

This comment has been minimized.

@mdboom mdboom enabled auto-merge (squash) May 12, 2026 15:01
cuda_devices = get_cuda_device_names()
nvml_devices = get_nvml_device_names()

if any("Thor" in device["name"] for device in nvml_devices):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this could also affect "Orin"?

@leofang leofang added the P0 High priority - Must do! label May 12, 2026
@mdboom mdboom merged commit 870956d into NVIDIA:main May 12, 2026
97 checks passed
@github-actions
Copy link
Copy Markdown

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 12, 2026

Cursor GPT-5.4 Extra High Fast


Findings

  • High cuda_core/cuda/core/system/_system.pyx:105, cuda_core/cuda/core/_device.pyx:1003, cuda_core/tests/system/test_system_system.py:43
    The Thor/MIG count mismatch from issue 309 is still unfixed. system.get_num_devices() still returns NVML's physical-device count whenever NVML is available, and cuda.core.Device.get_all_devices() still uses that value to enumerate CUDA devices. Deleting test_num_devices() removes the direct assertion on that bug, but does not fix the underlying behavior, and test_devices() still goes through the same bad path on a topology where NVML sees 1 parent GPU and CUDA sees 2 MIG devices.

  • Medium cuda_core/tests/system/test_system_device.py:58, cuda_core/cuda/core/system/_device.pyx:367
    test_to_cuda_device() now catches any RuntimeError and continues. On the Thor setup from the logs, system.Device.get_all_devices() only yields the physical NVML parent, so this test can make zero assertions and still pass. That masks both the intended MIG-parent exception and any future UUID-mapping regression.

  • Medium cuda_bindings/tests/nvml/test_cuda.py:61
    The new Thor handling is an unconditional product-name skip. That disables coverage for every Thor host, not just the specific "CUDA enumerates MIG instances while NVML enumerates the parent GPU" case, and it does nothing for the same topology if it shows up under a different product name.

  • Medium cuda_bindings/tests/nvml/conftest.py:103, cuda_bindings/tests/nvml/conftest.py:118, cuda_bindings/tests/nvml/test_pynvml.py:45
    mig_handles() still derives the MIG index upper bound from handles[0] and then reuses that bound for every GPU. On mixed systems, if GPU 0 has fewer or zero MIG slots than a later GPU, the fixture becomes incomplete or empty and test_device_get_attributes() skips even though live MIG handles exist.

  • Low cuda_core/tests/test_device.py:50, cuda_core/cuda/core/system/_device.pyx:899
    The new pci_info() behavior is effectively untested. contextlib.suppress(RuntimeError) hides all pci_info failures, not just the expected MIG-without-PCI case, so a real regression on a normal device would also pass silently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.bindings Everything related to the cuda.bindings module cuda.core Everything related to the cuda.core module P0 High priority - Must do! test Improvements or additions to tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants