Fixing GPU Adapter Count test to be more dynamic and fail resistant #4038
umfranci commented on Oct 10, 2025
- The verify_gpu_adapter_count test validates GPU counts by comparing the outputs of the lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
- This hardcoded approach fails on new GPU models and requires a manual code update each time new GPU hardware is released. That causes testing delays and maintenance overhead, and it raises the failure rate of the test.
- The aim here is to implement dynamic GPU detection that identifies new GPU models without manual intervention, while maintaining backward compatibility with the existing GPU detection logic.
- Suggested Fix:
  - Primary detection: Continue using the existing hardcoded GPU list for known models.
  - Fallback mechanism: When no matches are found in the hardcoded list:
    - Group VMBus devices by the last segment of their device ID (the device ID suffix).
    - Identify GPU device groups where all entries are marked as "PCI Express pass-through".
    - Validate that the count matches the nvidia-smi output for accuracy.
  - Direct counting: Add a new function that gets the GPU count directly from the nvidia-smi command output, eliminating the dependency on maintaining a hardcoded GPU model list.
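The fallback and direct-counting steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `VmbusDevice`, `count_gpus_from_nvidia_smi`, and `count_gpus_by_segment` are hypothetical names, and the sketch assumes the nvidia-smi count is taken from the output of `nvidia-smi -L`, which prints one `GPU <index>: ...` line per GPU.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

PASS_THROUGH = "PCI Express pass-through"


@dataclass
class VmbusDevice:
    # Hypothetical stand-in for a parsed lsvmbus entry; not LISA's actual class.
    name: str
    device_id: str


def count_gpus_from_nvidia_smi(list_output: str) -> int:
    # Assumes the text comes from `nvidia-smi -L`: one "GPU <index>: ..." line per GPU.
    return sum(
        1 for line in list_output.splitlines() if line.strip().startswith("GPU ")
    )


def count_gpus_by_segment(devices: List[VmbusDevice], expected: int) -> int:
    # Group devices by the last dash-separated segment of the device ID.
    groups: Dict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        suffix = device.device_id.strip("{}").rsplit("-", 1)[-1]
        groups[suffix].append(device)

    # Accept a group only if every entry is pass-through and the group size
    # matches the count reported by nvidia-smi; otherwise report no match.
    for members in groups.values():
        all_pass_through = all(PASS_THROUGH in m.name for m in members)
        if all_pass_through and len(members) == expected:
            return len(members)
    return 0
```

Returning 0 when no group matches keeps the fallback conservative: the caller can then fail the test with the raw lsvmbus output rather than trusting an ambiguous group.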
@squirrelsc @LiliDeng any further inputs/comments on this please?
lisa/features/gpu.py
Outdated
        return 0

    def _get_gpu_count_by_device_id_segment(self, vmbus_devices: List[Any]) -> int:
It looks like this method doesn't add much beyond the raw information. All VMBus devices should already be listed by previous commands in the LISA log for troubleshooting. Unless the list is long (say, over 50 entries), there is no need to check and print it again.
True. The initial intent was to use this segmentation to try to reduce the failure rate of the test case.
How about removing this method?
    def _has_sequential_pattern(self, devices: List[Any]) -> bool:
        """
        Check if devices have sequential numbering in their IDs.
        GPUs typically have patterns like 0101, 0102, 0103, 0104.
Where did you find this info? Could you add a link above? If there are other types of devices, maybe they’re listed in a similar way too.
I could not find an official doc for it, but this is a trend commonly observed on multi-GPU SKUs such as GB200 and MI300. Example:
Device_ID = {56475055-0002-0000-3130-303237344131}
Device_ID = {56475055-0003-0000-3130-303237344131}
Device_ID = {56475055-0004-0000-3130-303237344131}
Device_ID = {56475055-0005-0000-3130-303237344131}
Device_ID = {56475055-0006-0000-3130-303237344131}
Device_ID = {56475055-0007-0000-3130-303237344131}
Device_ID = {56475055-0008-0000-3130-303237344131}
Device_ID = {56475055-0009-0000-3130-303237344131}
Device_ID = {00000003-0101-0000-3135-423331303142}
Device_ID = {00000203-0102-0000-3135-423331303142}
Device_ID = {00001003-0103-0001-3135-423331303142}
Device_ID = {00001203-0104-0001-3135-423331303142}
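To make the observation concrete, the example IDs above can be grouped by their last segment and checked for consecutive values in the second field. This is only an illustrative sketch with hypothetical helper names (`group_second_fields`, `is_sequential`), not the PR's `_has_sequential_pattern`; it also assumes the second field is a hexadecimal number, and the pattern itself is an observation rather than a documented guarantee.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Device IDs copied verbatim from the example above.
SAMPLE = """
Device_ID = {56475055-0002-0000-3130-303237344131}
Device_ID = {56475055-0003-0000-3130-303237344131}
Device_ID = {56475055-0004-0000-3130-303237344131}
Device_ID = {56475055-0005-0000-3130-303237344131}
Device_ID = {56475055-0006-0000-3130-303237344131}
Device_ID = {56475055-0007-0000-3130-303237344131}
Device_ID = {56475055-0008-0000-3130-303237344131}
Device_ID = {56475055-0009-0000-3130-303237344131}
Device_ID = {00000003-0101-0000-3135-423331303142}
Device_ID = {00000203-0102-0000-3135-423331303142}
Device_ID = {00001003-0103-0001-3135-423331303142}
Device_ID = {00001203-0104-0001-3135-423331303142}
"""

DEVICE_ID = re.compile(r"\{([0-9a-fA-F-]+)\}")


def group_second_fields(text: str) -> Dict[str, List[int]]:
    # Map each last segment to the second-field values seen with it.
    groups: Dict[str, List[int]] = defaultdict(list)
    for guid in DEVICE_ID.findall(text):
        fields = guid.split("-")
        # Assumption: the second field is read as a hexadecimal number.
        groups[fields[-1]].append(int(fields[1], 16))
    return dict(groups)


def is_sequential(values: List[int]) -> bool:
    # True when the sorted values increase by exactly one at each step.
    ordered = sorted(values)
    return len(ordered) > 1 and all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    for suffix, values in group_second_fields(SAMPLE).items():
        print(suffix, len(values), is_sequential(values))
```

On this sample the IDs fall into two groups, one of eight devices and one of four, and both are sequential. Nothing in the check is specific to GPUs, so any other device type that numbers its IDs the same way would pass it too.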
It's not an official pattern, and it may be confused with other device types in the future. Please remove them.