Skip to content

Conversation

vishesh92
Copy link
Member

@vishesh92 vishesh92 commented Jul 30, 2025

Description

Continuation of #11143

Thanks to Swen from ProIO, I was able to get access to a server with Nvidia A10 GPUs.

This PR contains fixes especially in the discovery for mdev devices with SR-IOV enabled. vGPU with A10 has been tested successfully with these changes.

Generated Summary

This pull request introduces several updates to improve GPU device handling, enhance vGPU discovery, and simplify the UI logic for GPU summaries. The changes include updates to the KVM hypervisor plugin, GPU discovery scripts, backend GPU service logic, and the GPU summary UI component.

Updates to GPU Device Handling and vGPU Discovery:

  1. KVM Hypervisor Plugin Enhancements:

    • Modified LibvirtGpuDef.java to include the model='vfio-pci' attribute in the XML for MDEV devices, ensuring compatibility with vfio-pci-based configurations.
  2. GPU Discovery Script Improvements:

    • Refactored gpudiscovery.sh to introduce a reusable process_mdev_instances function for handling MDEV instances, consolidating logic and improving maintainability. [1] [2]
    • Updated the XML parsing logic to correctly locate MDEV UUIDs by targeting the source/address element.
    • Added support for discovering vGPU instances on Virtual Functions (VFs) for NVIDIA SR-IOV devices.
    • Consolidated vGPU instance data into a unified JSON array for better organization.

Backend Logic Enhancements:

  1. GPU Service Updates:
    • Improved handling of GPU device types in GpuServiceImpl.java, ensuring proper assignment of VGPUOnly and passthrough types based on device capabilities. [1] [2]
    • Removed unnecessary state transitions for unmanaged GPU devices to streamline error handling.

UI Simplifications:

  1. GPU Summary Tab Refinements:
    • Simplified the logic in GPUSummaryTab.vue by removing filters that excluded passthrough profiles from card summaries, ensuring all devices are considered. [1] [2]
    • Excluded devices of type VGPUOnly from the summary calculations to focus on relevant device types.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes GPU discovery functionality to properly handle MDEV devices with SR-IOV enabled, particularly for vGPU configurations with A10 GPUs. The changes improve device discovery, XML generation, and UI handling for GPU devices.

  • Refactored GPU discovery script to better handle MDEV instances on both Physical Functions and Virtual Functions
  • Updated GPU device type assignment logic to properly categorize VGPUOnly and passthrough devices
  • Simplified GPU summary UI by removing unnecessary filtering and improving device type handling

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
gpudiscovery.sh Refactored MDEV discovery with reusable function and added SR-IOV VF support
GpuServiceImpl.java Improved GPU device type assignment and parent device handling
GPUSummaryTab.vue Simplified card summary logic and excluded VGPUOnly devices from calculations
LibvirtGpuDef.java Added model='vfio-pci' attribute for MDEV device XML generation

Copy link

codecov bot commented Jul 30, 2025

Codecov Report

❌ Patch coverage is 30.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.17%. Comparing base (76cfcb4) to head (f9deffd).
⚠️ Report is 23 commits behind head on main.

Files with missing lines Patch % Lines
...java/org/apache/cloudstack/gpu/GpuServiceImpl.java 22.22% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main   #11340   +/-   ##
=========================================
  Coverage     17.17%   17.17%           
- Complexity    14993    14994    +1     
=========================================
  Files          5869     5869           
  Lines        521728   521732    +4     
  Branches      63506    63506           
=========================================
+ Hits          89604    89606    +2     
- Misses       422053   422055    +2     
  Partials      10071    10071           
Flag Coverage Δ
uitests 3.76% <ø> (+<0.01%) ⬆️
unittests 18.15% <30.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@vishesh92 vishesh92 force-pushed the fixup-gpu-integration branch from ebdf22a to f9deffd Compare July 30, 2025 08:45
@vishesh92
Copy link
Member Author

@blueorangutan package

@apache apache deleted a comment from blueorangutan Jul 30, 2025
@blueorangutan
Copy link

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14466

@sureshanaparti
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@DaanHoogland DaanHoogland removed their request for review July 30, 2025 12:03
@DaanHoogland
Copy link
Contributor

@vishesh92 Do we have the facilities to test this? cc @sureshanaparti

@vishesh92
Copy link
Member Author

@vishesh92 Do we have the facilities to test this? cc @sureshanaparti

I am sharing an environment with 2 nvidia A10 with @rosi-shapeblue for testing.

Copy link
Collaborator

@rosi-shapeblue rosi-shapeblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Tested all workflows, found no underlying issues.

@sureshanaparti sureshanaparti merged commit bcd738c into apache:main Jul 31, 2025
24 of 26 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Apache CloudStack 4.21.0 Jul 31, 2025
@DaanHoogland DaanHoogland deleted the fixup-gpu-integration branch July 31, 2025 13:09
@blueorangutan
Copy link

[SF] Trillian test result (tid-13977)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 91370 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11340-t13977-kvm-ol8.zip
Smoke tests completed. 136 look OK, 9 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestLoadBalance>:setup Error 0.00 test_loadbalance.py
test_delete_account Error 1518.49 test_network.py
test_delete_network_while_vm_on_it Error 1.20 test_network.py
test_deploy_vm_l2network Error 1.19 test_network.py
test_l2network_restart Error 2.31 test_network.py
ContextSuite context=TestPortForwarding>:setup Error 3.65 test_network.py
ContextSuite context=TestPublicIP>:setup Error 12.40 test_network.py
test_reboot_router Failure 0.09 test_network.py
test_releaseIP Error 6.28 test_network.py
test_releaseIP_using_IP Error 6.19 test_network.py
ContextSuite context=TestRouterRules>:setup Error 6.27 test_network.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:setup Error 1521.82 test_network.py
ContextSuite context=TestAdapterTypeForNic>:setup Error 0.00 test_nic_adapter_type.py
ContextSuite context=TestIsolatedNetworksPasswdServer>:setup Error 0.00 test_password_server.py
ContextSuite context=TestPortForwardingRules>:setup Error 0.00 test_portforwardingrules.py
ContextSuite context=TestResetVmOnReboot>:setup Error 0.00 test_reset_vm_on_reboot.py
ContextSuite context=TestRouterDHCPHosts>:setup Error 0.00 test_router_dhcphosts.py
ContextSuite context=TestRouterDHCPOpts>:setup Error 0.00 test_router_dhcphosts.py
ContextSuite context=TestRouterDns>:setup Error 0.00 test_router_dns.py
test_01_snapshot_root_disk Error 3.29 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 47.01 test_snapshots.py
ContextSuite context=TestSnapshotStandaloneBackup>:setup Error 1595.87 test_snapshots.py

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants