-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NO-ISSUE: Add NVIDIA GPU operator only if there are NVIDIA GPUs #7218
base: master
Are you sure you want to change the base?
NO-ISSUE: Add NVIDIA GPU operator only if there are NVIDIA GPUs #7218
Conversation
@jhernand: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jhernand The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
"OPENSHIFT_AI_SUPPORTED_GPUS": "1af4", | ||
}, | ||
&models.Gpu{ | ||
VendorID: "10de", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to make this a constant like VENDOR_ID_NVIDIA
so that we can use it wherever it is needed instead of hardwiring it in various places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make the nvidiagpu.nvidiaVendorID
constant public and use it.
@@ -193,3 +189,7 @@ func (o *operator) GetFeatureSupportID() models.FeatureSupportLevelID { | |||
func (o *operator) GetBundleLabels() []string { | |||
return []string(Operator.Bundles) | |||
} | |||
|
|||
func IsSupportedGpu(gpu *models.Gpu) bool { | |||
return gpu.VendorID == nvidiaVendorID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we support any NVidia GPU?
I know that consumer cards also have this vendor ID not just the data center versions.
Do we need to consider device ID also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NVIDIA documentation only talks about data center GPUs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-nvidia-data-center-gpus-and-systems . It doesn't say anything (positive or negative) about consumer cards.
We could improve this explicitly checking the list of GPUs listed in that document. I am not sure if that is worth.
Anyhow, if we want to do that it should be in a different pull request. This pull request only moves the existing check to the nvidiagpu
package. Please open a ticket if you believe this needs to be improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, I tested the GPU operator a while back and it does seem to work with desktop GPUs to some extent, I had it running on an RTX3080ti.
I suppose it's a question of supported vs not supported though, so maybe there should be a list of supported devices.
Entry( | ||
"NVIDIA is supported even if not explicitly added", | ||
map[string]string{ | ||
"OPENSHIFT_AI_SUPPORTED_GPUS": "1af4", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to understand why 1AF4 is here?
Looking up the vendor ID I can see that this is a RedHat Virtio device? (of which GPU can be one?)
https://devicehunt.com/search/type/pci/vendor/1AF4/device/any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, make this a constant, same as the suggestion for the Nvidia one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a test case for the scenario where the Nvidia GPU has not been explicitly added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you discovered 14fa
is the vendor identifier of VirtIO devices. We use it only for testing purposes. Setting the OPENSHIFT_AI_SUPPORTED_GPUS
environment variable to 14fa
developers and QE engineers can test the OpenShift AI operator feature, specially the validations, without requiring an actual NVIDIA GPU: they can instead use a KVM virtual machine with a virtual VirtIO GPU.
I will add the constant and the test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there is already a test case for the scenario where NVIDIA is not explicitly added to OPENSHIFT_AI_SUPPORTED_GPUS
, look for NVIDIA is supported even if not explicitly added
in this test.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7218 +/- ##
==========================================
- Coverage 67.92% 67.85% -0.08%
==========================================
Files 298 298
Lines 40710 40797 +87
==========================================
+ Hits 27654 27682 +28
- Misses 10580 10624 +44
- Partials 2476 2491 +15
|
51892af
to
02ea16b
Compare
/lgtm |
If it becomes necessary to filter by device, this could be added later. I certainly don't think it's a common use case. |
@jhernand my work is going to be delayed a bit. maybe you should merged this - I will rebase / refactor my work |
Sure |
Currently operator dependencies are only calculated when a cluster is created or updated. But certain dependencies are dynamic, and may change when new hosts are added. For example, if a cluster has the OpenShift AI operator installed, it will also require the NVIDIA GPU operator only if there are hosts that have NVIDIA GPUs. To support those dynamic dependencies this patch modifies the cluster monitor so that it recalculates the operator dependencies before checking validations. Signed-off-by: Juan Hernandez <[email protected]>
Currently when the OpenShift AI operator is enabled the NVIDIA GPU is enabled by default, even if there are no NVIDIA GPUs in the hosts. This patch changes that so that the NVIDIA GPU operator will only be added when there is at least one NVIDIA GPU present. This is a preparation to add support for other GPU operators, in particular the AMD GPU operator: we don't want to always enable all the GPU operators. Signed-off-by: Juan Hernandez <[email protected]>
02ea16b
to
72e053a
Compare
New changes are detected. LGTM label has been removed. |
@jhernand: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Currently when the OpenShift AI operator is enabled the NVIDIA GPU is enabled by default, even if there are no NVIDIA GPUs in the hosts. This patch changes that so that the NVIDIA GPU operator will only be added when there is at least one NVIDIA GPU present.
This is a preparation to add support for other GPU operators, in particular the AMD GPU operator: we don't want to always enable all the GPU operators.
List all the issues related to this PR
What environments does this code impact?
How was this code tested?
Checklist
docs
, README, etc)Reviewers Checklist