
NO-ISSUE: Add NVIDIA GPU operator only if there are NVIDIA GPUs #7218

Open · wants to merge 2 commits into base: master

Conversation

jhernand (Contributor)
Currently, when the OpenShift AI operator is enabled, the NVIDIA GPU operator is enabled by default, even if there are no NVIDIA GPUs in the hosts. This patch changes that so that the NVIDIA GPU operator will only be added when there is at least one NVIDIA GPU present.

This is preparation for adding support for other GPU operators, in particular the AMD GPU operator: we don't want to always enable all the GPU operators.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 23, 2025
@openshift-ci-robot

@jhernand: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 23, 2025
openshift-ci bot commented Jan 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhernand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2025
@jhernand
Contributor Author

jhernand commented Jan 23, 2025

/hold

@eifrach for this to work we need to recalculate operator dependencies when new hosts are added, as suggested in #7206, otherwise in the UI workflow the NVIDIA GPU operator will never be added, because the step to add operators happens before the step to add hosts.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2025
"OPENSHIFT_AI_SUPPORTED_GPUS": "1af4",
},
&models.Gpu{
VendorID: "10de",
Contributor:

I think we want to make this a constant like VENDOR_ID_NVIDIA so that we can use it wherever it is needed instead of hardwiring it in various places.

Contributor Author:

I will make the nvidiagpu.nvidiaVendorID constant public and use it.
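The suggestion above can be sketched as follows. This is a minimal, self-contained illustration of exporting the vendor ID as a public constant and using it in the check from the diff below; the names `NvidiaVendorID` and the local `Gpu` type are illustrative stand-ins, not the actual assisted-service API.

```go
package main

import "fmt"

// NvidiaVendorID is the PCI vendor ID assigned to NVIDIA. In the actual
// code this would be the nvidiagpu.nvidiaVendorID constant made public;
// the name here is illustrative.
const NvidiaVendorID = "10de"

// Gpu mirrors the relevant field of models.Gpu for this sketch.
type Gpu struct {
	VendorID string
}

// IsSupportedGpu reports whether the GPU is one the NVIDIA GPU operator
// should be enabled for, matching the vendor-ID check shown in the diff.
func IsSupportedGpu(gpu *Gpu) bool {
	return gpu.VendorID == NvidiaVendorID
}

func main() {
	fmt.Println(IsSupportedGpu(&Gpu{VendorID: "10de"})) // NVIDIA GPU
	fmt.Println(IsSupportedGpu(&Gpu{VendorID: "1002"})) // non-NVIDIA GPU
}
```

Callers can then reference the exported constant instead of hardwiring "10de" in multiple places.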

@@ -193,3 +189,7 @@ func (o *operator) GetFeatureSupportID() models.FeatureSupportLevelID {
func (o *operator) GetBundleLabels() []string {
return []string(Operator.Bundles)
}

func IsSupportedGpu(gpu *models.Gpu) bool {
return gpu.VendorID == nvidiaVendorID
paul-maidment (Contributor) commented Jan 23, 2025:

Do we support any NVIDIA GPU? I know that consumer cards also have this vendor ID, not just the data center versions. Do we need to consider the device ID as well?

jhernand (Contributor Author) commented Jan 23, 2025:

The NVIDIA documentation only talks about data center GPUs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-nvidia-data-center-gpus-and-systems . It doesn't say anything (positive or negative) about consumer cards.

We could improve this by explicitly checking against the list of GPUs in that document. I am not sure that is worth it.

Anyhow, if we want to do that, it should be in a different pull request. This pull request only moves the existing check to the nvidiagpu package. Please open a ticket if you believe this needs to be improved.

Contributor:

I mean, I tested the GPU operator a while back and it does seem to work with desktop GPUs to some extent; I had it running on an RTX 3080 Ti.

I suppose it's a question of supported vs. not supported though, so maybe there should be a list of supported devices.
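The "list of supported devices" idea discussed above could look something like this. Everything here is hypothetical: the device IDs in the allowlist are examples only (not an authoritative list of data-center GPUs), and the function and type names do not come from the actual code.

```go
package main

import "fmt"

// PCI vendor ID assigned to NVIDIA.
const nvidiaVendorID = "10de"

// supportedDeviceIDs is a hypothetical allowlist of data-center device
// IDs. The entries are illustrative examples, not a verified list.
var supportedDeviceIDs = map[string]bool{
	"20b0": true, // example entry (illustrative)
	"26b9": true, // example entry (illustrative)
}

// Gpu mirrors the relevant fields of models.Gpu for this sketch.
type Gpu struct {
	VendorID string
	DeviceID string
}

// isSupportedGpu checks the device ID in addition to the vendor ID,
// so consumer cards with the NVIDIA vendor ID would be rejected.
func isSupportedGpu(gpu *Gpu) bool {
	return gpu.VendorID == nvidiaVendorID && supportedDeviceIDs[gpu.DeviceID]
}

func main() {
	fmt.Println(isSupportedGpu(&Gpu{VendorID: "10de", DeviceID: "20b0"})) // allowlisted
	fmt.Println(isSupportedGpu(&Gpu{VendorID: "10de", DeviceID: "ffff"})) // not allowlisted
}
```

The trade-off is that such a list needs ongoing maintenance as NVIDIA releases new data-center GPUs, which is part of why the PR keeps the simpler vendor-ID check.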

Entry(
"NVIDIA is supported even if not explicitly added",
map[string]string{
"OPENSHIFT_AI_SUPPORTED_GPUS": "1af4",
paul-maidment (Contributor) commented Jan 23, 2025:

Trying to understand why 1AF4 is here. Looking up the vendor ID, I can see that this is a Red Hat VirtIO device (of which a GPU can be one?):

https://devicehunt.com/search/type/pci/vendor/1AF4/device/any

Contributor:

Also, make this a constant, same as the suggestion for the NVIDIA one.

Contributor:

Do we need a test case for the scenario where the Nvidia GPU has not been explicitly added?

jhernand (Contributor Author) commented Jan 23, 2025:

As you discovered, 1af4 is the vendor identifier of VirtIO devices. We use it only for testing purposes. By setting the OPENSHIFT_AI_SUPPORTED_GPUS environment variable to 1af4, developers and QE engineers can test the OpenShift AI operator feature, especially the validations, without requiring an actual NVIDIA GPU: they can instead use a KVM virtual machine with a virtual VirtIO GPU.

I will add the constant and the test case.

Contributor Author:

Actually, there is already a test case for the scenario where NVIDIA is not explicitly added to OPENSHIFT_AI_SUPPORTED_GPUS; look for "NVIDIA is supported even if not explicitly added" in this test.

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 39.79592% with 59 lines in your changes missing coverage. Please review.

Project coverage is 67.85%. Comparing base (8dd62c1) to head (72e053a).

Files with missing lines Patch % Lines
internal/cluster/refresh_status_preprocessor.go 32.46% 42 Missing and 10 partials ⚠️
internal/operators/common/common.go 0.00% 5 Missing ⚠️
...nal/operators/openshiftai/openshift_ai_operator.go 77.77% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7218      +/-   ##
==========================================
- Coverage   67.92%   67.85%   -0.08%     
==========================================
  Files         298      298              
  Lines       40710    40797      +87     
==========================================
+ Hits        27654    27682      +28     
- Misses      10580    10624      +44     
- Partials     2476     2491      +15     
Files with missing lines Coverage Δ
internal/cluster/cluster.go 65.94% <100.00%> (ø)
internal/operators/manager.go 79.56% <100.00%> (ø)
...nternal/operators/nvidiagpu/nvidia_gpu_operator.go 32.58% <100.00%> (-0.75%) ⬇️
...nal/operators/openshiftai/openshift_ai_operator.go 56.61% <77.77%> (+7.16%) ⬆️
internal/operators/common/common.go 75.00% <0.00%> (-25.00%) ⬇️
internal/cluster/refresh_status_preprocessor.go 72.35% <32.46%> (-21.98%) ⬇️

... and 4 files with indirect coverage changes

@jhernand jhernand force-pushed the add_nvidia_gpu_operator_only_if_there_are_nvidia_gpus branch from 51892af to 02ea16b Compare January 23, 2025 13:45
@jhernand jhernand mentioned this pull request Jan 23, 2025
@paul-maidment (Contributor)

/lgtm

@paul-maidment (Contributor)

If it becomes necessary to filter by device, this could be added later. I certainly don't think it's a common use case.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2025
@eifrach (Contributor) commented Jan 26, 2025:

@jhernand my work is going to be delayed a bit. Maybe you should merge this; I will rebase/refactor my work.

@jhernand (Contributor Author)

@jhernand my work is going to be delayed a bit. maybe you should merged this - I will rebase / refactor my work

@eifrach is it OK if I reopen #7206 then?

@eifrach (Contributor) commented Jan 27, 2025:

Sure

Currently operator dependencies are only calculated when a cluster is
created or updated. But certain dependencies are dynamic, and may
change when new hosts are added. For example, if a cluster has the
OpenShift AI operator installed, it will also require the NVIDIA GPU
operator only if there are hosts that have NVIDIA GPUs. To support those
dynamic dependencies this patch modifies the cluster monitor so that it
recalculates the operator dependencies before checking validations.

Signed-off-by: Juan Hernandez <[email protected]>
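The dynamic-dependency idea in the commit message above can be sketched as follows. All names here (`Cluster`, `recalculateDependencies`, the operator keys) are illustrative stand-ins, not the actual assisted-service types; the point is only that the monitor recomputes required operators from the current hosts before validations run.

```go
package main

import "fmt"

// PCI vendor ID assigned to NVIDIA.
const nvidiaVendorID = "10de"

// Gpu, Host, and Cluster are simplified stand-ins for the real models.
type Gpu struct{ VendorID string }
type Host struct{ Gpus []Gpu }
type Cluster struct {
	Hosts     []Host
	Operators map[string]bool
}

// hasNvidiaGpu reports whether any host in the cluster has an NVIDIA GPU.
func hasNvidiaGpu(c *Cluster) bool {
	for _, h := range c.Hosts {
		for _, g := range h.Gpus {
			if g.VendorID == nvidiaVendorID {
				return true
			}
		}
	}
	return false
}

// recalculateDependencies runs on every monitor tick, before validations,
// so adding a host with an NVIDIA GPU later still pulls in the GPU operator.
func recalculateDependencies(c *Cluster) {
	if c.Operators["openshift-ai"] {
		c.Operators["nvidia-gpu"] = hasNvidiaGpu(c)
	}
}

func main() {
	c := &Cluster{Operators: map[string]bool{"openshift-ai": true}}
	recalculateDependencies(c)
	fmt.Println(c.Operators["nvidia-gpu"]) // false: no NVIDIA GPU yet
	c.Hosts = append(c.Hosts, Host{Gpus: []Gpu{{VendorID: "10de"}}})
	recalculateDependencies(c)
	fmt.Println(c.Operators["nvidia-gpu"]) // true once a host with an NVIDIA GPU exists
}
```

Recomputing on each monitor pass is what makes the UI workflow work: operators are chosen before hosts are discovered, so the dependency cannot be resolved at cluster-creation time alone.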
@jhernand (Contributor Author)

@eifrach i wasn't able to reopen #7206 because I have changed the branch. I have instead opened a new pull request: #7227.

Currently when the OpenShift AI operator is enabled the NVIDIA GPU is
enabled by default, even if there are no NVIDIA GPUs in the hosts. This
patch changes that so that the NVIDIA GPU operator will only be added
when there is at least one NVIDIA GPU present.

This is a preparation to add support for other GPU operators, in
particular the AMD GPU operator: we don't want to always enable all the
GPU operators.

Signed-off-by: Juan Hernandez <[email protected]>
@jhernand jhernand force-pushed the add_nvidia_gpu_operator_only_if_there_are_nvidia_gpus branch from 02ea16b to 72e053a Compare January 27, 2025 10:48
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
openshift-ci bot commented Jan 27, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 27, 2025
openshift-ci bot commented Jan 27, 2025

@jhernand: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-e2e-ai-operator-ztp 72e053a link true /test edge-e2e-ai-operator-ztp


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels: approved, do-not-merge/hold, jira/valid-reference, size/XL

4 participants