
NO-ISSUE: Add NVIDIA GPU operator only if there are NVIDIA GPUs #7218

Open · wants to merge 2 commits into base: master

Conversation

jhernand (Contributor)
Currently, when the OpenShift AI operator is enabled, the NVIDIA GPU operator is enabled by default, even if there are no NVIDIA GPUs in the hosts. This patch changes that so that the NVIDIA GPU operator will only be added when there is at least one NVIDIA GPU present.

This is preparation for adding support for other GPU operators, in particular the AMD GPU operator: we don't want to always enable all the GPU operators.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 23, 2025
@openshift-ci-robot

@jhernand: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 23, 2025
openshift-ci bot commented Jan 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhernand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2025
@jhernand
Contributor Author

jhernand commented Jan 23, 2025

/hold

@eifrach for this to work we need to recalculate operator dependencies when new hosts are added, as suggested in #7206, otherwise in the UI workflow the NVIDIA GPU operator will never be added, because the step to add operators happens before the step to add hosts.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2025
"OPENSHIFT_AI_SUPPORTED_GPUS": "1af4",
},
&models.Gpu{
VendorID: "10de",
Contributor:

I think we want to make this a constant like VENDOR_ID_NVIDIA so that we can use it wherever it is needed instead of hardwiring it in various places.

Contributor Author:

I will make the nvidiagpu.nvidiaVendorID constant public and use it.
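The suggestion above can be sketched as follows. This is a minimal, self-contained illustration of exporting the vendor ID as a public constant and using it in the check from the diff below; the names `NvidiaVendorID` and the local `Gpu` type are illustrative stand-ins, not the actual assisted-service API.

```go
package main

import "fmt"

// NvidiaVendorID is the PCI vendor ID assigned to NVIDIA. In the actual
// code this would be the nvidiagpu.nvidiaVendorID constant made public;
// the name here is illustrative.
const NvidiaVendorID = "10de"

// Gpu mirrors the relevant field of models.Gpu for this sketch.
type Gpu struct {
	VendorID string
}

// IsSupportedGpu reports whether the GPU is one the NVIDIA GPU operator
// should be enabled for, matching the vendor-ID check shown in the diff.
func IsSupportedGpu(gpu *Gpu) bool {
	return gpu.VendorID == NvidiaVendorID
}

func main() {
	fmt.Println(IsSupportedGpu(&Gpu{VendorID: "10de"})) // NVIDIA GPU
	fmt.Println(IsSupportedGpu(&Gpu{VendorID: "1002"})) // non-NVIDIA GPU
}
```

Callers can then reference the exported constant instead of hardwiring "10de" in multiple places.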

@@ -193,3 +189,7 @@ func (o *operator) GetFeatureSupportID() models.FeatureSupportLevelID {
func (o *operator) GetBundleLabels() []string {
return []string(Operator.Bundles)
}

func IsSupportedGpu(gpu *models.Gpu) bool {
return gpu.VendorID == nvidiaVendorID
paul-maidment (Contributor) commented Jan 23, 2025:

Do we support any NVIDIA GPU? I know that consumer cards also have this vendor ID, not just the data center versions. Do we need to consider the device ID as well?

jhernand (Contributor Author) commented Jan 23, 2025:

The NVIDIA documentation only talks about data center GPUs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#supported-nvidia-data-center-gpus-and-systems . It doesn't say anything (positive or negative) about consumer cards.

We could improve this by explicitly checking against the list of GPUs in that document. I am not sure that is worth it.

Anyhow, if we want to do that, it should be in a different pull request. This pull request only moves the existing check to the nvidiagpu package. Please open a ticket if you believe this needs to be improved.

Contributor:

I mean, I tested the GPU operator a while back and it does seem to work with desktop GPUs to some extent; I had it running on an RTX 3080 Ti.

I suppose it's a question of supported vs. not supported though, so maybe there should be a list of supported devices.
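The "list of supported devices" idea discussed above could look something like this. Everything here is hypothetical: the device IDs in the allowlist are examples only (not an authoritative list of data-center GPUs), and the function and type names do not come from the actual code.

```go
package main

import "fmt"

// PCI vendor ID assigned to NVIDIA.
const nvidiaVendorID = "10de"

// supportedDeviceIDs is a hypothetical allowlist of data-center device
// IDs. The entries are illustrative examples, not a verified list.
var supportedDeviceIDs = map[string]bool{
	"20b0": true, // example entry (illustrative)
	"26b9": true, // example entry (illustrative)
}

// Gpu mirrors the relevant fields of models.Gpu for this sketch.
type Gpu struct {
	VendorID string
	DeviceID string
}

// isSupportedGpu checks the device ID in addition to the vendor ID,
// so consumer cards with the NVIDIA vendor ID would be rejected.
func isSupportedGpu(gpu *Gpu) bool {
	return gpu.VendorID == nvidiaVendorID && supportedDeviceIDs[gpu.DeviceID]
}

func main() {
	fmt.Println(isSupportedGpu(&Gpu{VendorID: "10de", DeviceID: "20b0"})) // allowlisted
	fmt.Println(isSupportedGpu(&Gpu{VendorID: "10de", DeviceID: "ffff"})) // not allowlisted
}
```

The trade-off is that such a list needs ongoing maintenance as NVIDIA releases new data-center GPUs, which is part of why the PR keeps the simpler vendor-ID check.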

Entry(
"NVIDIA is supported even if not explicitly added",
map[string]string{
"OPENSHIFT_AI_SUPPORTED_GPUS": "1af4",
paul-maidment (Contributor) commented Jan 23, 2025:

Trying to understand why 1AF4 is here. Looking up the vendor ID, I can see that this is a Red Hat VirtIO device (of which a GPU can be one?):

https://devicehunt.com/search/type/pci/vendor/1AF4/device/any

Contributor:

Also, make this a constant, same as the suggestion for the NVIDIA one.

Contributor:

Do we need a test case for the scenario where the Nvidia GPU has not been explicitly added?

jhernand (Contributor Author) commented Jan 23, 2025:

As you discovered, 1af4 is the vendor identifier of VirtIO devices. We use it only for testing purposes. By setting the OPENSHIFT_AI_SUPPORTED_GPUS environment variable to 1af4, developers and QE engineers can test the OpenShift AI operator feature, especially the validations, without requiring an actual NVIDIA GPU: they can instead use a KVM virtual machine with a virtual VirtIO GPU.

I will add the constant and the test case.

Contributor Author:

Actually, there is already a test case for the scenario where NVIDIA is not explicitly added to OPENSHIFT_AI_SUPPORTED_GPUS; look for "NVIDIA is supported even if not explicitly added" in this test.

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 39.79592% with 59 lines in your changes missing coverage. Please review.

Project coverage is 67.85%. Comparing base (8dd62c1) to head (72e053a).

Files with missing lines Patch % Lines
internal/cluster/refresh_status_preprocessor.go 32.46% 42 Missing and 10 partials ⚠️
internal/operators/common/common.go 0.00% 5 Missing ⚠️
...nal/operators/openshiftai/openshift_ai_operator.go 77.77% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7218      +/-   ##
==========================================
- Coverage   67.92%   67.85%   -0.08%     
==========================================
  Files         298      298              
  Lines       40710    40797      +87     
==========================================
+ Hits        27654    27682      +28     
- Misses      10580    10624      +44     
- Partials     2476     2491      +15     
Files with missing lines Coverage Δ
internal/cluster/cluster.go 65.94% <100.00%> (ø)
internal/operators/manager.go 79.56% <100.00%> (ø)
...nternal/operators/nvidiagpu/nvidia_gpu_operator.go 32.58% <100.00%> (-0.75%) ⬇️
...nal/operators/openshiftai/openshift_ai_operator.go 56.61% <77.77%> (+7.16%) ⬆️
internal/operators/common/common.go 75.00% <0.00%> (-25.00%) ⬇️
internal/cluster/refresh_status_preprocessor.go 72.35% <32.46%> (-21.98%) ⬇️

... and 4 files with indirect coverage changes

@jhernand jhernand force-pushed the add_nvidia_gpu_operator_only_if_there_are_nvidia_gpus branch from 51892af to 02ea16b Compare January 23, 2025 13:45
@jhernand jhernand mentioned this pull request Jan 23, 2025
@paul-maidment (Contributor)

/lgtm

@paul-maidment (Contributor)

If it becomes necessary to filter by device, this could be added later. I certainly don't think it's a common use case.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2025
@eifrach (Contributor) commented Jan 26, 2025:

@jhernand my work is going to be delayed a bit. Maybe you should merge this; I will rebase/refactor my work.

@jhernand (Contributor Author)

@jhernand my work is going to be delayed a bit. maybe you should merged this - I will rebase / refactor my work

@eifrach is it OK if I reopen #7206 then?

@eifrach (Contributor) commented Jan 27, 2025:

Sure

Currently operator dependencies are only calculated when a cluster is
created or updated. But certain dependencies are dynamic, and may
change when new hosts are added. For example, if a cluster has the
OpenShift AI operator installed, it will also require the NVIDIA GPU
operator only if there are hosts that have NVIDIA GPUs. To support those
dynamic dependencies this patch modifies the cluster monitor so that it
recalculates the operator dependencies before checking validations.

Signed-off-by: Juan Hernandez <[email protected]>
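The dynamic-dependency idea in the commit message above can be sketched as follows. All names here (`Cluster`, `recalculateDependencies`, the operator keys) are illustrative stand-ins, not the actual assisted-service types; the point is only that the monitor recomputes required operators from the current hosts before validations run.

```go
package main

import "fmt"

// PCI vendor ID assigned to NVIDIA.
const nvidiaVendorID = "10de"

// Gpu, Host, and Cluster are simplified stand-ins for the real models.
type Gpu struct{ VendorID string }
type Host struct{ Gpus []Gpu }
type Cluster struct {
	Hosts     []Host
	Operators map[string]bool
}

// hasNvidiaGpu reports whether any host in the cluster has an NVIDIA GPU.
func hasNvidiaGpu(c *Cluster) bool {
	for _, h := range c.Hosts {
		for _, g := range h.Gpus {
			if g.VendorID == nvidiaVendorID {
				return true
			}
		}
	}
	return false
}

// recalculateDependencies runs on every monitor tick, before validations,
// so adding a host with an NVIDIA GPU later still pulls in the GPU operator.
func recalculateDependencies(c *Cluster) {
	if c.Operators["openshift-ai"] {
		c.Operators["nvidia-gpu"] = hasNvidiaGpu(c)
	}
}

func main() {
	c := &Cluster{Operators: map[string]bool{"openshift-ai": true}}
	recalculateDependencies(c)
	fmt.Println(c.Operators["nvidia-gpu"]) // false: no NVIDIA GPU yet
	c.Hosts = append(c.Hosts, Host{Gpus: []Gpu{{VendorID: "10de"}}})
	recalculateDependencies(c)
	fmt.Println(c.Operators["nvidia-gpu"]) // true once a host with an NVIDIA GPU exists
}
```

Recomputing on each monitor pass is what makes the UI workflow work: operators are chosen before hosts are discovered, so the dependency cannot be resolved at cluster-creation time alone.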
@jhernand (Contributor Author)

@eifrach i wasn't able to reopen #7206 because I have changed the branch. I have instead opened a new pull request: #7227.

Currently when the OpenShift AI operator is enabled the NVIDIA GPU is
enabled by default, even if there are no NVIDIA GPUs in the hosts. This
patch changes that so that the NVIDIA GPU operator will only be added
when there is at least one NVIDIA GPU present.

This is a preparation to add support for other GPU operators, in
particular the AMD GPU operator: we don't want to always enable all the
GPU operators.

Signed-off-by: Juan Hernandez <[email protected]>
@jhernand jhernand force-pushed the add_nvidia_gpu_operator_only_if_there_are_nvidia_gpus branch from 02ea16b to 72e053a Compare January 27, 2025 10:48
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
openshift-ci bot commented Jan 27, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 27, 2025
openshift-ci bot commented Jan 27, 2025

@jhernand: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-e2e-ai-operator-ztp 72e053a link true /test edge-e2e-ai-operator-ztp


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels: approved, do-not-merge/hold, jira/valid-reference, size/XL

4 participants