Use native CDI in container runtimes when supported #1285
Question. Out of scope: Should we sort these constants to allow us to find them more easily?
So we're saying that we could pull this logic into the toolkit container instead?
Yes, that is what I was suggesting. We would actually need to incorporate this logic in both the toolkit container and the device plugin, since the configuration of both components depends on whether native CDI is supported.
Are we not also expecting to use a runtime class for cri-o here? Also, why do we only do this if CDI is enabled?
My intent was to retain the existing behavior when CDI is disabled. That is, use the hook for cri-o.
One slightly unrelated comment: this is incompatible with where we want to get to with the NVIDIA Container Toolkit, and we should definitely look at transitioning to a config-based mechanism for CRI-O too.
Is gpuv1.Runtime just a string? Does it make sense to update that type with SupportsCDI and Version members instead of having separate members here?
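For illustration, a minimal sketch of what such a type could look like; this assumes gpuv1.Runtime is currently a plain string, and the SupportsCDI and Version fields are hypothetical additions, not existing API:

```go
// Hypothetical sketch only: illustrates folding the detected capabilities into
// the runtime type instead of passing separate values around.
package v1

// Runtime would replace the current string-backed type; SupportsCDI and Version
// are the suggested extra members.
type Runtime struct {
	Name        string // e.g. "containerd", "crio"
	Version     string // runtime version detected on the node
	SupportsCDI bool   // whether this runtime version natively supports CDI
}

// String keeps the type usable wherever the bare runtime name is expected today.
func (r Runtime) String() string { return r.Name }
```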
Is it possible to query the CRI API and see if CDI is enabled, or do a similar check? That would be a more robust check IMO. It's possible that we may have forks of containerd running (rehashes based on a vanilla containerd with different versioning), and a version check wouldn't yield the expected result in those cases.
I think there was some talk about exposing this for use in DRA too, but it may not officially be part of the CRI.
If I recall correctly, runtimes could send this information in the CRI status: https://github.com/containerd/containerd/blob/6f652853f01ef9ba340a860c2f39edf1701102d1/internal/cri/server/status.go#L34 and https://github.com/cri-o/cri-o/blob/02f3400b358159265d28a37df61be430404925e9/server/runtime_status.go#L15
I would be surprised if the cdi-enabled field "magically" propagates there.
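As a hedged illustration of querying the runtime directly rather than comparing version strings, the sketch below calls the CRI RuntimeService Status RPC with verbose output and searches the returned config blob for CDI settings. The socket path, the "config" info key, and the enableCDI/cdiSpecDirs field names are assumptions to verify against the actual containerd and CRI-O responses:

```go
// Sketch only: ask the container runtime itself (via the CRI Status RPC with
// Verbose=true) whether CDI appears in its reported config.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func runtimeReportsCDI(ctx context.Context, endpoint string) (bool, error) {
	conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return false, err
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	resp, err := client.Status(ctx, &runtimeapi.StatusRequest{Verbose: true})
	if err != nil {
		return false, err
	}
	// With Verbose=true, containerd returns its CRI config as JSON under the
	// "config" key; whether CRI-O does the same (and whether the CDI fields are
	// serialized as below) is an assumption.
	cfg := resp.GetInfo()["config"]
	return strings.Contains(cfg, "enableCDI") || strings.Contains(cfg, "cdiSpecDirs"), nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	ok, err := runtimeReportsCDI(ctx, "unix:///run/containerd/containerd.sock")
	fmt.Println(ok, err)
}
```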
Assuming this information is not available via the CRI, is there any other way we could potentially get this information instead of checking version strings here in the controller?
For containerd, one could do a config dump and check whether enable_cdi and/or the cdi_spec_dirs are visible. This is technically part of the runtime config, so it should actually be reported via the CRI API. Not sure if this is visible in the context of the GPU Operator though.
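A minimal sketch of that config-dump check, assuming the toolkit (or some node-level component) can exec the containerd binary on the host; the TOML keys are the ones mentioned above:

```go
// Sketch only: shell out to `containerd config dump` and look for the CDI keys
// in the rendered config.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// containerdCDIStatus reports whether this containerd knows about CDI at all
// (the keys exist in its config) and whether it is currently enabled.
func containerdCDIStatus() (supported, enabled bool, err error) {
	out, err := exec.Command("containerd", "config", "dump").Output()
	if err != nil {
		return false, false, err
	}
	cfg := string(out)
	supported = strings.Contains(cfg, "enable_cdi") || strings.Contains(cfg, "cdi_spec_dirs")
	enabled = strings.Contains(cfg, "enable_cdi = true")
	return supported, enabled, nil
}

func main() {
	supported, enabled, err := containerdCDIStatus()
	fmt.Println(supported, enabled, err)
}
```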
Ah, I see. What are your thoughts on adding this logic to the toolkit container? That is, the toolkit container would check if CDI is supported in containerd (by doing a config dump), and if supported it would ensure native CDI is used for workloads by NOT configuring nvidia as the default runtime. If CDI is not supported, the toolkit container would fall back to configuring nvidia as the default runtime.
I think that sounds reasonable. Do we just assume native CDI support for CRI-O?

In this mode of operation we would:

If enableCDI is present:
- We add the nvidia container runtime as a non-default runtime so that a runtime class can be created for management containers.
- The device plugin is configured to use the cdi.k8s.io/ prefix for annotations and the CRI field.

If enableCDI is not present:
- We assume an older Containerd version that does not support CDI and we don't enable it.
- We add the nvidia container runtime (in CDI mode) as a default runtime.
- The device plugin is configured to use the nvidia.cdi.k8s.io prefix for annotations.

For cri-o we would basically always follow the "If enableCDI is present" path above (roughly the branching sketched at the end of this comment).

Some questions:
- Should we remove docker from the list of "supported" runtimes in the toolkit container?
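To make that branching concrete, a rough sketch in Go; the struct and function names are illustrative, not existing toolkit-container code:

```go
// Illustrative only: mirrors the two containerd branches described above.
package main

import "fmt"

type runtimeConfig struct {
	SetNvidiaAsDefault bool   // whether nvidia is written as the default runtime
	NvidiaRuntimeMode  string // e.g. "cdi"
	PluginDevicePrefix string // annotation prefix handed to the device plugin
}

func configureForContainerd(enableCDISupported bool) runtimeConfig {
	if enableCDISupported {
		// Native CDI: keep nvidia as a non-default runtime (runtime class for
		// management containers); workloads request devices via cdi.k8s.io/.
		return runtimeConfig{
			SetNvidiaAsDefault: false,
			NvidiaRuntimeMode:  "cdi",
			PluginDevicePrefix: "cdi.k8s.io/",
		}
	}
	// Older Containerd without CDI support: fall back to nvidia (in CDI mode)
	// as the default runtime and the nvidia.cdi.k8s.io annotation prefix.
	return runtimeConfig{
		SetNvidiaAsDefault: true,
		NvidiaRuntimeMode:  "cdi",
		PluginDevicePrefix: "nvidia.cdi.k8s.io",
	}
}

func main() {
	fmt.Printf("%+v\n", configureForContainerd(true))
	fmt.Printf("%+v\n", configureForContainerd(false))
}
```

For cri-o the same function would simply always be called with enableCDISupported set to true.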
From the perspective of the GPU Operator, I am comfortable saying yes, since our next operator release will not support a K8s version that supports docker.
This is what I was envisioning as CRI-O has supported CDI since 1.23.2. Since we are shortening our K8s support matrix to n-3 at the time of release, and CRI-O follows the K8s release cycle with respect to minor versions, I think it is relatively safe to always assume native-CDI support for CRI-O. But maybe I am overlooking something.
I propose triggering this behavior whenever CDI_ENABLED=true is set in the toolkit container.

I am assuming this is with respect to the device list strategy we configure in the plugin. I believe if we push the "native-CDI" detection logic to the toolkit container, we would have to do something similar in the plugin. That is, the behavior of the toolkit container and device plugin when cdi.enabled=true in the operator depends on whether native CDI is supported or not. Either we 1) run the same logic in both the toolkit container and the device plugin to detect if native CDI is supported, or 2) leverage the toolkit-ready status file to communicate information to the device plugin that was collected in the toolkit container (a rough sketch of option 2 is below).

I'll try to write up this proposal in more detail before the weekend.
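Purely to illustrate option 2, a sketch of passing the detected capability through a status file; the file layout and the nativeCDISupported field are hypothetical, not the current toolkit-ready format:

```go
// Hypothetical sketch of option 2: the toolkit container records whether native
// CDI was detected and the device plugin reads it back instead of re-running
// the detection itself. The JSON layout here is invented for illustration.
package status

import (
	"encoding/json"
	"os"
)

type ToolkitStatus struct {
	NativeCDISupported bool `json:"nativeCDISupported"`
}

// Write is called by the toolkit container alongside its readiness marker.
func Write(path string, nativeCDI bool) error {
	data, err := json.Marshal(ToolkitStatus{NativeCDISupported: nativeCDI})
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Read is called by the device plugin before choosing its device-list strategy.
func Read(path string) (ToolkitStatus, error) {
	var s ToolkitStatus
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}
```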