Use native CDI in container runtimes when supported #1285
base: main
Conversation
if runtime == gpuv1.Containerd {
    // default to containerd if >=1 node running containerd
    break
if runtime == gpuv1.Containerd && semver.Compare(version, "v1.7.0") < 0 {
Is it possible to query the CRI API and see if CDI is enabled, or do a similar check? That would be a more robust check IMO. It's possible that we may have forks of containerd running (rehashes based on a vanilla containerd with different versioning). This check wouldn't yield the expected result in those cases.
I think there was some talk about exposing this for use in DRA too, but it may not officially be part of the CRI.
If I recall correctly, runtimes could send this information in the CRI status: https://github.com/containerd/containerd/blob/6f652853f01ef9ba340a860c2f39edf1701102d1/internal/cri/server/status.go#L34 and https://github.com/cri-o/cri-o/blob/02f3400b358159265d28a37df61be430404925e9/server/runtime_status.go#L15
I would be surprised if the cdi-enabled field "magically" propagates there.
Assuming this information is not available via the CRI, is there any other way we could potentially get this information instead of checking version strings here in the controller?
For containerd, one could do a config dump and check whether enable_cdi and/or cdi_spec_dirs are visible. This is technically part of the runtime config, so it should actually be reported as:

"enableCDI": false,
"cdiSpecDirs": [
    "/etc/cdi",
    "/var/run/cdi"
],

via the CRI API. Not sure if this is visible in the context of the GPU Operator though.
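To make that idea concrete, here is a rough sketch (not part of this PR) of how a component with access to the runtime's CRI socket could request the verbose status and look for the enableCDI field in the reported config. The socket path, timeout, and the assumption that the config dump lands under the "config" key of the Info map are illustrative.

// Rough sketch only: probe the CRI Status endpoint (verbose) for CDI-related
// config. Socket path and the "config"/"enableCDI" keys are assumptions.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func runtimeReportsCDI(socket string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return false, err
	}
	defer conn.Close()

	// Verbose=true asks the runtime to include extra info (the same data that
	// `crictl info` prints), which includes the runtime's config dump.
	resp, err := runtimeapi.NewRuntimeServiceClient(conn).Status(ctx,
		&runtimeapi.StatusRequest{Verbose: true})
	if err != nil {
		return false, err
	}

	var cfg map[string]interface{}
	if err := json.Unmarshal([]byte(resp.GetInfo()["config"]), &cfg); err != nil {
		return false, err
	}
	// Older runtimes simply won't report an enableCDI field at all.
	_, ok := cfg["enableCDI"]
	return ok, nil
}

func main() {
	ok, err := runtimeReportsCDI("unix:///run/containerd/containerd.sock")
	fmt.Println(ok, err)
}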
Ah, I see. What are your thoughts on adding this logic to the toolkit container? That is, the toolkit container would check if CDI is supported in containerd (by doing a config dump), and if supported it would ensure native CDI is used for workloads by NOT configuring nvidia as the default runtime. If CDI is not supported, the toolkit container would fall back to configuring nvidia as the default runtime.
I think that sounds reasonable. Do we just assume native CDI support for CRI-O?

In this mode of operation we would:

- Read the existing containerd config (a minimal sketch of this check follows after the list).
- If enableCDI is present:
  - we know that the container engine supports CDI and it can be enabled.
  - we add the nvidia container runtime as a non-default runtime so that a runtime class can be created for management containers.
  - the device plugin is configured to use the cdi.k8s.io/ prefix for annotations and the CRI field.
- If enableCDI is not present:
  - we assume an older Containerd version that does not support CDI and we don't enable it.
  - we add the nvidia container runtime (in CDI mode) as a default runtime.
  - the device plugin is configured to use the nvidia.cdi.k8s.io prefix for annotations.

For cri-o we would basically always follow the "If enableCDI is present" path above.
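A minimal sketch of what the "read the existing containerd config" check could look like, assuming the complete config (e.g. the output of containerd config dump) is available at a mounted path. The path, the TOML library, and the function name are placeholders, not the actual toolkit implementation.

// Minimal sketch: decide CDI support by checking whether the CRI plugin
// section of a (dumped) containerd config carries an enable_cdi setting.
package main

import (
	"fmt"
	"os"

	"github.com/pelletier/go-toml/v2"
)

func containerdSupportsCDI(configPath string) (bool, error) {
	data, err := os.ReadFile(configPath)
	if err != nil {
		return false, err
	}

	var cfg map[string]interface{}
	if err := toml.Unmarshal(data, &cfg); err != nil {
		return false, err
	}

	plugins, ok := cfg["plugins"].(map[string]interface{})
	if !ok {
		return false, nil
	}
	cri, ok := plugins["io.containerd.grpc.v1.cri"].(map[string]interface{})
	if !ok {
		return false, nil
	}
	// The key only appears in the dumped config of containerd versions that
	// understand CDI; its absence means we should assume no native support.
	_, hasEnableCDI := cri["enable_cdi"]
	return hasEnableCDI, nil
}

func main() {
	// "/runtime/config-dir/config.toml" is an assumed mount path for illustration.
	supported, err := containerdSupportsCDI("/runtime/config-dir/config.toml")
	fmt.Println(supported, err)
}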
Some questions:
- Can we remove docker from the list of "supported" runtimes in the toolkit container?
- How do we plan to trigger this behaviour?
- How do we know whether to use annotations or CRI?
- Is there something we're not thinking about for CRI-O?
Can we remove docker from the list of "supported" runtimes in the toolkit container?
From the perspective of GPU Operator, I am comfortable saying yes since our next operator release will not support a K8s version that supports docker.
Do we just assume native CDI support for CRI-O?
This is what I was envisioning as CRI-O has supported CDI since 1.23.2. Since we are shortening our K8s support matrix to n-3 at the time of release, and CRI-O follows the K8s release cycle with respect to minor versions, I think it is relatively safe to always assume native-CDI support for CRI-O. But maybe I am overlooking something.
How do we plan to trigger this behavior?
I propose triggering this behavior whenever CDI_ENABLED=true is set in the toolkit container.

How do we know whether to use annotations or CRI?

I am assuming this is with respect to the device list strategy we configure in the plugin. I believe if we push the "native-CDI" detection logic to the toolkit container, we would have to do something similar in the plugin. That is, the behavior of the toolkit container and device-plugin when cdi.enabled=true is set in the operator depends on whether native CDI is supported or not. Either we 1) run the same logic in both the toolkit container and the device-plugin to detect if native CDI is supported, or 2) leverage the toolkit-ready status file to communicate information to the device-plugin that was collected in the toolkit container.

I'll try to write up this proposal in more detail before the weekend.
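Purely for illustration, a sketch of option 2 above (communicating the detection result via the toolkit's status file). The file path and JSON layout here are made up and are not the actual toolkit-ready format.

// Hypothetical sketch: the toolkit container records its CDI-support detection
// result, and the device plugin reads it back. Path and schema are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type toolkitStatus struct {
	NativeCDISupported bool `json:"nativeCDISupported"`
}

const statusPath = "/run/nvidia/toolkit/.toolkit-ready" // assumed location

func writeStatus(nativeCDI bool) error {
	data, err := json.Marshal(toolkitStatus{NativeCDISupported: nativeCDI})
	if err != nil {
		return err
	}
	return os.WriteFile(statusPath, data, 0o644)
}

func readStatus() (bool, error) {
	data, err := os.ReadFile(statusPath)
	if err != nil {
		return false, err
	}
	var s toolkitStatus
	if err := json.Unmarshal(data, &s); err != nil {
		return false, err
	}
	return s.NativeCDISupported, nil
}

func main() {
	_ = writeStatus(true)
	nativeCDI, _ := readStatus()
	fmt.Println(nativeCDI)
}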
@@ -180,6 +180,8 @@ const (
    // DriverInstallDirCtrPathEnvName is the name of the envvar used by the driver-validator to represent the path
    // of the driver install dir mounted in the container
    DriverInstallDirCtrPathEnvName = "DRIVER_INSTALL_DIR_CTR_PATH"
    // NvidiaRuntimeSetAsDefaultEnvName is the name of the toolkit container env for configuring NVIDIA Container Runtime as the default runtime
    NvidiaRuntimeSetAsDefaultEnvName = "NVIDIA_RUNTIME_SET_AS_DEFAULT"
Question. Out of scope: Should we sort these constants to allow us to find them more easily?
controllers/object_controls.go
Outdated
}

setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
Note: this would mean that native CDI support for containers will not work when using cdi-annotations in the device plugin.
OK. I see that we set this value based on whether the runtime supports CDI in the device plugin. Does that mean that we need to optionally set that here too?
Note that I think if the runtime supports CDI and it is enabled, we would want to avoid also responding to the annotations in the toolkit.
I can update this so that we set this envvar conditionally.
On a first pass, I didn't think it was required to remove this setting when using native CDI support for containers -- since our device-plugin would be using a different annotation prefix, cdi.k8s.io.
I have updated this now to conditionally set this envvar only when native CDI is not supported.
controllers/object_controls.go
Outdated
}

setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), DeviceListStrategyEnvName, "envvar,cdi-annotations")
Question: under which conditions could we drop envvar and/or include cdi-cri here?
Good question.
I think we should drop envvar if native CDI is supported.
As for cdi-cri, I think it depends on what versions of k8s we will support moving forward. The CDIDevices field in the CRI was alpha in 1.28 and beta since 1.29. So one could argue that we switch to cdi-cri altogether in this PR and never configure cdi-annotations as the device-list-strategy.
What would the behavior be if we configure both cdi-annotations and cdi-cri? What is the behavior if we attempt to set the CRI field in the allocate response but we are running on k8s < 1.28?
I have updated this now to drop envvar and use cdi-cri when native CDI is supported. We fall back to cdi-annotations only when runtimes do not support CDI.
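To illustrate the difference between the strategies discussed here, a hypothetical device-plugin-side sketch using the kubelet device plugin API. The device name and annotation key suffix are made up for illustration; the real plugin generates its own keys.

// Sketch of the cdi-cri vs cdi-annotations strategies from the device plugin's
// perspective. Names are illustrative only.
package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// allocateResponse builds a ContainerAllocateResponse for a single CDI device,
// either via the CRI CDIDevices field (cdi-cri) or via annotations with the
// given prefix (cdi-annotations).
func allocateResponse(cdiDevice string, useCRIField bool, annotationPrefix string) *pluginapi.ContainerAllocateResponse {
	resp := &pluginapi.ContainerAllocateResponse{}
	if useCRIField {
		// cdi-cri: requires a Kubernetes version with the CDIDevices CRI field
		// (alpha in 1.28, beta in 1.29, per the discussion above).
		resp.CDIDevices = []*pluginapi.CDIDevice{{Name: cdiDevice}}
		return resp
	}
	// cdi-annotations: the component matching the prefix (cri-o/containerd for
	// cdi.k8s.io/, the nvidia runtime for nvidia.cdi.k8s.io/) resolves the
	// annotation and injects the device.
	resp.Annotations = map[string]string{
		annotationPrefix + "devices": cdiDevice, // key suffix is illustrative
	}
	return resp
}

func main() {
	fmt.Println(allocateResponse("nvidia.com/gpu=0", true, "cdi.k8s.io/"))
}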
}
podSpec.RuntimeClassName = &runtimeClass
func setRuntimeClass(podSpec *corev1.PodSpec, n ClusterPolicyController, runtimeClass string) {
if !n.singleton.Spec.CDI.IsEnabled() && n.runtime != gpuv1.Containerd {
Are we not also expecting to use a runtimeclass for cri-o here? Also, why do we only do this if CDI is enabled?
My intent was to retain the existing behavior when CDI is disabled. That is, use the hook for cri-o.
One slightly unrelated comment is that this is incompatible with where we want to go with the NVIDIA Container Toolkit, and we should definitely look at transitioning to a config-based mechanism for CRI-O too.
runtime gpuv1.Runtime
runtimeSupportsCDI bool
Is gpuv1.Runtime just a string? Does it make sense to update that type with SupportsCDI and Version members instead of having separate members here?
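As a sketch of what that could look like (names are illustrative and not the actual gpuv1 API, and the containerd >= 1.7.0 threshold mirrors the check in this PR):

// Hypothetical shape for the suggestion above: bundle the runtime name,
// version, and CDI capability in one type instead of separate controller fields.
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

// Runtime mirrors the string-based runtime type used by the operator.
type Runtime string

const (
	Containerd Runtime = "containerd"
	CRIO       Runtime = "crio"
)

// RuntimeInfo bundles the per-cluster runtime facts the controller needs.
type RuntimeInfo struct {
	Name    Runtime
	Version string // normalized with a leading "v", e.g. "v1.7.2"
}

// SupportsCDI reports whether native CDI can be assumed for this runtime.
func (r RuntimeInfo) SupportsCDI() bool {
	switch r.Name {
	case CRIO:
		return true
	case Containerd:
		return semver.Compare(r.Version, "v1.7.0") >= 0
	default:
		return false
	}
}

func main() {
	rt := RuntimeInfo{Name: Containerd, Version: "v1.7.2"}
	fmt.Println(rt.SupportsCDI())
}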
controllers/state_manager.go
Outdated
}
return runtime, nil
version := strings.SplitAfter(runtimeVer, "//")[1]
vVersion := strings.Join([]string{"v", strings.TrimPrefix(version, "v")}, "")
I think the following is easier to read?
vVersion := strings.Join([]string{"v", strings.TrimPrefix(version, "v")}, "")
vVersion := "v" + strings.TrimPrefix(version, "v")
Agree. Updated as you suggested.
This commit adds a 'runtimeSupportsCDI' field to the ClusterPolicyController struct. When fetching the container runtime version strings from worker nodes, the 'runtimeSupportsCDI' field is set accordingly. If any worker node is running containerd < 1.7.0, then we set runtimeSupportsCDI=false. As part of this commit, we now return an error if there are different container runtimes running on different worker nodes. Signed-off-by: Christopher Desiniotis <[email protected]>
…nabled=true This commit updates the default behavior when cdi.enabled=true. If the container runtime (containerd, cri-o) supports CDI, we leverage it to inject GPU devices into workload containers. This means we no longer configure 'nvidia' as the default runtime. Signed-off-by: Christopher Desiniotis <[email protected]>
… runtime When CDI is disabled and cri-o is the container runtime, we fallback to installing the prestart OCI hook. Our operands will depend on the prestart hook for getting access to GPUs. In all other scenarios, we want our operands to leverage the 'nvidia' runtime class to get access to GPUs. Signed-off-by: Christopher Desiniotis <[email protected]>
if !n.runtimeSupportsCDI {
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
}

// When the container runtime supports CDI, we do not configure 'nvidia' as the default runtime.
// Instead, we leverage native CDI support in containerd / cri-o to inject GPUs into workloads.
// The 'nvidia' runtime will be set as the runtime class for our management containers so that they
// get access to all GPUs.
if n.runtimeSupportsCDI {
    setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaRuntimeSetAsDefaultEnvName, "false")
}
So we're saying that we could pull this logic into the toolkit container instead?
Yes, that is what I was suggesting. We would actually need to incorporate this logic in both the toolkit container and the device-plugin, since the configuration of both components depends on whether native CDI is supported.
This PR updates our CDI implementation to leverage native CDI support in container runtimes, e.g. containerd / cri-o, when possible for application containers. As part of this PR, the cdi.default field is made a no-op.

Below are the various scenarios:

cdi.enabled=true

When the container runtime supports CDI:
- the nvidia runtime is configured in cdi mode
- the nvidia runtime is NOT configured as the default runtime
- management containers use the nvidia runtime class
- the cdi.k8s.io/ prefix is used, so that cri-o / containerd inject the CDI devices.

When the container runtime does not support CDI:
- the nvidia runtime is configured in cdi mode
- the nvidia runtime is configured as the default runtime
- management containers use the nvidia runtime class
- the nvidia.cdi.k8s.io/ prefix is used, so that the nvidia runtime injects the CDI devices.

cdi.enabled=false (the behavior described below is not changed with this PR)
- the nvidia runtime is configured in auto mode
- the nvidia runtime is configured as the default runtime
- management containers use the nvidia runtime class
- deviceListStrategy is set to envvar