
Use native CDI in container runtimes when supported #1285


Draft, wants to merge 6 commits into base: main

Conversation

@cdesiniotis (Contributor) commented Feb 19, 2025

This PR updates our CDI implementation to leverage native CDI support in container runtimes (e.g. containerd, cri-o) for application containers when possible. As part of this PR, the cdi.default field is made a no-op.

Below are the various scenarios (a short Go sketch of the resulting logic follows the list):

  • cdi.enabled=true
    • Default behavior:
      • The nvidia runtime is configured in cdi mode
      • The nvidia runtime is NOT configured as the default runtime
      • All operands are configured to use the nvidia runtime class
      • The device-plugin uses the standard CDI annotation prefix, cdi.k8s.io/, so that cri-o / containerd inject the CDI devices.
    • Behavior when containerd < 1.7:
      • The nvidia runtime is configured in cdi mode
      • The nvidia runtime is configured as the default runtime
      • All operands are configured to use the nvidia runtime class
      • The device-plugin uses a custom CDI annotation prefix, nvidia.cdi.k8s.io/, so that the nvidia runtime injects the CDI devices.
  • cdi.enabled=false (the behavior described below is not changed with this PR)
    • containerd is the runtime
      • The nvidia runtime is configured in auto mode
      • The nvidia runtime is configured as the default runtime
      • All operands are configured to use the nvidia runtime class
      • The device-plugin's deviceListStrategy is set to envvar
    • cri-o is the runtime
      • No 'nvidia' runtimes are installed.
      • The OCI prestart hook is installed in a standard directory where cri-o looks for hooks.
      • Operands are NOT configured to use the nvidia runtime class
      • The device-plugin's deviceListStrategy is set to envvar
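
For illustration only, a minimal Go sketch of the cdi.enabled=true branch of this matrix (see the scenarios above); the helper name cdiSettings is hypothetical and not part of this PR:

    // cdiSettings summarizes the cdi.enabled=true scenarios: it returns the CDI
    // annotation prefix the device-plugin should use and whether the 'nvidia'
    // runtime must be configured as the default runtime.
    func cdiSettings(runtimeSupportsCDI bool) (annotationPrefix string, nvidiaIsDefault bool) {
        if runtimeSupportsCDI {
            // containerd >= 1.7 or cri-o injects the CDI devices itself, so the
            // standard prefix is used and 'nvidia' stays a non-default runtime.
            return "cdi.k8s.io/", false
        }
        // Older containerd: the 'nvidia' runtime (in cdi mode) performs the
        // injection, so it is made the default runtime and reacts to a
        // vendor-specific annotation prefix.
        return "nvidia.cdi.k8s.io/", true
    }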

if runtime == gpuv1.Containerd {
// default to containerd if >=1 node running containerd
break
if runtime == gpuv1.Containerd && semver.Compare(version, "v1.7.0") < 0 {
Contributor:

Is it possible to query the CRI API and see if CDI is enabled, or do a similar check? That would be a more robust check IMO. It's possible that we may have forks of containerd running (rehashes based on a vanilla containerd with different versioning), and this check wouldn't yield the expected result in those cases.

Member:

I think there was some talk about exposing this for use in DRA too, but it may not officially be part of the CRI.

If I recall correctly, runtimes could send this information in the CRI status: https://github.com/containerd/containerd/blob/6f652853f01ef9ba340a860c2f39edf1701102d1/internal/cri/server/status.go#L34 and https://github.com/cri-o/cri-o/blob/02f3400b358159265d28a37df61be430404925e9/server/runtime_status.go#L15

I would be surprised if the cdi-enabled field "magically" propagates there.

Contributor Author:

Assuming this information is not available via the CRI, is there any other way we could potentially get this information instead of checking version strings here in the controller?

Member:

For containerd, one could do a config dump and check whether enable_cdi and/or cdi_spec_dirs are visible. This is technically part of the runtime config, so it should actually be reported as:

    "enableCDI": false,
    "cdiSpecDirs": [
      "/etc/cdi",
      "/var/run/cdi"
    ],

via the CRI API. I'm not sure whether this is visible in the context of the GPU Operator, though.
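
As a rough, non-authoritative sketch of such a check from Go: this queries the CRI Status endpoint with Verbose set and assumes, as containerd does today, that the verbose Info map carries the runtime config under a "config" key; the socket path is also an assumption (cri-o would use /var/run/crio/crio.sock):

    package main

    import (
        "context"
        "encoding/json"
        "fmt"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
    )

    // criConfig holds only the fields of interest from the runtime's verbose status.
    type criConfig struct {
        EnableCDI   bool     `json:"enableCDI"`
        CDISpecDirs []string `json:"cdiSpecDirs"`
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Verbose=true asks the runtime to include its configuration in the Info map.
        client := runtimeapi.NewRuntimeServiceClient(conn)
        resp, err := client.Status(ctx, &runtimeapi.StatusRequest{Verbose: true})
        if err != nil {
            panic(err)
        }

        var cfg criConfig
        if raw, ok := resp.GetInfo()["config"]; ok {
            _ = json.Unmarshal([]byte(raw), &cfg)
        }
        fmt.Printf("enableCDI=%v cdiSpecDirs=%v\n", cfg.EnableCDI, cfg.CDISpecDirs)
    }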

Contributor Author (@cdesiniotis, Feb 26, 2025):

Ah, I see. What are your thoughts on adding this logic to the toolkit container? That is, the toolkit container would check if CDI is supported in containerd (by doing a config dump), and if supported it would ensure native CDI is used for workloads by NOT configuring nvidia as the default runtime. If CDI is not supported, the toolkit container would fall back to configuring nvidia as the default runtime.
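
A minimal sketch of the detection being proposed here, assuming the toolkit container can run containerd config dump against the host (which may require the host binary and config to be mounted) and that a naive string scan of the TOML output is acceptable; a real implementation would parse the TOML:

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // containerdSupportsCDI reports whether the merged containerd configuration
    // mentions the enable_cdi option, which only exists on containerd >= 1.7.
    func containerdSupportsCDI() (bool, error) {
        out, err := exec.Command("containerd", "config", "dump").CombinedOutput()
        if err != nil {
            return false, fmt.Errorf("failed to dump containerd config: %w", err)
        }
        return strings.Contains(string(out), "enable_cdi"), nil
    }

    func main() {
        ok, err := containerdSupportsCDI()
        if err != nil {
            panic(err)
        }
        if ok {
            fmt.Println("native CDI available: do not set nvidia as the default runtime")
        } else {
            fmt.Println("no native CDI: fall back to nvidia (cdi mode) as the default runtime")
        }
    }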

Member:

I think that sounds reasonable. Do we just assume native CDI support for CRI-O?

In this mode of operation we would:

  1. Read the existing containerd config.
  2. If enableCDI is present:
    1. We know that the container engine supports CDI and it can be enabled.
    2. We add the nvidia container runtime as a non-default runtime so that a runtime class can be created for management containers.
    3. The device plugin is configured to use the cdi.k8s.io/ prefix for annotations and the CRI field.
  3. If enableCDI is not present:
    1. We assume an older containerd version that does not support CDI and do not enable it.
    2. We add the nvidia container runtime (in CDI mode) as the default runtime.
    3. The device plugin is configured to use the nvidia.cdi.k8s.io/ prefix for annotations.

For cri-o we would basically always follow the "If enableCDI is present" path above.

Some questions:

  • Can we remove docker from the list of "supported" runtimes in the toolkit container?
  • How do we plan to trigger this behaviour?
  • How do we know whether to use annotations or CRI?
  • Is there something we're not thinking about for CRI-O?

Contributor Author:

Can we remove docker from the list of "supported" runtimes in the toolkit container?

From the perspective of GPU Operator, I am comfortable saying yes since our next operator release will not support a K8s version that supports docker.

Do we just assume native CDI support for CRI-O?

This is what I was envisioning as CRI-O has supported CDI since 1.23.2. Since we are shortening our K8s support matrix to n-3 at the time of release, and CRI-O follows the K8s release cycle with respect to minor versions, I think it is relatively safe to always assume native-CDI support for CRI-O. But maybe I am overlooking something.

How do we plan to trigger this behavior?

I propose triggering this behavior whenever CDI_ENABLED=true is set in the toolkit container.

How do we know whether to use annotations or CRI?

I am assuming this is with respect to the device list strategy we configure in the plugin. I believe if we push the "native-CDI" detection logic to the toolkit container, we would have to do something similar in the plugin. That is, the behavior of the toolkit container and device-plugin when cdi.enabled=true in the operator is dependent on whether native-CDI is supported or not. Either we 1) run the same logic in both the toolkit container and the device-plugin to detect if native-CDI is supported, or 2) leverage the toolkit-ready status file to communicate information to the device-plugin that was collected in the toolkit container.

I'll try to write up this proposal in more detail before the weekend.
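
Purely to illustrate option 2, a sketch of a shared status file; the file path, package name, and schema below are hypothetical and not part of this PR:

    package toolkitstatus

    import (
        "encoding/json"
        "os"
    )

    // Status is a hypothetical schema for information the toolkit container could
    // record on a shared hostPath volume for the device-plugin to consume.
    type Status struct {
        NativeCDISupported bool `json:"nativeCDISupported"`
    }

    // Write is called by the toolkit container once runtime configuration is done.
    func Write(path string, nativeCDI bool) error {
        data, err := json.Marshal(Status{NativeCDISupported: nativeCDI})
        if err != nil {
            return err
        }
        return os.WriteFile(path, data, 0o644)
    }

    // Read is called by the device-plugin to choose its device-list-strategy.
    func Read(path string) (Status, error) {
        var s Status
        data, err := os.ReadFile(path)
        if err != nil {
            return s, err
        }
        return s, json.Unmarshal(data, &s)
    }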

@@ -180,6 +180,8 @@ const (
// DriverInstallDirCtrPathEnvName is the name of the envvar used by the driver-validator to represent the path
// of the driver install dir mounted in the container
DriverInstallDirCtrPathEnvName = "DRIVER_INSTALL_DIR_CTR_PATH"
// NvidiaRuntimeSetAsDefaultEnvName is the name of the toolkit container env for configuring NVIDIA Container Runtime as the default runtime
NvidiaRuntimeSetAsDefaultEnvName = "NVIDIA_RUNTIME_SET_AS_DEFAULT"
Member:

Question. Out of scope: Should we sort these constants to allow us to find them more easily?

}

setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
Member:

Note: This would mean that native CDI support for containers will not work when using cdi-annotations in the device plugin.

Member:

OK. I see that we set this value based on whether the runtime supports CDI in the device plugin. Does that mean that we need to optionally set that here too?

Note that I think if the runtime supports CDI and it is enabled, we would want to avoid also responding to the annotations in the toolkit.

Contributor Author:

I can update this so that we set this envvar conditionally.

On a first pass, I didn't think it was required to remove this setting when using native CDI support for containers -- since our device-plugin would be using a different annotation prefix, cdi.k8s.io.

Contributor Author:

I have updated this now to conditionally set this envvar only when native CDI is not supported.

}

setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), DeviceListStrategyEnvName, "envvar,cdi-annotations")
Member:

Question: Under which conditions could we drop envvar and / or include cdi-cri here?

Contributor Author:

Good question.

I think we should drop envvar if native CDI is supported.

As for cdi-cri, I think it depends on what versions of k8s we will support moving forward. The CDIDevices field in the CRI was alpha in 1.28 and beta since 1.29. So one could argue that we switch to cdi-cri altogether in this PR and never configure cdi-annotations as the device-list-strategy.

Contributor Author (@cdesiniotis, Feb 20, 2025):

What would the behavior be if we configure both cdi-annotations and cdi-cri? What is the behavior if we attempt to set the CRI field in the allocate response but we are running on k8s < 1.28?

Contributor Author:

I have updated this now to drop envvar and use cdi-cri when native CDI is supported. We fall back to cdi-annotations only when the runtime does not support CDI.
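
To make the two strategies concrete, a hedged sketch of how a device-plugin allocate response differs between cdi-cri and cdi-annotations, using the kubelet device plugin API (the CDIDevices field requires a kubelet with the corresponding feature, k8s 1.28+; the annotation key shown is illustrative):

    package sketch

    import (
        pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
    )

    // allocateResponseFor builds a per-container response for a single CDI device
    // name such as "nvidia.com/gpu=0", switching between the CRI field and the
    // annotation mechanism.
    func allocateResponseFor(cdiDevice string, useCRIField bool) *pluginapi.ContainerAllocateResponse {
        resp := &pluginapi.ContainerAllocateResponse{}
        if useCRIField {
            // cdi-cri: the kubelet forwards this to the runtime via the CRI CDIDevices field.
            resp.CDIDevices = []*pluginapi.CDIDevice{{Name: cdiDevice}}
            return resp
        }
        // cdi-annotations: the runtime (or the nvidia runtime, with the vendor
        // prefix) resolves the device from a pod annotation instead.
        resp.Annotations = map[string]string{
            "cdi.k8s.io/nvidia-device-plugin_gpu0": cdiDevice,
        }
        return resp
    }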

}
podSpec.RuntimeClassName = &runtimeClass
func setRuntimeClass(podSpec *corev1.PodSpec, n ClusterPolicyController, runtimeClass string) {
if !n.singleton.Spec.CDI.IsEnabled() && n.runtime != gpuv1.Containerd {
Member:

Are we not also expecting to use a runtimeclass for cri-o here? Also, why do we only do this if CDI is enabled?

Contributor Author:

My intent was to retain the existing behavior when CDI is disabled. That is, use the hook for cri-o.

Member:

One slightly unrelated comment is that this is incompatible with where we want to get with the NVIDIA Container Toolkit, and we should definitely look at transitioning to a config-based mechanism for CRI-O too.
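
For readers following the excerpt above, a sketch of the condition under discussion with the early return made explicit (the exact shape is inferred from the diff; corev1, gpuv1, and ClusterPolicyController come from this repository):

    func setRuntimeClass(podSpec *corev1.PodSpec, n ClusterPolicyController, runtimeClass string) {
        // CDI disabled on a non-containerd runtime (i.e. cri-o): operands keep
        // relying on the OCI prestart hook, so no runtime class is attached.
        if !n.singleton.Spec.CDI.IsEnabled() && n.runtime != gpuv1.Containerd {
            return
        }
        // In every other case the operands use the 'nvidia' runtime class for GPU access.
        podSpec.RuntimeClassName = &runtimeClass
    }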

Comment on lines +164 to +165
runtime gpuv1.Runtime
runtimeSupportsCDI bool
Member:

Is gpuv1.Runtime just a string? Does it make sense to update that type with SupportsCDI and Version members instead of having separate members here?
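
A hypothetical sketch of the suggestion, assuming golang.org/x/mod/semver and the repository's gpuv1 package are imported; the runtimeInfo name and its method are illustrative, not existing API:

    // runtimeInfo bundles what the controller learns about the cluster's container
    // runtime instead of keeping separate fields on ClusterPolicyController.
    type runtimeInfo struct {
        Name    gpuv1.Runtime // e.g. gpuv1.Containerd
        Version string        // normalized, e.g. "v1.7.0"
    }

    // SupportsCDI reports whether the runtime can natively inject CDI devices.
    func (r runtimeInfo) SupportsCDI() bool {
        if r.Name == gpuv1.Containerd {
            return semver.Compare(r.Version, "v1.7.0") >= 0
        }
        // cri-o is assumed to support CDI for all versions in the support matrix.
        return true
    }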

}
return runtime, nil
version := strings.SplitAfter(runtimeVer, "//")[1]
vVersion := strings.Join([]string{"v", strings.TrimPrefix(version, "v")}, "")
Member:

I think the following is easier to read?

Suggested change
vVersion := strings.Join([]string{"v", strings.TrimPrefix(version, "v")}, "")
vVersion := "v" + strings.TrimPrefix(version, "v")

Contributor Author:

Agree. Updated as you suggested.
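
For context, a small self-contained sketch of the version handling discussed in this thread, starting from the node's reported ContainerRuntimeVersion string (e.g. "containerd://1.6.33"); the semver package requires the "v" prefix, hence the normalization:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/mod/semver"
    )

    // runtimeSupportsCDI parses a CRI runtime version string such as
    // "containerd://1.6.33" and reports whether native CDI can be assumed.
    func runtimeSupportsCDI(containerRuntimeVersion string) bool {
        parts := strings.SplitN(containerRuntimeVersion, "://", 2)
        if len(parts) != 2 || parts[0] != "containerd" {
            // Non-containerd runtimes (cri-o) are assumed to support CDI here.
            return true
        }
        vVersion := "v" + strings.TrimPrefix(parts[1], "v")
        return semver.Compare(vVersion, "v1.7.0") >= 0
    }

    func main() {
        fmt.Println(runtimeSupportsCDI("containerd://1.6.33")) // false
        fmt.Println(runtimeSupportsCDI("containerd://1.7.22")) // true
        fmt.Println(runtimeSupportsCDI("cri-o://1.30.4"))      // true
    }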

This commit adds a 'runtimeSupportsCDI' field to the ClusterPolicyController struct.
When fetching the container runtime version strings from worker nodes, the 'runtimeSupportsCDI'
field is set accordingly. If any worker node is running containerd < 1.7.0, then we
set runtimeSupportsCDI=false.

As part of this commit, we now return an error if there are different container runtimes
running on different worker nodes.

Signed-off-by: Christopher Desiniotis <[email protected]>
…nabled=true

This commit updates the default behavior when cdi.enabled=true. If the container
runtime (containerd, cri-o) supports CDI, we leverage it to inject GPU devices
into workload containers. This means we no longer configure 'nvidia' as the
default runtime.

Signed-off-by: Christopher Desiniotis <[email protected]>
… runtime

When CDI is disabled and cri-o is the container runtime, we fall back to
installing the prestart OCI hook. Our operands will depend on the prestart
hook for getting access to GPUs.

In all other scenarios, we want our operands to leverage the 'nvidia'
runtime class to get access to GPUs.

Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Comment on lines +1209 to +1219
if !n.runtimeSupportsCDI {
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
}

// When the container runtime supports CDI, we do not configure 'nvidia' as the default runtime.
// Instead, we leverage native CDI support in containerd / cri-o to inject GPUs into workloads.
// The 'nvidia' runtime will be set as the runtime class for our management containers so that they
// get access to all GPUs.
if n.runtimeSupportsCDI {
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaRuntimeSetAsDefaultEnvName, "false")
}
Member:

So we're saying that we could pull this logic into the toolkit container instead?

Contributor Author:

Yes, that is what I was suggesting. We would need to incorporate this logic in both the toolkit container and device-plugin actually since the configuration of both components is dependent on whether native-CDI is supported.
