
Use native CDI in container runtimes when supported #1285


Draft, wants to merge 6 commits into base: main

Conversation

@cdesiniotis (Contributor) commented Feb 19, 2025

This PR updates our CDI implementation to leverage native CDI support in container runtimes (e.g. containerd, cri-o) for application containers when possible. As part of this PR, the cdi.default field is made a no-op.

Below are the various scenarios (a short Go sketch of the resulting logic follows the list):

  • cdi.enabled=true
    • Default behavior:
      • The nvidia runtime is configured in cdi mode
      • The nvidia runtime is NOT configured as the default runtime
      • All operands are configured to use the nvidia runtime class
      • The device-plugin uses the standard CDI annotation prefix, cdi.k8s.io/, so that cri-o / containerd inject the CDI devices.
    • Behavior when containerd < 1.7:
      • The nvidia runtime is configured in cdi mode
      • The nvidia runtime is configured as the default runtime
      • All operands are configured to use the nvidia runtime class
      • The device-plugin uses a custom CDI annotation prefix, nvidia.cdi.k8s.io/, so that the nvidia runtime injects the CDI devices.
  • cdi.enabled=false (the behavior described below is not changed with this PR)
    • containerd is the runtime
      • The nvidia runtime is configured in auto mode
      • The nvidia runtime is configured as the default runtime
      • All operands are configured to use the nvidia runtime class
      • The device-plugin's deviceListStrategy is set to envvar
    • cri-o is the runtime
      • No 'nvidia' runtimes are installed.
      • The OCI prestart hook is installed in a standard directory where cri-o looks for hooks.
      • Operands are NOT configured to use the nvidia runtime class
      • The device-plugin's deviceListStrategy is set to envvar
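
For illustration only, a minimal Go sketch of the cdi.enabled=true branch of this matrix (see the scenarios above); the helper name cdiSettings is hypothetical and not part of this PR:

    // cdiSettings summarizes the cdi.enabled=true scenarios: it returns the CDI
    // annotation prefix the device-plugin should use and whether the 'nvidia'
    // runtime must be configured as the default runtime.
    func cdiSettings(runtimeSupportsCDI bool) (annotationPrefix string, nvidiaIsDefault bool) {
        if runtimeSupportsCDI {
            // containerd >= 1.7 or cri-o injects the CDI devices itself, so the
            // standard prefix is used and 'nvidia' stays a non-default runtime.
            return "cdi.k8s.io/", false
        }
        // Older containerd: the 'nvidia' runtime (in cdi mode) performs the
        // injection, so it is made the default runtime and reacts to a
        // vendor-specific annotation prefix.
        return "nvidia.cdi.k8s.io/", true
    }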

if runtime == gpuv1.Containerd {
// default to containerd if >=1 node running containerd
break
if runtime == gpuv1.Containerd && semver.Compare(version, "v1.7.0") < 0 {
Contributor:

Is it possible to query the CRI API and see if CDI is enabled, or do a similar check? That would be a more robust check IMO. It's possible that we may have forks of containerd running (rehashes based on a vanilla containerd with different versioning), and this check wouldn't yield the expected result in those cases.

Member:

I think there was some talk about exposing this for use in DRA too, but it may not officially be part of the CRI.

If I recall correctly, runtimes could send this information in the CRI status: https://github.com/containerd/containerd/blob/6f652853f01ef9ba340a860c2f39edf1701102d1/internal/cri/server/status.go#L34 and https://github.com/cri-o/cri-o/blob/02f3400b358159265d28a37df61be430404925e9/server/runtime_status.go#L15

I would be surprised if the cdi-enabled field "magically" propagates there.

Contributor Author:

Assuming this information is not available via the CRI, is there any other way we could potentially get this information instead of checking version strings here in the controller?

Member:

For containerd, one could do a config dump and check whether enable_cdi and/or cdi_spec_dirs are visible. This is technically part of the runtime config, so it should actually be reported as:

    "enableCDI": false,
    "cdiSpecDirs": [
      "/etc/cdi",
      "/var/run/cdi"
    ],

via the CRI API. I'm not sure whether this is visible in the context of the GPU Operator, though.
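
As a rough, non-authoritative sketch of such a check from Go: this queries the CRI Status endpoint with Verbose set and assumes, as containerd does today, that the verbose Info map carries the runtime config under a "config" key; the socket path is also an assumption (cri-o would use /var/run/crio/crio.sock):

    package main

    import (
        "context"
        "encoding/json"
        "fmt"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
    )

    // criConfig holds only the fields of interest from the runtime's verbose status.
    type criConfig struct {
        EnableCDI   bool     `json:"enableCDI"`
        CDISpecDirs []string `json:"cdiSpecDirs"`
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Verbose=true asks the runtime to include its configuration in the Info map.
        client := runtimeapi.NewRuntimeServiceClient(conn)
        resp, err := client.Status(ctx, &runtimeapi.StatusRequest{Verbose: true})
        if err != nil {
            panic(err)
        }

        var cfg criConfig
        if raw, ok := resp.GetInfo()["config"]; ok {
            _ = json.Unmarshal([]byte(raw), &cfg)
        }
        fmt.Printf("enableCDI=%v cdiSpecDirs=%v\n", cfg.EnableCDI, cfg.CDISpecDirs)
    }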

Contributor Author (@cdesiniotis, Feb 26, 2025):

Ah, I see. What are your thoughts on adding this logic to the toolkit container? That is, the toolkit container would check if CDI is supported in containerd (by doing a config dump), and if supported it would ensure native CDI is used for workloads by NOT configuring nvidia as the default runtime. If CDI is not supported, the toolkit container would fall back to configuring nvidia as the default runtime.
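
A minimal sketch of the detection being proposed here, assuming the toolkit container can run containerd config dump against the host (which may require the host binary and config to be mounted) and that a naive string scan of the TOML output is acceptable; a real implementation would parse the TOML:

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // containerdSupportsCDI reports whether the merged containerd configuration
    // mentions the enable_cdi option, which only exists on containerd >= 1.7.
    func containerdSupportsCDI() (bool, error) {
        out, err := exec.Command("containerd", "config", "dump").CombinedOutput()
        if err != nil {
            return false, fmt.Errorf("failed to dump containerd config: %w", err)
        }
        return strings.Contains(string(out), "enable_cdi"), nil
    }

    func main() {
        ok, err := containerdSupportsCDI()
        if err != nil {
            panic(err)
        }
        if ok {
            fmt.Println("native CDI available: do not set nvidia as the default runtime")
        } else {
            fmt.Println("no native CDI: fall back to nvidia (cdi mode) as the default runtime")
        }
    }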

Member:

I think that sounds reasonable. Do we just assume native CDI support for CRI-O?

In this mode of operation we would:

  1. Read the existing containerd config.
  2. If enableCDI is present:
    1. We know that the container engine supports CDI and it can be enabled.
    2. We add the nvidia container runtime as a non-default runtime so that a runtime class can be created for management containers.
    3. The device plugin is configured to use the cdi.k8s.io/ prefix for annotations and the CRI field.
  3. If enableCDI is not present:
    1. We assume an older containerd version that does not support CDI and do not enable it.
    2. We add the nvidia container runtime (in CDI mode) as the default runtime.
    3. The device plugin is configured to use the nvidia.cdi.k8s.io/ prefix for annotations.

For cri-o we would basically always follow the "If enableCDI is present" path above.

Some questions:

  • Can we remove docker from the list of "supported" runtimes in the toolkit container?
  • How do we plan to trigger this behaviour?
  • How do we know whether to use annotations or CRI?
  • Is there something we're not thinking about for CRI-O?

Contributor Author:

Can we remove docker from the list of "supported" runtimes in the toolkit container?

From the perspective of GPU Operator, I am comfortable saying yes since our next operator release will not support a K8s version that supports docker.

Do we just assume native CDI support for CRI-O?

This is what I was envisioning as CRI-O has supported CDI since 1.23.2. Since we are shortening our K8s support matrix to n-3 at the time of release, and CRI-O follows the K8s release cycle with respect to minor versions, I think it is relatively safe to always assume native-CDI support for CRI-O. But maybe I am overlooking something.

How do we plan to trigger this behavior?

I propose triggering this behavior whenever CDI_ENABLED=true is set in the toolkit container.

How do we know whether to use annotations or CRI?

I am assuming this is with respect to the device list strategy we configure in the plugin. I believe if we push the "native-CDI" detection logic to the toolkit container, we would have to do something similar in the plugin. That is, the behavior of the toolkit container and device-plugin when cdi.enabled=true in the operator is dependent on whether native-CDI is supported or not. Either we 1) run the same logic in both the toolkit container and the device-plugin to detect if native-CDI is supported, or 2) leverage the toolkit-ready status file to communicate information to the device-plugin that was collected in the toolkit container.

I'll try to write up this proposal in more detail before the weekend.
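
Purely to illustrate option 2, a sketch of a shared status file; the file path, package name, and schema below are hypothetical and not part of this PR:

    package toolkitstatus

    import (
        "encoding/json"
        "os"
    )

    // Status is a hypothetical schema for information the toolkit container could
    // record on a shared hostPath volume for the device-plugin to consume.
    type Status struct {
        NativeCDISupported bool `json:"nativeCDISupported"`
    }

    // Write is called by the toolkit container once runtime configuration is done.
    func Write(path string, nativeCDI bool) error {
        data, err := json.Marshal(Status{NativeCDISupported: nativeCDI})
        if err != nil {
            return err
        }
        return os.WriteFile(path, data, 0o644)
    }

    // Read is called by the device-plugin to choose its device-list-strategy.
    func Read(path string) (Status, error) {
        var s Status
        data, err := os.ReadFile(path)
        if err != nil {
            return s, err
        }
        return s, json.Unmarshal(data, &s)
    }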

@@ -180,6 +180,8 @@ const (
// DriverInstallDirCtrPathEnvName is the name of the envvar used by the driver-validator to represent the path
// of the driver install dir mounted in the container
DriverInstallDirCtrPathEnvName = "DRIVER_INSTALL_DIR_CTR_PATH"
// NvidiaRuntimeSetAsDefaultEnvName is the name of the toolkit container env for configuring NVIDIA Container Runtime as the default runtime
NvidiaRuntimeSetAsDefaultEnvName = "NVIDIA_RUNTIME_SET_AS_DEFAULT"
Member:

Question. Out of scope: Should we sort these constants to allow us to find them more easily?

}

setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
Member:

Note: This would mean that native CDI support for containers will not work when using cdi-annotations in the device plugin.

Member:

OK. I see that we set this value based on whether the runtime supports CDI in the device plugin. Does that mean that we need to optionally set that here too?

Note that I think if the runtime supports CDI and it is enabled, we would want to avoid also responding to the annotations in the toolkit.

Contributor Author:

I can update this so that we set this envvar conditionally.

On a first pass, I didn't think it was required to remove this setting when using native CDI support for containers -- since our device-plugin would be using a different annotation prefix, cdi.k8s.io.

Contributor Author:

I have updated this now to conditionally set this envvar only when native CDI is not supported.

}

setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), CDIEnabledEnvName, "true")
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), DeviceListStrategyEnvName, "envvar,cdi-annotations")
Member:

Question: Under which conditions could we drop envvar and / or include cdi-cri here?

Contributor Author:

Good question.

I think we should drop envvar if native CDI is supported.

As for cdi-cri, I think it depends on what versions of k8s we will support moving forward. The CDIDevices field in the CRI was alpha in 1.28 and beta since 1.29. So one could argue that we switch to cdi-cri altogether in this PR and never configure cdi-annotations as the device-list-strategy.

Contributor Author (@cdesiniotis, Feb 20, 2025):

What would the behavior be if we configure both cdi-annotations and cdi-cri? What is the behavior if we attempt to set the CRI field in the allocate response but we are running on k8s < 1.28?

Contributor Author:

I have updated this now to drop envvar and use cdi-cri when native CDI is supported. We fall back to cdi-annotations only when the runtime does not support CDI.
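
To make the two strategies concrete, a hedged sketch of how a device-plugin allocate response differs between cdi-cri and cdi-annotations, using the kubelet device plugin API (the CDIDevices field requires a kubelet with the corresponding feature, k8s 1.28+; the annotation key shown is illustrative):

    package sketch

    import (
        pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
    )

    // allocateResponseFor builds a per-container response for a single CDI device
    // name such as "nvidia.com/gpu=0", switching between the CRI field and the
    // annotation mechanism.
    func allocateResponseFor(cdiDevice string, useCRIField bool) *pluginapi.ContainerAllocateResponse {
        resp := &pluginapi.ContainerAllocateResponse{}
        if useCRIField {
            // cdi-cri: the kubelet forwards this to the runtime via the CRI CDIDevices field.
            resp.CDIDevices = []*pluginapi.CDIDevice{{Name: cdiDevice}}
            return resp
        }
        // cdi-annotations: the runtime (or the nvidia runtime, with the vendor
        // prefix) resolves the device from a pod annotation instead.
        resp.Annotations = map[string]string{
            "cdi.k8s.io/nvidia-device-plugin_gpu0": cdiDevice,
        }
        return resp
    }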

}
podSpec.RuntimeClassName = &runtimeClass
func setRuntimeClass(podSpec *corev1.PodSpec, n ClusterPolicyController, runtimeClass string) {
if !n.singleton.Spec.CDI.IsEnabled() && n.runtime != gpuv1.Containerd {
Member:

Are we not also expecting to use a runtimeclass for cri-o here? Also, why do we only do this if CDI is enabled?

Contributor Author:

My intent was to retain the existing behavior when CDI is disabled. That is, use the hook for cri-o.

Member:

One slightly unrelated comment is that this is incompatible with where we want to get with the NVIDIA Container Toolkit, and we should definitely look at transitioning to a config-based mechanism for CRI-O too.
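
For readers following the excerpt above, a sketch of the condition under discussion with the early return made explicit (the exact shape is inferred from the diff; corev1, gpuv1, and ClusterPolicyController come from this repository):

    func setRuntimeClass(podSpec *corev1.PodSpec, n ClusterPolicyController, runtimeClass string) {
        // CDI disabled on a non-containerd runtime (i.e. cri-o): operands keep
        // relying on the OCI prestart hook, so no runtime class is attached.
        if !n.singleton.Spec.CDI.IsEnabled() && n.runtime != gpuv1.Containerd {
            return
        }
        // In every other case the operands use the 'nvidia' runtime class for GPU access.
        podSpec.RuntimeClassName = &runtimeClass
    }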

Comment on lines +164 to +165
runtime gpuv1.Runtime
runtimeSupportsCDI bool
Member:

Is gpuv1.Runtime just a string? Does it make sense to update that type with SupportsCDI and Version members instead of having separate members here?
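
A hypothetical sketch of the suggestion, assuming golang.org/x/mod/semver and the repository's gpuv1 package are imported; the runtimeInfo name and its method are illustrative, not existing API:

    // runtimeInfo bundles what the controller learns about the cluster's container
    // runtime instead of keeping separate fields on ClusterPolicyController.
    type runtimeInfo struct {
        Name    gpuv1.Runtime // e.g. gpuv1.Containerd
        Version string        // normalized, e.g. "v1.7.0"
    }

    // SupportsCDI reports whether the runtime can natively inject CDI devices.
    func (r runtimeInfo) SupportsCDI() bool {
        if r.Name == gpuv1.Containerd {
            return semver.Compare(r.Version, "v1.7.0") >= 0
        }
        // cri-o is assumed to support CDI for all versions in the support matrix.
        return true
    }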

}
return runtime, nil
version := strings.SplitAfter(runtimeVer, "//")[1]
vVersion := strings.Join([]string{"v", strings.TrimPrefix(version, "v")}, "")
Member:

I think the following is easier to read?

Suggested change
vVersion := strings.Join([]string{"v", strings.TrimPrefix(version, "v")}, "")
vVersion := "v" + strings.TrimPrefix(version, "v")

Contributor Author:

Agree. Updated as you suggested.
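
For context, a small self-contained sketch of the version handling discussed in this thread, starting from the node's reported ContainerRuntimeVersion string (e.g. "containerd://1.6.33"); the semver package requires the "v" prefix, hence the normalization:

    package main

    import (
        "fmt"
        "strings"

        "golang.org/x/mod/semver"
    )

    // runtimeSupportsCDI parses a CRI runtime version string such as
    // "containerd://1.6.33" and reports whether native CDI can be assumed.
    func runtimeSupportsCDI(containerRuntimeVersion string) bool {
        parts := strings.SplitN(containerRuntimeVersion, "://", 2)
        if len(parts) != 2 || parts[0] != "containerd" {
            // Non-containerd runtimes (cri-o) are assumed to support CDI here.
            return true
        }
        vVersion := "v" + strings.TrimPrefix(parts[1], "v")
        return semver.Compare(vVersion, "v1.7.0") >= 0
    }

    func main() {
        fmt.Println(runtimeSupportsCDI("containerd://1.6.33")) // false
        fmt.Println(runtimeSupportsCDI("containerd://1.7.22")) // true
        fmt.Println(runtimeSupportsCDI("cri-o://1.30.4"))      // true
    }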

This commit adds a 'runtimeSupportsCDI' field to the ClusterPolicyController struct.
When fetching the container runtime version strings from worker nodes, the 'runtimeSupportsCDI'
field is set accordingly. If any worker node is running containerd < 1.7.0, then we
set runtimeSupportsCDI=false.

As part of this commit, we now return an error if there are different container runtimes
running on different worker nodes.

Signed-off-by: Christopher Desiniotis <[email protected]>
…nabled=true

This commit updates the default behavior when cdi.enabled=true. If the container
runtime (containerd, cri-o) supports CDI, we leverage it to inject GPU devices
into workload containers. This means we no longer configure 'nvidia' as the
default runtime.

Signed-off-by: Christopher Desiniotis <[email protected]>
… runtime

When CDI is disabled and cri-o is the container runtime, we fall back to
installing the prestart OCI hook. Our operands will depend on the prestart
hook for getting access to GPUs.

In all other scenarios, we want our operands to leverage the 'nvidia'
runtime class to get access to GPUs.

Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Signed-off-by: Christopher Desiniotis <[email protected]>
Comment on lines +1209 to +1219
if !n.runtimeSupportsCDI {
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaCtrRuntimeCDIPrefixesEnvName, "nvidia.cdi.k8s.io/")
}

// When the container runtime supports CDI, we do not configure 'nvidia' as the default runtime.
// Instead, we leverage native CDI support in containerd / cri-o to inject GPUs into workloads.
// The 'nvidia' runtime will be set as the runtime class for our management containers so that they
// get access to all GPUs.
if n.runtimeSupportsCDI {
setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), NvidiaRuntimeSetAsDefaultEnvName, "false")
}
Member:

So we're saying that we could pull this logic into the toolkit container instead?

Contributor Author:

Yes, that is what I was suggesting. We would need to incorporate this logic in both the toolkit container and device-plugin actually since the configuration of both components is dependent on whether native-CDI is supported.
