Load NVIDIA Kernel Modules for JIT-CDI mode #975
Conversation
This change attempts to load the nvidia, nvidia-uvm, and nvidia-modeset kernel modules before generating the automatic (jit) CDI specification. The kernel modules can be controlled by the nvidia-container-runtime.modes.jit-cdi.load-kernel-modules config option. If this is set to the empty list, then no kernel modules are loaded. Errors in loading the kernel modules are logged, but ignored. Signed-off-by: Evan Lezar <[email protected]>
What is this good for?
@@ -74,6 +74,9 @@ spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime.modes.jit-cdi]
load-kernel-modules = ["nvidia", "nvidia-uvm", "nvidia-modeset"]
How did we end up making this fine selection? 🍷 🍇 🧀
These are the kernel modules that are required for various GPU functionalities. In the `nvidia-container-cli` we do this through `nvidia-modprobe`:

- `nvidia`: https://github.com/NVIDIA/libnvidia-container/blob/95d3e86522976061e856724867ebcaf75c4e9b60/src/nvc.c#L279
- `nvidia-uvm`: https://github.com/NVIDIA/libnvidia-container/blob/95d3e86522976061e856724867ebcaf75c4e9b60/src/nvc.c#L305
- `nvidia-modeset`: https://github.com/NVIDIA/libnvidia-container/blob/95d3e86522976061e856724867ebcaf75c4e9b60/src/nvc.c#L314
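For illustration only, here is a minimal Go sketch of the pattern being discussed: try to load a list of kernel modules and log-but-ignore failures. The `loadKernelModules` helper and the plain `modprobe` invocation are assumptions made for this example; the actual tooling goes through `nvidia-modprobe` as linked above, and this is not the implementation in this PR.

```go
// Hypothetical sketch: load the named kernel modules via modprobe,
// logging failures without aborting, mirroring the behaviour described
// in this PR. Not the actual nvidia-container-toolkit implementation.
package main

import (
	"log"
	"os/exec"
)

// loadKernelModules attempts to load each named module with modprobe.
// Errors are logged but do not stop the remaining modules from being tried.
func loadKernelModules(modules ...string) {
	for _, module := range modules {
		if err := exec.Command("modprobe", module).Run(); err != nil {
			log.Printf("failed to load kernel module %q: %v", module, err)
		}
	}
}

func main() {
	loadKernelModules("nvidia", "nvidia-uvm", "nvidia-modeset")
}
```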
This is actually a typo. The module names should be `nvidia`, `nvidia_uvm`, and `nvidia_modeset`.
Thank you, Evan. I have updated the description with more motivation.
@@ -192,6 +193,11 @@ func generateAutomaticCDISpec(logger logger.Interface, cfg *config.Config, devic
	return nil, fmt.Errorf("failed to construct CDI library: %w", err)
}

// TODO: Consider moving this into the nvcdi API.
if err := driver.LoadKernelModules(cfg.NVIDIAContainerRuntimeConfig.Modes.JitCDI.LoadKernelModules...); err != nil {
@klueska are there any cases where we DON'T want to load / try to load the kernel modules? Note that we also skip this when running in a user namespace in libnvidia-container.
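As a side note on the user-namespace case: one common heuristic for detecting that a process is in a non-initial user namespace is to inspect `/proc/self/uid_map`, which contains the single identity mapping `0 0 4294967295` in the initial namespace. The sketch below is illustrative only and is not necessarily how `libnvidia-container` performs this check.

```go
// Hypothetical sketch: detect a non-initial user namespace by checking
// whether /proc/self/uid_map is the full identity mapping.
package main

import (
	"fmt"
	"os"
	"strings"
)

func inUserNamespace() (bool, error) {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		return false, err
	}
	fields := strings.Fields(string(data))
	// The initial user namespace maps the full uid range onto itself.
	initial := len(fields) == 3 &&
		fields[0] == "0" && fields[1] == "0" && fields[2] == "4294967295"
	return !initial, nil
}

func main() {
	inNS, err := inUserNamespace()
	if err != nil {
		fmt.Println("could not determine user namespace:", err)
		return
	}
	if inNS {
		fmt.Println("in a user namespace; skip loading kernel modules")
	} else {
		fmt.Println("not in a user namespace; kernel modules may be loaded")
	}
}
```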
This change attempts to load the nvidia, nvidia-uvm, and nvidia-modeset kernel modules (NVIDIA kernel-mode GPU drivers) before generating the automatic (jit) CDI specification. This aligns the behaviour of the JIT-CDI mode with that of the `nvidia-container-cli`.

The kernel modules can be controlled by the `nvidia-container-runtime.modes.jit-cdi.load-kernel-modules` config option. If this is set to the empty list, then no kernel modules are loaded.

Errors in loading the kernel modules are logged, but ignored.
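For example, assuming the usual runtime config file location (typically `/etc/nvidia-container-runtime/config.toml`), module loading could be disabled entirely by setting the option to an empty list:

```toml
[nvidia-container-runtime.modes.jit-cdi]
# An empty list disables kernel module loading in jit-cdi mode.
load-kernel-modules = []
```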