
WIP: Add an NRI plugin to filter which NUMA nodes a container has access to #1264


Draft: klueska wants to merge 2 commits into main from nri-numa-plugin

Conversation

klueska
Contributor

@klueska klueska commented May 14, 2025

This PR implements a Node Resource Interface (NRI) plugin to filter the set of NUMA nodes each container started by Kubernetes has access to. By default, each container is given access to either (1) the full set of system NUMA nodes, or (2) the set of system nodes it has already been limited to before hitting the NRI plugin. The NUMA nodes associated with each GPU or MIG device the container has access to are then added one-by-one.

Currently, the implementation works only by parsing the incoming NVIDIA_VISIBLE_DEVICES environment variable set in the container to discover which GPUs / MIG devices it has access to.
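For illustration, a minimal sketch of what that env-var lookup might look like (the helper name getVisibleDevices is hypothetical, not necessarily what this PR uses; it assumes the NRI api.Container exposes the environment as KEY=VALUE strings):

// getVisibleDevices is an illustrative helper: it scans the container's
// environment (KEY=VALUE strings) for NVIDIA_VISIBLE_DEVICES and returns
// the raw value, or "" if the variable is not set.
func getVisibleDevices(container *api.Container) string {
	const prefix = "NVIDIA_VISIBLE_DEVICES="
	for _, kv := range container.Env {
		if strings.HasPrefix(kv, prefix) {
			return strings.TrimPrefix(kv, prefix)
		}
	}
	return ""
}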

Before we can merge this PR we will need to add the following features (at a minimum):

  1. Add a flag to disable the NRI plugin from running entirely
  2. Add a Helm chart option to enable/disable the running of the NRI plugin
  3. Support CDI injected GPU/MIG devices
  4. Support volume mounted GPU/MIG devices
  5. Support NVIDIA_VISIBLE_DEVICES with indices instead of UUIDs


copy-pr-bot bot commented May 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska marked this pull request as draft May 14, 2025 12:37
// getAllSystemNUMANodes gets all NUMA nodes in the system that have CPUs assigned to them
func (p *NUMAFilterPlugin) getAllSystemNUMANodes() ([]int, error) {
	// Read /sys/devices/system/node/node* to get NUMA nodes
	nodes, err := filepath.Glob("/sys/devices/system/node/node*")
Contributor

I assume this path does not exist on non-NUMA systems? On my laptop I seem to have NUMA support:

$ lscpu | grep -i numa
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-13
$ stat /sys/devices/system/node/node0
  File: /sys/devices/system/node/node0
  Size: 0         	Blocks: 0          IO Block: 4096   directory

I wonder if we want to log an error on those systems, or rather have an info-level log message.

Contributor Author

Yeah, we should probably return an error. I'd rather error out and have to deal with a bug report if we ever encounter a system that doesn't have NUMA enabled, than silently skip this without setting anything. One can always disable deployment of the plugin on such a system if they do encounter the error (once we add this option).

Contributor Author

Piotr suggested just maintaining a list of GPU nodes vs. non-GPU nodes instead of an explicit list of system nodes. I think the rest of the logic would be robust to doing this since it operates as follows:

  • When a container is being started, I first inspect what NUMA nodes it already has set
  • If none are set, I set it to the full set of system nodes
  • If something is set, I leave it in place but filter it against the system nodes I know about (essentially filtering out the GPU nodes)
  • I then go back through one-by-one, adding the GPU nodes for the GPUs the container has access to

I think the same logic could be applied if all I tracked was GPU nodes and non-GPU nodes instead of trying to explicitly discover the system nodes.
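A rough sketch of that flow, with purely illustrative names (computeNUMANodes is not necessarily what the PR calls it):

// computeNUMANodes sketches the flow described above: start from the nodes
// already set on the container (or the full set of system nodes if none are
// set), drop anything that is not a known system node, then add the NUMA
// nodes backing each GPU/MIG device the container has access to.
func computeNUMANodes(current, systemNodes, gpuNodes []int) []int {
	if len(current) == 0 {
		current = systemNodes
	}

	system := map[int]bool{}
	for _, n := range systemNodes {
		system[n] = true
	}

	result := map[int]bool{}
	for _, n := range current {
		if system[n] { // filters out any GPU nodes already present
			result[n] = true
		}
	}
	for _, n := range gpuNodes { // add GPU nodes one-by-one
		result[n] = true
	}

	nodes := make([]int, 0, len(result))
	for n := range result {
		nodes = append(nodes, n)
	}
	sort.Ints(nodes)
	return nodes
}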

continue
}

// Only include nodes with non-empty CPU lists
Contributor

So, we define a "system node" to be any NUMA node that has a CPU assigned. Is that right? Is that the one, definite criterion or is there any other criterion? Should we call them cpuNodes? Is there a better way to detect system nodes?

Contributor Author

That is the only criterion. We usually refer to this kind of memory as "system memory", which is why I picked "system nodes", but I'm open to suggestions.
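For context, the criterion as discussed could look roughly like this (illustrative only; hasCPUs is a hypothetical helper and may not match the PR's code):

// hasCPUs reports whether a NUMA node directory such as
// /sys/devices/system/node/node0 has a non-empty CPU list, which is the
// criterion for treating it as a "system node" here.
func hasCPUs(nodeDir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(nodeDir, "cpulist"))
	if err != nil {
		return false, err
	}
	return len(strings.TrimSpace(string(data))) > 0, nil
}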

return uuids, nil
default:
// Handle comma-separated list of UUIDs
return strings.Split(value, ","), nil
Contributor

Can value at times be a list of integers instead of a list of UUIDs?

Contributor Author

Yes, that is why I have the TODO at the top of this saying we need to handle all list-strategies. I didn't include it in the list in the PR description though -- adding now.
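As a sketch of what distinguishing indices from UUIDs might involve (hypothetical helper, not part of this PR yet; actually resolving indices to devices would still need a lookup, e.g. via NVML):

// splitVisibleDevices splits an NVIDIA_VISIBLE_DEVICES value into entries and
// reports whether they all look like integer indices (e.g. "0,1") rather than
// UUIDs (e.g. "GPU-<uuid>" or "MIG-<uuid>").
func splitVisibleDevices(value string) (entries []string, indices bool) {
	entries = strings.Split(value, ",")
	for _, e := range entries {
		if _, err := strconv.Atoi(strings.TrimSpace(e)); err != nil {
			return entries, false
		}
	}
	return entries, true
}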


// UpdateContainer is called when a container's resources are being updated. It ensures the
// container remains on the correct NUMA nodes based on its assigned GPUs and MIG devices.
func (p *NUMAFilterPlugin) UpdateContainer(ctx context.Context, pod *api.PodSandbox, container *api.Container, resources *api.LinuxResources) ([]*api.ContainerUpdate, error) {
Contributor
@jgehrcke jgehrcke May 14, 2025

From where is this called? Ah... I assume it is the NRI lib that is calling into this.

Contributor Author
@klueska klueska May 14, 2025

Yes, it is a callback from the NRI plugin library.
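For readers unfamiliar with NRI, the wiring looks roughly like this, assuming the containerd NRI stub library (github.com/containerd/nri/pkg/stub); the PR's actual setup may differ:

package main

import (
	"context"

	"github.com/containerd/nri/pkg/stub"
)

func main() {
	p := &NUMAFilterPlugin{} // the plugin type shown above (assumed)
	// The stub inspects which NRI hooks the plugin implements
	// (e.g. UpdateContainer) and registers them as callbacks.
	s, err := stub.New(p)
	if err != nil {
		panic(err)
	}
	if err := s.Run(context.Background()); err != nil {
		panic(err)
	}
}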

@klueska klueska force-pushed the nri-numa-plugin branch from 68ce7ce to eea067d May 15, 2025 11:46
@klueska klueska force-pushed the nri-numa-plugin branch 2 times, most recently from 427305b to 5ecb08b May 16, 2025 18:44
@klueska klueska force-pushed the nri-numa-plugin branch from 5ecb08b to f75a895 May 16, 2025 19:21