
WIP: Add an NRI plugin to filter which NUMA nodes a container has access to #1264


Draft: klueska wants to merge 2 commits into main from nri-numa-plugin

Conversation

klueska
Contributor

@klueska klueska commented May 14, 2025

This PR implements a Node Resource Interface (NRI) plugin to filter the set of NUMA nodes each container started by Kubernetes has access to. By default, each container is given access to either (1) the full set of system NUMA nodes, or (2) the set of system nodes it has already been limited to before hitting the NRI plugin. The NUMA nodes associated with each GPU or MIG device the container has access to are then added one-by-one.

Currently, the implementation works only by parsing the incoming NVIDIA_VISIBLE_DEVICES environment variable set in the container to discover which GPUs / MIG devices it has access to.
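For illustration, a minimal sketch of what that env-var lookup might look like (the helper name getVisibleDevices is hypothetical, not necessarily what this PR uses; it assumes the NRI api.Container exposes the environment as KEY=VALUE strings):

// getVisibleDevices is an illustrative helper: it scans the container's
// environment (KEY=VALUE strings) for NVIDIA_VISIBLE_DEVICES and returns
// the raw value, or "" if the variable is not set.
func getVisibleDevices(container *api.Container) string {
	const prefix = "NVIDIA_VISIBLE_DEVICES="
	for _, kv := range container.Env {
		if strings.HasPrefix(kv, prefix) {
			return strings.TrimPrefix(kv, prefix)
		}
	}
	return ""
}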

Before we can merge this PR we will need to add the following features (at a minimum):

  1. Add a flag to disable the NRI plugin from running entirely
  2. Add a Helm chart option to enable/disable the running of the NRI plugin
  3. Support CDI injected GPU/MIG devices
  4. Support volume mounted GPU/MIG devices
  5. Support NVIDIA_VISIBLE_DEVICES with indices instead of UUIDs


copy-pr-bot bot commented May 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@klueska klueska marked this pull request as draft May 14, 2025 12:37
// getAllSystemNUMANodes gets all NUMA nodes in the system that have CPUs assigned to them
func (p *NUMAFilterPlugin) getAllSystemNUMANodes() ([]int, error) {
	// Read /sys/devices/system/node/node* to get NUMA nodes
	nodes, err := filepath.Glob("/sys/devices/system/node/node*")
Contributor

I assume this path does not exist on non-NUMA systems? On my laptop I seem to have NUMA support:

$ lscpu | grep -i numa
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-13
$ stat /sys/devices/system/node/node0
  File: /sys/devices/system/node/node0
  Size: 0         	Blocks: 0          IO Block: 4096   directory

I wonder if we want to log an error on those systems, or rather have an info-level log message.

Contributor Author

Yeah, we should probably return an error. I'd rather error out and have to deal with a bug report if we ever encounter a system that doesn't have NUMA enabled, than silently skip this without setting anything. One can always disable deployment of the plugin on such a system if they do encounter the error (once we add this option).

Contributor Author

Piotr suggested just maintaining a list of GPU nodes vs. non-GPU nodes instead of an explicit list of system nodes. I think the rest of the logic would be robust to doing this since it operates as follows:

  • When a container is being started, I first inspect what NUMA nodes it already has set
  • If none are set, I set it to the full set of system nodes
  • If something is set, I leave it in place but filter it against the system nodes I know about (essentially filtering out the GPU nodes)
  • I then go back through one-by-one, adding the GPU nodes for the GPUs the container has access to

I think the same logic could be applied if all I tracked was GPU nodes and non-GPU nodes instead of trying to explicitly discover the system nodes.
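A rough sketch of that flow, with purely illustrative names (computeNUMANodes is not necessarily what the PR calls it):

// computeNUMANodes sketches the flow described above: start from the nodes
// already set on the container (or the full set of system nodes if none are
// set), drop anything that is not a known system node, then add the NUMA
// nodes backing each GPU/MIG device the container has access to.
func computeNUMANodes(current, systemNodes, gpuNodes []int) []int {
	if len(current) == 0 {
		current = systemNodes
	}

	system := map[int]bool{}
	for _, n := range systemNodes {
		system[n] = true
	}

	result := map[int]bool{}
	for _, n := range current {
		if system[n] { // filters out any GPU nodes already present
			result[n] = true
		}
	}
	for _, n := range gpuNodes { // add GPU nodes one-by-one
		result[n] = true
	}

	nodes := make([]int, 0, len(result))
	for n := range result {
		nodes = append(nodes, n)
	}
	sort.Ints(nodes)
	return nodes
}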

continue
}

// Only include nodes with non-empty CPU lists
Contributor

So, we define a "system node" to be any NUMA node that has a CPU assigned. Is that right? Is that the one, definite criterion or is there any other criterion? Should we call them cpuNodes? Is there a better way to detect system nodes?

Contributor Author

That is the only criterion. We usually refer to this kind of memory as "system memory", which is why I picked "system nodes", but I'm open to suggestions.
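For context, the criterion as discussed could look roughly like this (illustrative only; hasCPUs is a hypothetical helper and may not match the PR's code):

// hasCPUs reports whether a NUMA node directory such as
// /sys/devices/system/node/node0 has a non-empty CPU list, which is the
// criterion for treating it as a "system node" here.
func hasCPUs(nodeDir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(nodeDir, "cpulist"))
	if err != nil {
		return false, err
	}
	return len(strings.TrimSpace(string(data))) > 0, nil
}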

return uuids, nil
default:
// Handle comma-separated list of UUIDs
return strings.Split(value, ","), nil
Contributor

Can value at times be a list of integers instead of a list of UUIDs?

Contributor Author

Yes, that is why I have the TODO at the top of this saying we need to handle all list-strategies. I didn't include it in the list in the PR description though -- adding now.
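As a sketch of what distinguishing indices from UUIDs might involve (hypothetical helper, not part of this PR yet; actually resolving indices to devices would still need a lookup, e.g. via NVML):

// splitVisibleDevices splits an NVIDIA_VISIBLE_DEVICES value into entries and
// reports whether they all look like integer indices (e.g. "0,1") rather than
// UUIDs (e.g. "GPU-<uuid>" or "MIG-<uuid>").
func splitVisibleDevices(value string) (entries []string, indices bool) {
	entries = strings.Split(value, ",")
	for _, e := range entries {
		if _, err := strconv.Atoi(strings.TrimSpace(e)); err != nil {
			return entries, false
		}
	}
	return entries, true
}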


// UpdateContainer is called when a container's resources are being updated. It ensures the
// container remains on the correct NUMA nodes based on its assigned GPUs and MIG devices.
func (p *NUMAFilterPlugin) UpdateContainer(ctx context.Context, pod *api.PodSandbox, container *api.Container, resources *api.LinuxResources) ([]*api.ContainerUpdate, error) {
Contributor
@jgehrcke jgehrcke May 14, 2025

From where is this called? Ah... I assume it is the NRI lib that is calling into this.

Contributor Author
@klueska klueska May 14, 2025

Yes, it is a callback from the NRI plugin library.
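For readers unfamiliar with NRI, the wiring looks roughly like this, assuming the containerd NRI stub library (github.com/containerd/nri/pkg/stub); the PR's actual setup may differ:

package main

import (
	"context"

	"github.com/containerd/nri/pkg/stub"
)

func main() {
	p := &NUMAFilterPlugin{} // the plugin type shown above (assumed)
	// The stub inspects which NRI hooks the plugin implements
	// (e.g. UpdateContainer) and registers them as callbacks.
	s, err := stub.New(p)
	if err != nil {
		panic(err)
	}
	if err := s.Run(context.Background()); err != nil {
		panic(err)
	}
}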

@klueska klueska force-pushed the nri-numa-plugin branch from 68ce7ce to eea067d May 15, 2025 11:46
@klueska klueska force-pushed the nri-numa-plugin branch 2 times, most recently from 427305b to 5ecb08b May 16, 2025 18:44
@klueska klueska force-pushed the nri-numa-plugin branch from 5ecb08b to f75a895 May 16, 2025 19:21