WIP: Add an NRI plugin to filter which NUMA nodes a container has access to #1264
base: main
Conversation
Signed-off-by: Kevin Klues <[email protected]>
// getAllSystemNUMANodes gets all NUMA nodes in the system that have CPUs assigned to them
func (p *NUMAFilterPlugin) getAllSystemNUMANodes() ([]int, error) {
	// Read /sys/devices/system/node/node* to get NUMA nodes
	nodes, err := filepath.Glob("/sys/devices/system/node/node*")
I assume this path does not exist on non-NUMA systems? On my laptop I seem to have NUMA support:
$ lscpu | grep -i numa
NUMA node(s): 1
NUMA node0 CPU(s): 0-13
$ stat /sys/devices/system/node/node0
File: /sys/devices/system/node/node0
Size: 0 Blocks: 0 IO Block: 4096 directory
I wonder if we want to log an error on those systems, or rather have an info-level log message.
Yeah, we should probably return an error. I'd rather error out and have to deal with a bug report if we ever encounter a system that doesn't have NUMA enabled than silently skip this without setting anything. One can always disable deployment of the plugin on such a system if they do encounter the error (once we add this option).
Piotr suggested just maintaining a list of GPU nodes vs. non-GPU nodes instead of an explicit list of system nodes. I think the rest of the logic would be robust to doing this since it operates as follows:
- When a container is being started, I first inspect what NUMA nodes it already has set.
- If none are set, I set it to the full set of system nodes.
- If something is set, I leave it in place but filter it against the system nodes I know about (essentially filtering out the GPU nodes).
- I then go back through, one by one, adding the GPU nodes for the GPUs the container has access to.
I think the same logic could be applied if all I tracked was GPU nodes and non-GPU nodes instead of trying to explicitly discover the system nodes.
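For illustration, here is a minimal sketch of that merge logic, assuming the plugin only tracks GPU and non-GPU node sets as Piotr suggested; the function and parameter names are hypothetical and not taken from the PR:

```go
package numafilter

import "sort"

// mergeNUMANodes computes the final set of NUMA nodes for a container.
//   - existing: nodes already set on the container (empty means "unset")
//   - nonGPUNodes: all CPU-backed ("system") NUMA nodes on the host
//   - gpuNodes: nodes backing the GPUs/MIG devices assigned to the container
func mergeNUMANodes(existing, nonGPUNodes, gpuNodes []int) []int {
	result := map[int]bool{}

	if len(existing) == 0 {
		// Nothing set yet: start from the full set of non-GPU nodes.
		for _, n := range nonGPUNodes {
			result[n] = true
		}
	} else {
		// Something is set: keep it, but filter it against the non-GPU
		// nodes (this drops GPU nodes left over from earlier decisions).
		allowed := map[int]bool{}
		for _, n := range nonGPUNodes {
			allowed[n] = true
		}
		for _, n := range existing {
			if allowed[n] {
				result[n] = true
			}
		}
	}

	// Add back the nodes for the GPUs this container actually has access to.
	for _, n := range gpuNodes {
		result[n] = true
	}

	out := make([]int, 0, len(result))
	for n := range result {
		out = append(out, n)
	}
	sort.Ints(out)
	return out
}
```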
			continue
		}

		// Only include nodes with non-empty CPU lists
So, we define a "system node" to be any NUMA node that has a CPU assigned. Is that right? Is that the single, definitive criterion, or is there any other criterion? Should we call them cpuNodes? Is there a better way to detect system nodes?
That is the only criterion. We usually refer to this kind of memory as "system memory", which is why I picked "system nodes", but I'm open to suggestions.
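For reference, one way to apply that criterion is to check each node's cpulist file in sysfs, since memory-only nodes expose an empty cpulist. This is only a sketch of the idea, with simplified error handling, not the PR's actual implementation:

```go
package numafilter

import (
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// cpuBackedNUMANodes lists NUMA nodes whose cpulist is non-empty,
// i.e. the CPU-backed ("system") nodes.
func cpuBackedNUMANodes() ([]int, error) {
	paths, err := filepath.Glob("/sys/devices/system/node/node[0-9]*")
	if err != nil {
		return nil, err
	}
	var nodes []int
	for _, p := range paths {
		data, err := os.ReadFile(filepath.Join(p, "cpulist"))
		if err != nil {
			return nil, err
		}
		// A node with no CPUs has an empty cpulist (just a newline).
		if strings.TrimSpace(string(data)) == "" {
			continue
		}
		id, err := strconv.Atoi(strings.TrimPrefix(filepath.Base(p), "node"))
		if err != nil {
			return nil, err
		}
		nodes = append(nodes, id)
	}
	sort.Ints(nodes)
	return nodes, nil
}
```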
internal/nri/plugin.go
Outdated
		return uuids, nil
	default:
		// Handle comma-separated list of UUIDs
		return strings.Split(value, ","), nil
Can value at times be a list of integers instead of a list of UUIDs?
Yes, that is why I have the TODO at the top of this saying we need to handle all list-strategies. I didn't include it in the list in the PR description though -- adding now.
// UpdateContainer is called when a container's resources are being updated. It ensures the
// container remains on the correct NUMA nodes based on its assigned GPUs and MIG devices.
func (p *NUMAFilterPlugin) UpdateContainer(ctx context.Context, pod *api.PodSandbox, container *api.Container, resources *api.LinuxResources) ([]*api.ContainerUpdate, error) {
From where is this called? Ah... I assume it is the NRI lib that is calling into this.
Yes, it is a callback from the NRI plugin library.
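For context, here is a rough sketch of how such a plugin is typically wired up with the containerd NRI stub library so that callbacks like UpdateContainer get invoked; the plugin name and wiring details are illustrative, not taken from this PR:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/nri/pkg/api"
	"github.com/containerd/nri/pkg/stub"
)

type plugin struct {
	stub stub.Stub
}

// UpdateContainer implements one of the optional NRI callback interfaces;
// the runtime invokes it (via the stub) when a container's resources change.
func (p *plugin) UpdateContainer(ctx context.Context, pod *api.PodSandbox, ctr *api.Container, res *api.LinuxResources) ([]*api.ContainerUpdate, error) {
	// Recompute and re-apply the NUMA node filtering here.
	return nil, nil
}

func main() {
	p := &plugin{}
	s, err := stub.New(p, stub.WithPluginName("numa-filter"))
	if err != nil {
		log.Fatalf("failed to create NRI stub: %v", err)
	}
	p.stub = s
	if err := s.Run(context.Background()); err != nil {
		log.Fatalf("NRI plugin exited: %v", err)
	}
}
```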
427305b to 5ecb08b
Signed-off-by: Kevin Klues <[email protected]>
This PR implements a Node Resource Interface (NRI) plugin to filter the set of NUMA nodes each container started by Kubernetes has access to. By default, each container is given access to either (1) the full set of system NUMA nodes, or (2) the set of system nodes it has already been limited to before reaching the NRI plugin. The NUMA nodes associated with each GPU or MIG device the container has access to are then added in one by one.
Currently, the implementation only works by parsing the incoming NVIDIA_VISIBLE_DEVICES environment variable set in the container to discover which GPUs / MIG devices it has access to. Before we can merge this PR we will need to add the following features (at a minimum):