Fix CPU and memory affinity under external resource management #3012

Open
wants to merge 1 commit into
base: main

askervin

  • Fixes CPU affinity when running inference on CPU while the CPUs are externally managed, for instance by taskset, numactl, cgroups, the Kubernetes CPU manager, or NRI resource policy plugins.

  • Detects external CPU management and trusts the external CPU manager completely. The external manager is more likely to have the big picture of all other tasks running on the system, their QoS, hardware characteristics, etc.

  • When external management is detected, leaves memory affinity untouched as well, because the external manager may know better which NUMA node has the fastest memory, or which NUMA nodes have enough free memory for this inference.
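
As an illustration of the detection idea above, here is a minimal sketch. It is not the code in this PR. It assumes that "externally managed" can be approximated as "the process is allowed to run on fewer CPUs than the machine has", which covers taskset, numactl and cpuset cgroups but not CPU quota limits. The function names `allowed_cpus`, `cpus_externally_managed` and `choose_thread_count` are hypothetical.

```python
import os


def allowed_cpus():
    """Return the set of CPUs this process may run on.

    os.sched_getaffinity is only available on some platforms (notably
    Linux); elsewhere we fall back to assuming every CPU is allowed.
    """
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


def cpus_externally_managed(allowed, total):
    """Assumption: a narrowed CPU mask means someone else manages CPUs."""
    return len(allowed) < total


def choose_thread_count(allowed, total):
    """Hypothetical policy: one thread per allowed CPU.

    When the mask is narrowed externally, the caller should also leave
    CPU and memory affinity untouched and only size the thread pool.
    """
    if cpus_externally_managed(allowed, total):
        return len(allowed)
    return total


if __name__ == "__main__":
    cpus = allowed_cpus()
    total = os.cpu_count() or 1
    print("allowed CPUs:", sorted(cpus))
    print("externally managed:", cpus_externally_managed(cpus, total))
    print("threads:", choose_thread_count(cpus, total))
```

Running the script under `taskset -c 0-1` on a larger machine should report the process as externally managed and size the thread pool to the two allowed CPUs.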

Fixes #3011

Fixes: huggingface#3011

Signed-off-by: Antti Kervinen <[email protected]>
Development

Successfully merging this pull request may close these issues.

Resource underutilization, thread thrashing: CPU affinity ignores allowed CPUs and cannot be switched off