Contributor

@louie-tsai louie-tsai commented Oct 29, 2025

Purpose

  • Identify a performance issue on GNR.
  • Close the performance gap by pinning the right number of CPUs for each model, and keep the model-to-CPU-count mapping in CSV files that serve as lookup tables.
  • Add Python scripts that generate the right CPU ID list and pin those CPUs for vllm-server through a docker-compose.override.yaml file.
  • Apply the same workflow on EMR.
  • This not only helps Gaudi performance but also frees the remaining idle CPUs for other CPU workloads.
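The lookup-and-pin flow described above could be sketched roughly as follows. This is a minimal illustration, not the scripts in this PR: the CSV column names (`model_id`, `num_cpus`), the model keys, and the helper names `load_cpu_map` and `render_override` are all assumptions made for the example. Only the CPU counts (18 and 12) come from the test results in this PR.

```python
import csv
import io

# Hypothetical lookup table. Column names and model keys are assumptions;
# the CPU counts are the ones reported in this PR's test results.
EXAMPLE_CSV = """model_id,num_cpus
llama3.1-405b,18
llama3.1-70b,12
"""


def load_cpu_map(csv_text):
    """Parse CSV text into a {model_id: num_cpus} dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["model_id"]: int(row["num_cpus"]) for row in reader}


def render_override(cpu_ids, service="vllm-server"):
    """Render a docker-compose override that pins `service` to `cpu_ids`."""
    cpuset = ",".join(str(c) for c in cpu_ids)
    return (
        "services:\n"
        f"  {service}:\n"
        f'    cpuset: "{cpuset}"\n'
        f'    cpus: "{len(cpu_ids)}"\n'
    )
```

A caller would look up the CPU count for the model, choose that many CPU IDs, and write the rendered text to docker-compose.override.yaml, which Docker Compose merges with the base file by default.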

docker-compose.override.yaml example

services:
  vllm-server:
    cpuset: "21,22,23,45,46,47,69,70,71,93,94,95,117,118,119,141,142,143"
    cpus: "18"
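The ID list in the example looks like the last three core IDs taken from each of six contiguous 24-core blocks. A minimal sketch of that selection is shown below, assuming (this is not confirmed by the PR) that the blocks are NUMA nodes with contiguous core IDs. The function name `pick_cpus_per_node` and its default parameters are hypothetical.

```python
def pick_cpus_per_node(num_cpus, num_nodes=6, cores_per_node=24):
    """Spread `num_cpus` evenly over the nodes, taking the highest IDs in each.

    Assumes node n owns core IDs [n * cores_per_node, (n + 1) * cores_per_node).
    That layout is an illustrative assumption, not a property of every system.
    """
    if num_cpus % num_nodes != 0:
        raise ValueError("num_cpus must be a multiple of num_nodes")
    per_node = num_cpus // num_nodes
    if per_node > cores_per_node:
        raise ValueError("not enough cores per node")
    ids = []
    for node in range(num_nodes):
        end = (node + 1) * cores_per_node  # one past the node's last core ID
        ids.extend(range(end - per_node, end))
    return ids
```

Under these assumptions, `pick_cpus_per_node(18)` yields IDs 21-23, 45-47, 69-71, 93-95, 117-119, and 141-143. A real script would read the topology from the system (for example from `lscpu` output) rather than hard-coding it.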

Test Plan

Manually tested.

Test Result

GNR

Pinning different numbers of CPUs produces different throughput, TTFT, and TPOT, and the best CPU count varies by model.

Llama3.1 405B
For Llama3.1 405B, 18 CPU cores gave the best performance, so we map Llama3.1 405B to a CPU count of 18.

image image image

Llama3.1 70B
For Llama3.1 70B, 12 CPU cores gave the best performance, so we map Llama3.1 70B to a CPU count of 12.

image image image

Why does performance drop when we use more CPUs?

Here are the perfspect results for the #CPU=18 and #CPU=24 cases.

#CPU=18
CPU frequency is around 2300 MHz.
image

Gaudi utilization is around 40%.
image

#CPU=24
CPU frequency dropped to ~1800 MHz.
image

Gaudi utilization dropped to 30%.
image

Therefore, pinning more CPU cores than needed can lower the CPU frequency, and the slower CPU in turn lowers Gaudi utilization.
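For a quick check of the frequency effect without perfspect, the per-core frequency can also be read from /proc/cpuinfo on x86 Linux. The sketch below is only an illustration and not part of this PR. The helper name `mean_cpu_mhz` is hypothetical, and it assumes the usual `processor` and `cpu MHz` fields are present in the file.

```python
def mean_cpu_mhz(cpuinfo_text, cpu_ids):
    """Average the 'cpu MHz' values of the given logical CPU IDs.

    `cpuinfo_text` is the content of /proc/cpuinfo. The parser assumes the
    x86 Linux layout, where each CPU entry has 'processor' and 'cpu MHz' lines.
    """
    wanted = set(cpu_ids)
    freqs = []
    current = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current in wanted:
            freqs.append(float(value))
    if not freqs:
        raise ValueError("no matching CPUs with a 'cpu MHz' field")
    return sum(freqs) / len(freqs)
```

A call such as `mean_cpu_mhz(open("/proc/cpuinfo").read(), [21, 22, 23])` could be sampled while the benchmark runs to compare pinning configurations.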

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Collaborator

@PatrykWo PatrykWo left a comment

Please fix the pre-commit failures.
The README is quite unfriendly. Let's sync.

@louie-tsai force-pushed the cpu_pinning branch 3 times, most recently from 1ea2924 to 8b3160f, on October 29, 2025 22:34
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
d4aa14434397b46a562f93d0371719e62d9bd62d

@louie-tsai louie-tsai requested a review from PatrykWo October 30, 2025 02:55
Collaborator

@PatrykWo PatrykWo left a comment

Some changes are needed.

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

github-actions bot commented Nov 4, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
0384aa7150c4c9778efca041ffd1beb3ad2bd694

@louie-tsai force-pushed the cpu_pinning branch 2 times, most recently from 01f8f55 to 4d36048, on November 5, 2025 17:03
@github-actions

github-actions bot commented Nov 5, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
0384aa7150c4c9778efca041ffd1beb3ad2bd694

@PatrykWo PatrykWo added the enhancement New feature or request label Nov 6, 2025
@github-actions

github-actions bot commented Nov 6, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

louie-tsai and others added 5 commits November 6, 2025 12:09
Signed-off-by: louie-tsai <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: louie-tsai <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
use one CPU id per core, and fallback model_id match if no input/output match

Signed-off-by: louie-tsai <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
move all scripts under cpu_binding folder
updated README

Signed-off-by: louie-tsai <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
@louie-tsai force-pushed the cpu_pinning branch 2 times, most recently from 89018b5 to 45e0745, on November 6, 2025 20:10
@louie-tsai louie-tsai requested a review from PatrykWo November 6, 2025 20:10
@github-actions

github-actions bot commented Nov 6, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
0384aa7150c4c9778efca041ffd1beb3ad2bd694

Collaborator

@PatrykWo PatrykWo left a comment

LGTM

@PatrykWo PatrykWo merged commit 2878f62 into vllm-project:main Nov 7, 2025
37 checks passed