[New Feature] Add cpu core pinning to vllm-server to improve performance. #502
🚧 CI BlockedThe main CI workflow was not started for the following reason:
|
ffe97f3 to
403c9c6
Compare
403c9c6 to
c070b05
Compare
PatrykWo
left a comment
Correct pre-commit.
The readme is quite unfriendly. Let's sync
1ea2924 to
8b3160f
Compare
✅ CI PassedAll checks passed successfully against the following vllm commit: |
PatrykWo
left a comment
Some changes are needed.
64ec550 to
df4933d
Compare
3b581d8 to
a5c5ce0
Compare
11a1487 to
ee6ea33
Compare
✅ CI PassedAll checks passed successfully against the following vllm commit: |
ee6ea33 to
ca9a0fa
Compare
✅ CI PassedAll checks passed successfully against the following vllm commit: |
01f8f55 to
4d36048
Compare
✅ CI PassedAll checks passed successfully against the following vllm commit: |
87b5236 to
a4667e5
Compare
… 70B Signed-off-by: louie-tsai <[email protected]> Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: louie-tsai <[email protected]> Signed-off-by: Tsai, Louie <[email protected]>
Signed-off-by: louie-tsai <[email protected]> Signed-off-by: Tsai, Louie <[email protected]>
use one CPU id per core, and fallback model_id match if no input/output match Signed-off-by: louie-tsai <[email protected]> Signed-off-by: Tsai, Louie <[email protected]>
move all scripts under cpu_binding folder updated README Signed-off-by: louie-tsai <[email protected]> Signed-off-by: Tsai, Louie <[email protected]>
89018b5 to
45e0745
Compare
✅ CI PassedAll checks passed successfully against the following vllm commit: |
PatrykWo
left a comment
LGTM
Purpose
docker-compose.override.yaml example
services:
  vllm-server:
    cpuset: "21,22,23,45,46,47,69,70,71,93,94,95,117,118,119,141,142,143"
    cpus: "18"
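The cpuset string in the example appears to take the last three CPU ids from each block of 24 ids, across six blocks, for 18 ids in total. A minimal Python sketch of generating such a string is shown here. The function name `build_cpuset` and its parameters are hypothetical and are not taken from the scripts in this PR, and the block layout is an assumption read off the example rather than a documented topology.

```python
def build_cpuset(num_groups: int, group_size: int, per_group: int) -> str:
    """Return a comma-separated cpuset string.

    Assumption: CPU ids are laid out in consecutive blocks of `group_size`
    ids, and we pin the last `per_group` ids of each block.
    """
    if per_group > group_size:
        raise ValueError("per_group cannot exceed group_size")
    ids = []
    for group in range(num_groups):
        end = (group + 1) * group_size  # exclusive upper bound of this block
        ids.extend(range(end - per_group, end))
    return ",".join(str(cpu_id) for cpu_id in ids)


# Six blocks of 24 ids, three ids pinned per block (18 ids in total).
cpuset = build_cpuset(num_groups=6, group_size=24, per_group=3)
print(f'cpuset: "{cpuset}"')
print(f'cpus: "{len(cpuset.split(","))}"')
```

The two printed lines can be pasted under the `vllm-server` service in a docker-compose.override.yaml.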
Test Plan
Manually tested.
Test Result
GNR
Pinning different numbers of CPUs changes throughput, TTFT, and TPOT, and the effect differs across models.
Llama3.1 405B
For Llama3.1 405B, 18 CPU cores gave the best performance, so we map Llama3.1 405B to a CPU count of "18".
Llama3.1 70B
For Llama3.1 70B, 12 CPU cores gave the best performance, so we map Llama3.1 70B to a CPU count of "12".
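The model-to-CPU-count mapping described above can be sketched as a small lookup. This is only an illustration: the table name `CPUS_BY_MODEL_SIZE`, the function `cpus_for_model`, and the substring matching on the model size are assumptions, not the logic implemented in the PR's scripts.

```python
from typing import Optional

# Core counts reported as best in the test results above.
# Keys are lowercase size tags matched against the model name (an assumption).
CPUS_BY_MODEL_SIZE = {
    "405b": 18,
    "70b": 12,
}


def cpus_for_model(model_id: str, default: Optional[int] = None) -> Optional[int]:
    """Return the pinned CPU count for a model, or `default` if unknown."""
    name = model_id.lower()
    for size_tag, num_cpus in CPUS_BY_MODEL_SIZE.items():
        if size_tag in name:
            return num_cpus
    return default


for model in ("Llama3.1 405B", "Llama3.1 70B", "some-other-model"):
    print(model, "->", cpus_for_model(model))
```

Returning `default` for unknown models leaves the caller free to skip pinning when no measurement exists for that model.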
Why does performance drop when we use more CPUs?
Here are the PerfSpect results for the #CPU=18 and #CPU=24 cases.
#CPU=18

CPU frequency is around 2300 MHz.
Gaudi utilization is around 40%.

#CPU = 24

CPU frequency dropped to ~1800 MHz.
Gaudi utilization dropped to 30%.

Therefore, pinning more CPU cores than needed can lower the CPU frequency, and the slower CPU in turn lowers Gaudi utilization.
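To put the two observations on the same scale, the relative drops can be computed from the approximate figures reported above (2300 to 1800 for CPU frequency, 40% to 30% for Gaudi utilization). This is a small arithmetic sketch; the helper name `relative_drop` is hypothetical, and the inputs are the rounded readings from this description, not raw PerfSpect data.

```python
def relative_drop(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


# Approximate readings for the #CPU=18 and #CPU=24 cases.
freq_drop = relative_drop(before=2300, after=1800)
util_drop = relative_drop(before=40, after=30)

print(f"CPU frequency drop:     {freq_drop:.1f}%")
print(f"Gaudi utilization drop: {util_drop:.1f}%")
```

With these inputs the frequency falls by roughly 22% and the utilization by 25%, so the two drops are of similar size.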