
Commit 2878f62

louie-tsai and romirdes authored
[New Feature] Add cpu core pinning to vllm-server to improve performance. (#502)
## Purpose

- Identify a performance issue on GNR.
- Close the performance gap by pinning the right number of CPUs for each model, maintaining the model-to-CPU-count mapping in CSV files as lookup tables.
- Add Python scripts that generate the right CPU ID list and pin the CPUs for vllm-server via a docker-compose.override.yaml file.
- Apply the same workflow on EMR.
- This not only helps Gaudi performance but also releases idle CPUs to other CPU workloads.

docker-compose.override.yaml example:

```yaml
services:
  vllm-server:
    cpuset: "21,22,23,45,46,47,69,70,71,93,94,95,117,118,119,141,142,143"
    cpus: "18"
```

## Test Plan

Manually tested.

## Test Result

### GNR

By pinning different numbers of CPUs, we observed different throughput, TTFT, and TPOT for different models.

**Llama3.1 405B**

For Llama3.1 405B, 18 CPU cores gave the best performance, so we map Llama3.1 405B to a CPU count of 18.

<img width="633" height="289" alt="image" src="https://github.com/user-attachments/assets/0a0bc518-d74d-4b85-907c-19b55d8ebdd4" />
<img width="590" height="286" alt="image" src="https://github.com/user-attachments/assets/0a2ad257-273d-4174-89aa-4f2ee84bbb3e" />
<img width="568" height="292" alt="image" src="https://github.com/user-attachments/assets/47ddf263-879d-477d-a13c-fb40a3162eb4" />

**Llama3.1 70B**

For Llama3.1 70B, 12 CPU cores gave the best performance, so we map Llama3.1 70B to a CPU count of 12.

<img width="687" height="283" alt="image" src="https://github.com/user-attachments/assets/f53a17f7-cb25-4fd2-a637-d4975d0b2089" />
<img width="585" height="289" alt="image" src="https://github.com/user-attachments/assets/6465f684-5877-462b-8303-4dc526069614" />
<img width="594" height="285" alt="image" src="https://github.com/user-attachments/assets/c752b313-13f6-4f42-a20b-65d2aa54b095" />

**Why does performance drop when we use more CPUs?**

Here are perfspect results for the #CPU=18 and #CPU=24 cases.

**#CPU=18**

CPU frequency is around 2300 MHz.
<img width="1041" height="559" alt="image" src="https://github.com/user-attachments/assets/0d681b47-ec30-45be-ae85-69150dcca65d" />
Gaudi utilization is around 40%.
<img width="1029" height="565" alt="image" src="https://github.com/user-attachments/assets/ef5c5c2e-28af-46d5-8e1f-3be44178333f" />

**#CPU=24**

CPU frequency dropped to ~1800 MHz.
<img width="1035" height="549" alt="image" src="https://github.com/user-attachments/assets/c834d1f2-15d9-443b-b615-45012fecdecb" />
Gaudi utilization dropped to 30%.
<img width="1022" height="564" alt="image" src="https://github.com/user-attachments/assets/ead9220c-7afd-44e9-b3b9-64528c96040d" />

Therefore, allocating more CPU cores than needed can lower the CPU frequency, which in turn lowers Gaudi utilization because the CPU side becomes the bottleneck.

---------

Signed-off-by: louie-tsai <[email protected]>
Signed-off-by: Tsai, Louie <[email protected]>
Co-authored-by: romir desai <[email protected]>
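The cpuset in the example above follows a regular stride: the last 3 physical cores of each of six 24-core NUMA blocks are pinned (6 × 3 = 18 cores, matching `cpus: "18"`). A minimal sketch that reproduces the list, assuming that 24-cores-per-node, 6-node layout (inferred from the example, not stated in the commit), mirroring the "bind to the last N cpu cores" option in the binder script:

```python
# Hypothetical sketch: pick the last `allocate` physical cores of each NUMA
# node; the layout constants below are assumptions inferred from the example.
cores_per_numa, numa_nodes, allocate = 24, 6, 3
cpuset = [node * cores_per_numa + core
          for node in range(numa_nodes)
          for core in range(cores_per_numa - allocate, cores_per_numa)]
print(",".join(map(str, cpuset)))
# -> 21,22,23,45,46,47,69,70,71,93,94,95,117,118,119,141,142,143
```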
1 parent c6eead0 commit 2878f62

File tree

10 files changed, +452 -3 lines changed


.cd/README.md

Lines changed: 36 additions & 2 deletions
@@ -129,7 +129,7 @@ cd vllm-gaudi/.cd/
 MAX_MODEL_LEN=2048 \
 INPUT_TOK=128 \
 OUTPUT_TOK=128 \
-CON_REQ=16 \
+CONCURRENT_REQ=16 \
 NUM_PROMPTS=64 \
 docker compose --profile benchmark up
 ```
@@ -159,7 +159,41 @@ cd vllm-gaudi/.cd/
 > [!NOTE]
 > When using configuration files, you do not need to set the `MODEL` environment variable, as the model name is specified within the configuration file. However, you must still provide your `HF_TOKEN`.
 
-### 7. Running the Server Directly with Docker
+### 7. Advanced Options: Pinning CPU Cores for Memory Access Coherence
+
+To improve memory access coherence and release CPUs to other CPU-only workloads (such as vLLM serving of Llama3 8B),
+pin the CPU cores per CPU NUMA node using an auto-generated docker-compose.override.yml file.
+Validated Xeon processors so far: Intel Xeon 6960P and Intel Xeon PLATINUM 8568Y+.
+
+A couple of Python libraries are needed by the scripts, so install the required packages using the following command:
+
+```bash
+pip install -r vllm-gaudi/.cd/server/cpu_binding/requirements_cpu_binding.txt
+```
+
+Run the command below to pin CPU cores via the auto-generated docker-compose.override.yml file:
+
+```bash
+export MODEL="Qwen/Qwen2.5-14B-Instruct"
+export HF_TOKEN="<your huggingface token>"
+export DOCKER_IMAGE="<docker image url>"
+python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml
+docker compose --profile benchmark up
+```
+
+To also pin the idle CPUs to another service such as vllm-cpu-service, pass the service name so that
+docker-compose.override.yml is updated to bind that service to the idle CPUs.
+Here is an example that binds the idle CPUs to the vllm-cpu-service service, where docker-compose.vllm-cpu-service.yml defines the CPU service:
+
+```bash
+export MODEL="Qwen/Qwen2.5-14B-Instruct"
+export HF_TOKEN="<your huggingface token>"
+export DOCKER_IMAGE="<docker image url>"
+python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml --cpuservice vllm-cpu-service
+docker compose --profile benchmark -f docker-compose.yml -f docker-compose.vllm-cpu-service.yml -f docker-compose.override.yml up
+```
+
+### 8. Running the Server Directly with Docker
 
 For full control, you can run the server using the `docker run` command. This approach allows you to specify any native Docker parameters as needed.
 
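For orientation, the override file that generate_cpu_binding_from_csv.py emits should look roughly like the GNR example from the commit message. A sketch, assuming a GNR host; the exact cpuset depends on your NUMA topology and on the CSV row that matches your MODEL:

```yaml
# Hypothetical generated docker-compose.override.yml (values from the GNR
# example in the commit message; your core IDs will differ)
services:
  vllm-server:
    cpuset: "21,22,23,45,46,47,69,70,71,93,94,95,117,118,119,141,142,143"
    cpus: "18"
```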

.cd/benchmark/benchmark_user.env

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 MODEL
 INPUT_TOK
 OUTPUT_TOK
-CON_REQ
+CONCURRENT_REQ
 NUM_PROMPTS

.cd/docker-compose.yml

Lines changed: 2 additions & 0 deletions
@@ -42,4 +42,6 @@ services:
       - PYTHONUNBUFFERED=1
     env_file:
       - ./benchmark/benchmark_user.env
+    volumes:
+      - ./logs:/root/scripts/logs
     command: ["benchmark", "--config-file", "${VLLM_BENCHMARK_CONFIG_FILE}", "--config-name", "${VLLM_BENCHMARK_CONFIG_NAME}"]
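The new volumes entry surfaces the container's benchmark output on the host. A quick check after a run, assuming the benchmark writes its files under /root/scripts/logs as the mount suggests (the exact filenames are not shown in this diff):

```bash
docker compose --profile benchmark up
ls ./logs/   # host-side view of /root/scripts/logs inside the container
```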

.cd/logs/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+*\n!/.gitignore
Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
+# SPDX-License-Identifier: Apache-2.0
+import os
+import csv
+from importlib import util
+from enum import Enum
+from gaudi_topology import GaudiTopology
+
+REQUIRED_COLUMNS = ["model_id", "input_length", "output_length", "world_size", "data_type", "num_allocated_cpu"]
+
+
+class BindingPolicy(Enum):
+    Evenly_on_NUMAs = "evenly"
+    NUMAs_with_cards = "close2cards"
+
+
+class CPU_Binding:
+
+    def __init__(self, csv_path: str = "cpu_binding_gnr.csv", use_hyperthread: bool = False):
+        self.libnuma_found = util.find_spec("numa") is not None
+        self.psutil_found = util.find_spec("psutil") is not None
+        if self.libnuma_found and self.psutil_found:
+            import psutil
+            from numa import info
+            # Get system info
+            self.cpu_count = psutil.cpu_count(logical=False)
+            self.cpus_allow_list = psutil.Process().cpu_affinity()
+            #print("cpu allow list:", self.cpus_allow_list)
+            self.numa_size = info.get_num_configured_nodes()
+            self.cpu_count_per_numa = self.cpu_count // self.numa_size
+
+        # Get CSV info
+        with open(csv_path, newline="") as f:
+            rows = list(csv.DictReader(f))
+        if not rows or any(col not in rows[0] for col in REQUIRED_COLUMNS):
+            found = list(rows[0].keys()) if rows else "EMPTY CSV"
+            raise ValueError(f"CSV missing required headers {REQUIRED_COLUMNS}. Found: {found}")
+        model = os.environ.get("MODEL")
+        if not model:
+            raise RuntimeError("Set environment variable MODEL to a model_id in the CSV.")
+        input_tok = os.environ.get("INPUT_TOK")
+        output_tok = os.environ.get("OUTPUT_TOK")
+        con_req = os.environ.get("CONCURRENT_REQ")
+        num_allocated_cpu = os.environ.get("NUM_CPUS")
+        print(num_allocated_cpu)
+
+        row = self.pick_row_by_parameters(rows, model, input_tok, output_tok, con_req)
+        print(row["num_allocated_cpu"])
+
+        self.world_size = self.parse_int(row["world_size"], "world_size")
+        binding_policy_index = self.parse_int(row["binding_policy"], "binding_policy")
+        self.binding_policy = list(BindingPolicy)[binding_policy_index]
+
+        if num_allocated_cpu:
+            self.num_allocated_cpu = int(num_allocated_cpu)
+        elif row["num_allocated_cpu"] == 'NA':
+            raise RuntimeError("Invalid num_allocated_cpu value ('NA') in CSV. Set environment variable NUM_CPUS instead.")
+        else:
+            self.num_allocated_cpu = self.parse_int(row["num_allocated_cpu"], "num_allocated_cpu")
+
+        # CPU: build per-NUMA-node CPU lists restricted to the allowed CPUs
+        self.node_to_cpus = []
+        for i in range(self.numa_size):
+            from numa import info
+            filtered_node_to_cpus = self.filter_one_cpu_per_core(info.node_to_cpus(i))
+            node_intersect = [cpu for cpu in filtered_node_to_cpus if cpu in self.cpus_allow_list]
+            if node_intersect:
+                self.node_to_cpus.append(list(node_intersect))
+        self.node_to_idle_cpus = self.node_to_cpus.copy()
+        #self.node_to_idle_cpus_ht = []  #self.node_to_cpus
+        for i in range(self.numa_size):
+            if use_hyperthread is False:
+                self.node_to_idle_cpus[i] = self.node_to_cpus[i][:self.cpu_count_per_numa]
+            else:
+                self.node_to_idle_cpus[i] = self.node_to_cpus[i][self.cpu_count_per_numa:]
+        # Gaudi
+        topo = GaudiTopology()
+        self.cards = topo.get_cards()
+        if self.cards is not None:
+            self.gaudi_numa_list = []
+            # Assume cards 0 to 7 are used
+            for card in self.cards[:self.world_size]:
+                if card['numa_node'] not in self.gaudi_numa_list:
+                    self.gaudi_numa_list.append(card['numa_node'])
+                print(f"Card {card['card_id']} ({card['model']}):")
+                print(f"  Bus ID     : {card['bus_id']}")
+                print(f"  NUMA Node  : {card['numa_node']}")
+                print(f"  Local CPUs : {card['local_cpulist']}")
+
+    def parse_int(self, v: str, name: str) -> int:
+        try:
+            return int(v)
+        except Exception as err:
+            raise ValueError(f"Invalid integer for {name!r}: {v!r}") from err
+
+    def pick_row_by_parameters(self, rows: list[dict], model: str, input_tok: str, output_tok: str,
+                               con_req: str) -> dict:
+        matches = [
+            r for r in rows if r.get("model_id", "").strip() == model
+            and r.get("input_length", "").strip() == input_tok
+            and r.get("output_length", "").strip() == output_tok
+        ]
+        if not matches:
+            # Fallback: match only by model_id
+            matches = [r for r in rows if r.get('model_id', '') == model]
+            print(f"Warning: using fallback entry for model '{model}' without exact input/output token match")
+        if not matches:
+            available = ", ".join(sorted({r.get('model_id', '') for r in rows}))
+            raise ValueError(f"MODEL '{model}', input_length '{input_tok}', output_length '{output_tok}' "
+                             f"not found in CSV. Available: {available}")
+        return matches[0]
+
+    def filter_one_cpu_per_core(self, cpus):
+        """
+        Given a list of CPU IDs (possibly with HT pairs),
+        return a filtered list with only one logical CPU per physical core.
+        """
+        seen_cores = set()
+        filtered = []
+        for cpu in sorted(cpus):
+            core_path = f"/sys/devices/system/cpu/cpu{cpu}/topology/core_id"
+            try:
+                with open(core_path) as f:
+                    core_id = int(f.read().strip())
+            except FileNotFoundError:
+                continue
+            if core_id not in seen_cores:
+                seen_cores.add(core_id)
+                filtered.append(cpu)
+        return filtered
+
+    def get_cpus_id_binding_based_on_numa_nodes(self, rank: int) -> str:
+        """Return the CPU ID binding for a rank based on NUMA nodes."""
+        rank_to_cpus = ''
+        if not self.libnuma_found or not self.psutil_found:
+            print("Auto thread-binding is not supported due to "
+                  "the lack of packages numa and psutil; "
+                  "falling back to no thread-binding. To get better performance, "
+                  "please try to bind threads manually.")
+            return rank_to_cpus
+
+        if self.binding_policy is BindingPolicy.Evenly_on_NUMAs or self.cards is None:
+            #divider = min(self.world_size, len(self.node_to_cpus))
+            self.allocated_cpu_per_numa = self.num_allocated_cpu // len(self.node_to_cpus)
+            node_id = rank
+        elif self.binding_policy is BindingPolicy.NUMAs_with_cards:
+            self.allocated_cpu_per_numa = self.num_allocated_cpu // len(self.gaudi_numa_list)
+            node_id = int(self.cards[rank]['numa_node'])
+
+        print("binding numa node_id %d allocated_cpu_per_numa %d" % (node_id, self.allocated_cpu_per_numa))
+        # Option 1. Bind to the last N CPU cores
+        start = self.cpu_count_per_numa - self.allocated_cpu_per_numa
+        rank_to_cpus_list = self.node_to_cpus[node_id][start:self.cpu_count_per_numa]
+        # Option 2. Bind to the first N CPU cores
+        #rank_to_cpus_list = self.node_to_cpus[node_id][:self.allocated_cpu_per_numa]
+
+        rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list)
+        print("rank %d auto thread-binding list: %s" % (rank, rank_to_cpus))
+        self.node_to_idle_cpus[node_id] = [
+            cpu for cpu in self.node_to_idle_cpus[node_id] if cpu not in rank_to_cpus_list
+        ]
+        return rank_to_cpus
+
+
+if __name__ == "__main__":
+    libnuma_found = util.find_spec("numa") is not None
+    if libnuma_found:
+        from numa import info
+        numa_size = info.get_num_configured_nodes()
+    else:
+        numa_size = 1
+    world_size = numa_size
+    cpu_binder = CPU_Binding(use_hyperthread=False)
+    if cpu_binder.binding_policy is BindingPolicy.Evenly_on_NUMAs or cpu_binder.cards is None:
+        max_needed_numa_size = len(cpu_binder.node_to_cpus)
+    elif cpu_binder.binding_policy is BindingPolicy.NUMAs_with_cards:
+        max_needed_numa_size = min(cpu_binder.world_size, len(cpu_binder.node_to_cpus))
+    for i in range(max_needed_numa_size):
+        rank_to_cpus = cpu_binder.get_cpus_id_binding_based_on_numa_nodes(i)
+        print(rank_to_cpus)
+
+    rank_to_idle_cpus = ','.join(str(x) for row in cpu_binder.node_to_idle_cpus for x in row)
+    print(rank_to_idle_cpus)
+    for r in cpu_binder.node_to_idle_cpus:
+        print(len(r))
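A quick way to exercise the binder standalone is sketched below. The new file's on-disk path is not shown in this diff, so the module name `cpu_binding` is an assumption; the environment variables mirror the ones the constructor reads, and the run requires a host with the numa and psutil packages installed:

```python
# Hypothetical driver; assumes the file above is saved as cpu_binding.py
# next to cpu_binding_gnr.csv (the actual path is not shown in this diff).
import os

os.environ["MODEL"] = "meta-llama/Llama-3.1-70B-Instruct"
os.environ["INPUT_TOK"] = "2048"
os.environ["OUTPUT_TOK"] = "2048"

from cpu_binding import CPU_Binding  # assumed module name

binder = CPU_Binding(csv_path="cpu_binding_gnr.csv", use_hyperthread=False)
for rank in range(len(binder.node_to_cpus)):
    # With binding_policy 0 (Evenly_on_NUMAs), rank maps directly to a NUMA node
    print(rank, binder.get_cpus_id_binding_based_on_numa_nodes(rank))
```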
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+model_id,input_length,output_length,world_size,data_type,num_allocated_cpu,binding_policy
+meta-llama/Llama-3.1-405B-Instruct,128,4096,8,bf16,24,0
+meta-llama/Llama-3.1-405B-Instruct,2048,2048,8,bf16,24,0
+meta-llama/Llama-3.1-405B-Instruct,4096,128,8,bf16,24,0
+meta-llama/Llama-3.1-70B-Instruct,128,4096,4,bf16,24,0
+meta-llama/Llama-3.1-70B-Instruct,2048,2048,4,bf16,24,0
+meta-llama/Llama-3.1-70B-Instruct,4096,128,4,bf16,24,0
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+model_id,input_length,output_length,world_size,data_type,num_allocated_cpu,binding_policy
+meta-llama/Llama-3.1-405B-Instruct,128,4096,8,bf16,18,0
+meta-llama/Llama-3.1-405B-Instruct,2048,2048,8,bf16,18,0
+meta-llama/Llama-3.1-405B-Instruct,4096,128,8,bf16,18,0
+meta-llama/Llama-3.1-70B-Instruct,128,4096,4,bf16,12,0
+meta-llama/Llama-3.1-70B-Instruct,2048,2048,4,bf16,12,0
+meta-llama/Llama-3.1-70B-Instruct,4096,128,4,bf16,12,0
+meta-llama/Llama-3.1-8B-Instruct,128,4096,1,bf16,6,0
+meta-llama/Llama-3.1-8B-Instruct,2048,2048,1,bf16,6,0
+meta-llama/Llama-3.1-8B-Instruct,4096,128,1,bf16,6,0
+Qwen/Qwen2.5-14B-Instruct,2048,2048,1,bf16,6,0
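The binder picks a row by exact (model_id, input_length, output_length) match and falls back to the first row matching the model_id alone. A minimal sketch of that lookup against the CSV above, assuming it is saved as cpu_binding_gnr.csv:

```python
# Minimal sketch of the row selection the binder performs (see
# pick_row_by_parameters above); assumes cpu_binding_gnr.csv on disk.
import csv

with open("cpu_binding_gnr.csv", newline="") as f:
    rows = list(csv.DictReader(f))

model, input_tok, output_tok = "meta-llama/Llama-3.1-70B-Instruct", "2048", "2048"
matches = [r for r in rows
           if r["model_id"].strip() == model
           and r["input_length"].strip() == input_tok
           and r["output_length"].strip() == output_tok]
# Fallback: first row for the model when no exact token match exists
row = matches[0] if matches else next(r for r in rows if r["model_id"] == model)
print(row["num_allocated_cpu"], row["binding_policy"])  # -> 12 0
```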
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+# ==============================================================================
+# gaudi_topology.py
+# Provides GaudiTopology class:
+#   - discover all Gaudi cards via hl-smi
+#   - return NUMA node and CPU IDs per card
+# Works with hl-smi v1.22.0+ (HL-325L / Gaudi3) table format.
+# ==============================================================================
+
+import subprocess
+import re
+import os
+from typing import Optional
+import shutil
+
+
+class GaudiTopology:
+    """Utility class to discover Gaudi cards and their NUMA / CPU locality."""
+
+    def __init__(self):
+        self.cards = self._discover_cards()
+
+    # ------------------------------------------------------------------
+    def _run_cmd(self, cmd: str) -> str:
+        """Run a shell command and return stdout."""
+        try:
+            result = subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)
+            return result.stdout
+        except subprocess.CalledProcessError as e:
+            raise RuntimeError(f"Command failed: {cmd}\n{e.stderr}") from e
+
+    # ------------------------------------------------------------------
+    def _parse_hl_smi_table(self, text: str) -> list[dict]:
+        """
+        Parse hl-smi v1.22+ table format.
+        Example line:
+        |   0  HL-325L   N/A  | 0000:97:00.0   N/A | ...
+        """
+        cards = []
+        pattern = re.compile(r'^\|\s*(\d+)\s+([A-Z0-9-]+)\s+N/A\s+\|\s*([0-9a-fA-F:.]+)\s+N/A\s*\|')
+        for line in text.splitlines():
+            match = pattern.match(line)
+            if not match:
+                continue
+            card_id, model, bus_id = match.groups()
+            if not bus_id.startswith("0000:"):
+                bus_id = "0000:" + bus_id
+            cards.append({"card_id": int(card_id), "model": model, "bus_id": bus_id})
+        return cards
+
+    # ------------------------------------------------------------------
+    def _get_sysfs_info(self, bus_id: str) -> dict[str, Optional[str]]:
+        """Fetch NUMA node and local CPU list from sysfs."""
+        sys_path = f"/sys/bus/pci/devices/{bus_id}"
+        info = {"numa_node": None, "local_cpulist": None}
+        try:
+            with open(os.path.join(sys_path, "numa_node")) as f:
+                info["numa_node"] = f.read().strip()
+        except FileNotFoundError:
+            pass
+        try:
+            with open(os.path.join(sys_path, "local_cpulist")) as f:
+                info["local_cpulist"] = f.read().strip()
+        except FileNotFoundError:
+            pass
+        return info
+
+    # ------------------------------------------------------------------
+    def _discover_cards(self) -> Optional[list[dict]]:
+        """Run hl-smi and discover Gaudi cards. Returns None when hl-smi is absent."""
+        if shutil.which("hl-smi") is None:
+            print("No hl-smi found")
+            return None
+
+        hl_smi_output = self._run_cmd("hl-smi")
+        cards = self._parse_hl_smi_table(hl_smi_output)
+        for c in cards:
+            sysfs_info = self._get_sysfs_info(c["bus_id"])
+            c.update(sysfs_info)
+        return cards
+
+    # ------------------------------------------------------------------
+    def get_cards(self) -> Optional[list[dict]]:
+        """Return list of all discovered cards sorted by NUMA node (then card_id)."""
+        if self.cards is None:
+            return None
+
+        def sort_key(c):
+            # Convert numa_node to int when possible, else put N/A at the end
+            try:
+                return (int(c["numa_node"]), c["card_id"])
+            except (TypeError, ValueError):
+                return (999, c["card_id"])
+
+        return sorted(self.cards, key=sort_key)
+
+    # ------------------------------------------------------------------
+    def get_numa_for_card(self, card_id: int) -> Optional[str]:
+        """Return NUMA node for a given card ID."""
+        for c in self.cards:
+            if c["card_id"] == card_id:
+                return c["numa_node"]
+        return None
+
+    # ------------------------------------------------------------------
+    def get_cpus_for_card(self, card_id: int) -> Optional[str]:
+        """Return local CPU list for a given card ID."""
+        for c in self.cards:
+            if c["card_id"] == card_id:
+                return c["local_cpulist"]
+        return None
+
+
+# ------------------------------------------------------------------------------
+
+if __name__ == "__main__":
+    topo = GaudiTopology()
+    for card in topo.get_cards():
+        print(f"Card {card['card_id']} ({card['model']}):")
+        print(f"  Bus ID     : {card['bus_id']}")
+        print(f"  NUMA Node  : {card['numa_node']}")
+        print(f"  Local CPUs : {card['local_cpulist']}")
+    print()
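Note that `local_cpulist` comes back from sysfs as a range string (the standard cpulist format, e.g. "0-5,48-53"). A small helper like the following, which is not part of the commit, expands such a string into the individual CPU IDs that binder-style code consumes:

```python
# Hypothetical helper for expanding a sysfs cpulist string into CPU IDs.
def parse_cpulist(cpulist: str) -> list[int]:
    """Expand a cpulist string such as '0-5,48-53' into a list of CPU IDs."""
    cpus: list[int] = []
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus

print(parse_cpulist("0-2,48-50"))  # [0, 1, 2, 48, 49, 50]
```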
