[XPU] [CI]Change CI to multi-concurrency #4866
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -9,13 +9,27 @@ apt install -y lsof | |||||
| function stop_processes() { | ||||||
| ps -efww | grep -E 'cache_transfer_manager.py' | grep -v grep | awk '{print $2}' | xargs kill -9 || true | ||||||
| ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true | ||||||
| ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true | ||||||
| lsof -t -i :8188 | xargs kill -9 || true | ||||||
| ps -efww | grep -E "$((8188 + GPU_ID * 100))" | grep -v grep | awk '{print $2}' | xargs kill -9 || true | ||||||
| lsof -t -i :$((8188 + GPU_ID * 100)) | xargs kill -9 || true | ||||||
| } | ||||||
|
||||||
| stop_processes | ||||||
|
|
||||||
| # Set the model path | ||||||
| export model_path=${MODEL_PATH}/ERNIE-4.5-300B-A47B-Paddle | ||||||
| # Due to machine issues, reset the cards in use to make sure they work properly | ||||||
| if [[ "$GPU_ID" == "0" ]]; then | ||||||
| export XPU_VISIBLE_DEVICES="0,1,2,3" | ||||||
| else | ||||||
| export XPU_VISIBLE_DEVICES="4,5,6,7" | ||||||
| fi | ||||||
|
|
||||||
| mkdir -p /workspace/deps | ||||||
| cd /workspace/deps | ||||||
| wget -q https://klx-sdk-release-public.su.bcebos.com/xre/kl3-release/5.0.21.21/xre-Linux-x86_64-5.0.21.21.tar.gz | ||||||
| tar -zxf xre-Linux-x86_64-5.0.21.21.tar.gz && mv xre-Linux-x86_64-5.0.21.21 xre | ||||||
| cd - | ||||||
| export PATH=/workspace/deps/xre/bin:$PATH | ||||||
|
|
||||||
| xpu-smi -r -i $XPU_VISIBLE_DEVICES | ||||||
|
||||||
| xpu-smi -r -i $XPU_VISIBLE_DEVICES | |
| xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; } |
Copilot
AI
Nov 7, 2025
The cache queue port calculation $((port_num + 47873)) can produce a port number above the valid range (0-65535) if port_num grows large enough.
For example:
- When GPU_ID=0: port_num=8188, cache-queue-port = 8188 + 47873 = 56061 (valid)
- When GPU_ID=4: port_num=8588, cache-queue-port = 8588 + 47873 = 56461 (valid)
This works for the current GPU_ID values (0 and 4), but a larger GPU_ID or a higher base port could push the result past 65535. Consider using a different offset or documenting the maximum supported GPU_ID value.
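The arithmetic in this comment can be checked with a small sketch. It is only an illustration, not code from the PR: the function name `cache_queue_port` and the constant names are hypothetical, while the values 8188, 100, and 47873 come from the shell script and the comment above.

```python
MAX_PORT = 65535             # highest valid TCP/UDP port number
BASE_PORT = 8188             # base port used in the shell script
PORT_STRIDE = 100            # per-GPU_ID spacing, as in $((8188 + GPU_ID * 100))
CACHE_QUEUE_OFFSET = 47873   # offset quoted in the review comment


def cache_queue_port(gpu_id: int) -> int:
    """Return the cache queue port for gpu_id, or raise if it is out of range.

    Hypothetical helper: it mirrors the arithmetic described in the review
    comment and adds the range check that the comment asks for.
    """
    if gpu_id < 0:
        raise ValueError(f"GPU_ID must be non-negative, got {gpu_id}")
    port_num = BASE_PORT + gpu_id * PORT_STRIDE
    port = port_num + CACHE_QUEUE_OFFSET
    if port > MAX_PORT:
        raise ValueError(
            f"cache queue port {port} for GPU_ID={gpu_id} exceeds {MAX_PORT}"
        )
    return port


if __name__ == "__main__":
    for gpu_id in (0, 4):
        print(gpu_id, cache_queue_port(gpu_id))
```

With these constants the check first fails at GPU_ID=95, so GPU_ID values up to 94 would stay within the valid range under the same assumptions.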
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -37,8 +37,9 @@ def test_fd_ep(): | |||||||||||
| else: | ||||||||||||
| tensor_parallel_size = xpu_device_num | ||||||||||||
| data_parallel_size = 1 | ||||||||||||
|
|
||||||||||||
| engine_worker_queue_port = [str(8023 + i * 10) for i in range(data_parallel_size)] | ||||||||||||
| gpu_id = int(os.getenv("GPU_ID", "0")) | ||||||||||||
|
||||||||||||
| gpu_id = int(os.getenv("GPU_ID", "0")) | |
| gpu_id = int(os.getenv("GPU_ID", "0")) | |
| # Note: This test uses base_port=8023 (vs. 8188 in run_45T.py, run_w4a8.py, run_45vl.py). | |
| # This is intentional to avoid port conflicts between different test types. | |
| # If you modify the base port here, ensure it does not overlap with other test scripts. |
The GPU_ID assignment logic seems inconsistent with the shell script expectations. The workflow sets gpu_id="4" when the last character of the runner name is "1", but in the shell script, GPU_ID is used in arithmetic operations expecting values like 0 or 4.
This means:
- last_char == "1": GPU_ID=4, ports will be 8188 + 4*100 = 8588
- last_char != "1": GPU_ID=0, ports will be 8188 + 0*100 = 8188
However, there's a potential issue: if the runner name ends with digits 2, 3, or other values, GPU_ID will always be 0, which could cause port conflicts if multiple runners execute simultaneously.
Consider documenting this behavior or adding validation to ensure only expected runner names are used.
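One way to add the validation suggested here is sketched below. It is an assumption-laden illustration rather than the workflow's actual logic: the names `RUNNER_SUFFIX_TO_GPU_ID`, `gpu_id_for_runner`, and `base_port` are hypothetical, and treating a trailing "0" as the only suffix that maps to GPU_ID=0 is an assumption (the workflow described above maps every suffix other than "1" to 0).

```python
# Hypothetical explicit mapping from the runner name's last character to GPU_ID.
# "1" -> 4 matches the workflow described above; "0" -> 0 is an assumption.
RUNNER_SUFFIX_TO_GPU_ID = {"0": 0, "1": 4}


def gpu_id_for_runner(runner_name: str) -> int:
    """Map a runner name to a GPU_ID, rejecting unexpected suffixes.

    Failing loudly avoids silently assigning GPU_ID=0 to several runners,
    which would make them compete for the same ports.
    """
    if not runner_name:
        raise ValueError("runner name is empty")
    last_char = runner_name[-1]
    if last_char not in RUNNER_SUFFIX_TO_GPU_ID:
        raise ValueError(
            f"unexpected runner name {runner_name!r}: "
            f"suffix {last_char!r} has no GPU_ID mapping"
        )
    return RUNNER_SUFFIX_TO_GPU_ID[last_char]


def base_port(gpu_id: int) -> int:
    """Base port as computed in the shell script: 8188 + GPU_ID * 100."""
    return 8188 + gpu_id * 100


if __name__ == "__main__":
    # Runner names here are made up for the example.
    for name in ("xpu-runner-0", "xpu-runner-1"):
        gpu_id = gpu_id_for_runner(name)
        print(name, gpu_id, base_port(gpu_id))
```

An explicit table makes the supported runner names visible in one place, so adding a third runner requires a deliberate change instead of falling through to GPU_ID=0.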