[XPU] [CI]Change CI to multi-concurrency #4866
Conversation
Updated GPU ID assignment logic and removed unused port calculations.
Thanks for your contribution!
Added import for os module to use environment variables.
Pull Request Overview
This PR implements dynamic port allocation based on GPU_ID to support parallel CI test execution on different GPU sets. The changes enable running tests on either GPU set 0-3 (GPU_ID=0) or GPU set 4-7 (GPU_ID=4) with isolated port ranges to avoid conflicts.
- Modified CI workflow to set GPU_ID based on runner name
- Updated shell script to dynamically calculate ports based on GPU_ID
- Modified Python test files to use GPU_ID environment variable for port calculation
- Changed XPU device allocation from using all 8 GPUs to using 4 GPUs per test run
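For orientation, here is a minimal sketch of how this GPU_ID-driven isolation fits together; the structure and variable handling are illustrative assumptions, not the PR's exact code in scripts/run_ci_xpu.sh:

```bash
# Illustrative sketch (assumed structure, not the PR's actual script):
# the workflow exports GPU_ID, and the script derives ports and device visibility from it.
GPU_ID="${GPU_ID:-0}"                     # 0 or 4, set by the CI workflow per runner
base_port=$((8188 + GPU_ID * 100))        # 8188 when GPU_ID=0, 8588 when GPU_ID=4
if [[ "$GPU_ID" == "0" ]]; then
    export XPU_VISIBLE_DEVICES="0,1,2,3"  # GPU set 0-3
else
    export XPU_VISIBLE_DEVICES="4,5,6,7"  # GPU set 4-7
fi
echo "GPU_ID=$GPU_ID -> devices $XPU_VISIBLE_DEVICES, HTTP port base $base_port"
```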
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/ci_xpu.yml | Maps runner name to GPU_ID (4 for runner ending in "1", otherwise 0) and passes it as environment variable instead of individual port variables |
| scripts/run_ci_xpu.sh | Implements dynamic port calculation (base_port = 8188 + GPU_ID * 100), configures XPU device visibility based on GPU_ID, and reduces tensor-parallel-size from 8 to 4 |
| tests/ci_use/XPU_45T/run_w4a8.py | Adds GPU_ID-based dynamic port calculation for service HTTP port |
| tests/ci_use/XPU_45T/run_45vl.py | Adds GPU_ID-based dynamic port calculation for service HTTP port |
| tests/ci_use/XPU_45T/run_45T.py | Adds GPU_ID-based dynamic port calculation for service HTTP port and updates base response text |
| tests/ci_use/XPU_45T/run_ep.py | Adds GPU_ID-based dynamic port calculation for engine worker queue ports |
ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
lsof -t -i :8188 | xargs kill -9 || true
}
Copilot AI · Nov 7, 2025
The stop_processes function still searches for and kills processes using the hardcoded port 8188. When GPU_ID != 0, services will run on different ports (e.g., 8588 for GPU_ID=4), so these cleanup commands won't properly stop those processes.
Consider making the port pattern dynamic or using broader process matching:
function stop_processes() {
    ps -efww | grep -E 'cache_transfer_manager.py' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
    ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
    # Kill based on GPU_ID-specific port if set
    if [[ -n "$GPU_ID" ]]; then
        local port=$((8188 + GPU_ID * 100))
        ps -efww | grep -E "$port" | grep -v grep | awk '{print $2}' | xargs kill -9 || true
        lsof -t -i :$port | xargs kill -9 || true
    else
        ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
        lsof -t -i :8188 | xargs kill -9 || true
    fi
}

else
    gpu_id="0"
Copilot AI · Nov 7, 2025
The GPU_ID assignment logic seems inconsistent with the shell script expectations. The workflow sets gpu_id="4" when the last character of the runner name is "1", but in the shell script, GPU_ID is used in arithmetic operations expecting values like 0 or 4.
This means:
- When last_char == "1": GPU_ID=4, ports will be 8188 + 4*100 = 8588
- When last_char != "1": GPU_ID=0, ports will be 8188 + 0*100 = 8188
However, there's a potential issue: if the runner name ends with digits 2, 3, or other values, GPU_ID will always be 0, which could cause port conflicts if multiple runners execute simultaneously.
Consider documenting this behavior or adding validation to ensure only expected runner names are used.
Suggested change:
-else
-    gpu_id="0"
+elif [[ "$last_char" == "0" ]]; then
+    gpu_id="0"
+else
+    echo "Error: Unexpected runner name '$runner_name'. The last character must be '0' or '1'."
+    exit 1
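Putting the mapping and the suggested validation together, a minimal sketch of the whole workflow step might look like the following; RUNNER_NAME and the exact step layout in .github/workflows/ci_xpu.yml are assumptions, not the PR's actual code:

```bash
# Hypothetical workflow step: derive GPU_ID from the self-hosted runner's name.
runner_name="${RUNNER_NAME:?RUNNER_NAME is not set}"   # default GitHub Actions runner env var
last_char="${runner_name: -1}"
if [[ "$last_char" == "1" ]]; then
    gpu_id="4"   # runner ...1 -> GPU set 4-7, port base 8588
elif [[ "$last_char" == "0" ]]; then
    gpu_id="0"   # runner ...0 -> GPU set 0-3, port base 8188
else
    echo "Error: Unexpected runner name '$runner_name'. The last character must be '0' or '1'."
    exit 1
fi
echo "GPU_ID=$gpu_id" >> "$GITHUB_ENV"   # expose GPU_ID to later steps
```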
--port $port_num \
--engine-worker-queue-port $((port_num + 1)) \
--metrics-port $((port_num + 2)) \
--cache-queue-port $((port_num + 47873)) \
Copilot AI · Nov 7, 2025
The cache queue port calculation $((port_num + 47873)) could result in very large port numbers that might exceed the valid port range (0-65535).
For example:
- When GPU_ID=0: port_num=8188, cache-queue-port = 8188 + 47873 = 56061 (valid)
- When GPU_ID=4: port_num=8588, cache-queue-port = 8588 + 47873 = 56461 (valid)
While this works for the current GPU_ID values (0 and 4), if GPU_ID ever increases or the base port changes, this could exceed 65535. Consider using a different offset or documenting the maximum supported GPU_ID value.
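A minimal sketch of such a guard, assuming port_num is derived as described above (illustrative, not part of the PR):

```bash
# Hypothetical range check before launching the service.
port_num=$((8188 + ${GPU_ID:-0} * 100))   # same derivation as in the script under review
cache_queue_port=$((port_num + 47873))
if (( cache_queue_port > 65535 )); then
    echo "Error: cache-queue-port $cache_queue_port exceeds 65535; lower GPU_ID or the 47873 offset." >&2
    exit 1
fi
```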
else
    export XPU_VISIBLE_DEVICES="4,5,6,7"
fi
xpu-smi -r -i $XPU_VISIBLE_DEVICES
Copilot AI · Nov 7, 2025
The XPU reset command xpu-smi -r -i $XPU_VISIBLE_DEVICES uses the value of XPU_VISIBLE_DEVICES directly. According to the XPU-SMI documentation, the -i flag typically expects device indices (e.g., "0,1,2,3"), which matches the format of XPU_VISIBLE_DEVICES, so this should work correctly. Still, consider verifying that the reset succeeds before proceeding, as a failed reset could cause subsequent tests to fail in unexpected ways.
Consider adding error checking:
xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; }

Suggested change:
-xpu-smi -r -i $XPU_VISIBLE_DEVICES
+xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; }
data_parallel_size = 1

engine_worker_queue_port = [str(8023 + i * 10) for i in range(data_parallel_size)]
gpu_id = int(os.getenv("GPU_ID", "0"))
Copilot AI · Nov 7, 2025
The base_port calculation is consistent with the pattern used in other test files (8188 for GPU_ID=0). However, this differs from the base port 8023 used before the change. Consider documenting why different tests use different base ports:
- run_45T.py, run_w4a8.py, run_45vl.py: base 8188
- run_ep.py: base 8023
This could be intentional for avoiding port conflicts between different test types, but it should be documented to prevent confusion.
Suggested change:
 gpu_id = int(os.getenv("GPU_ID", "0"))
+# Note: This test uses base_port=8023 (vs. 8188 in run_45T.py, run_w4a8.py, run_45vl.py).
+# This is intentional to avoid port conflicts between different test types.
+# If you modify the base port here, ensure it does not overlap with other test scripts.
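If useful, a tiny illustrative check of the two port families; it assumes run_ep.py scales its 8023 base by GPU_ID the same way the other tests scale 8188, which should be verified against the actual test code:

```bash
# Illustrative only: print the assumed port bases per GPU_ID to eyeball for collisions.
for gpu_id in 0 4; do
    echo "GPU_ID=$gpu_id: run_45T/run_w4a8/run_45vl base $((8188 + gpu_id * 100)), run_ep base $((8023 + gpu_id * 100))"
done
```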
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
- Run pre-commit before commit.
- For PRs to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.