[XPU] [CI]Change CI to multi-concurrency #4866
Conversation
Updated GPU ID assignment logic and removed unused port calculations.
Thanks for your contribution!
Added import for os module to use environment variables.
Pull Request Overview
This PR implements dynamic port allocation based on GPU_ID to support parallel CI test execution on different GPU sets. The changes enable running tests on either GPU set 0-3 (GPU_ID=0) or GPU set 4-7 (GPU_ID=4) with isolated port ranges to avoid conflicts.
- Modified CI workflow to set GPU_ID based on runner name
- Updated shell script to dynamically calculate ports based on GPU_ID
- Modified Python test files to use GPU_ID environment variable for port calculation
- Changed XPU device allocation from using all 8 GPUs to using 4 GPUs per test run
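For orientation, here is a minimal sketch of how this GPU_ID-driven isolation fits together; the structure and variable handling are illustrative assumptions, not the PR's exact code in scripts/run_ci_xpu.sh:

```bash
# Illustrative sketch (assumed structure, not the PR's actual script):
# the workflow exports GPU_ID, and the script derives ports and device visibility from it.
GPU_ID="${GPU_ID:-0}"                     # 0 or 4, set by the CI workflow per runner
base_port=$((8188 + GPU_ID * 100))        # 8188 when GPU_ID=0, 8588 when GPU_ID=4
if [[ "$GPU_ID" == "0" ]]; then
    export XPU_VISIBLE_DEVICES="0,1,2,3"  # GPU set 0-3
else
    export XPU_VISIBLE_DEVICES="4,5,6,7"  # GPU set 4-7
fi
echo "GPU_ID=$GPU_ID -> devices $XPU_VISIBLE_DEVICES, HTTP port base $base_port"
```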
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/ci_xpu.yml | Maps runner name to GPU_ID (4 for runner ending in "1", otherwise 0) and passes it as environment variable instead of individual port variables |
| scripts/run_ci_xpu.sh | Implements dynamic port calculation (base_port = 8188 + GPU_ID * 100), configures XPU device visibility based on GPU_ID, and reduces tensor-parallel-size from 8 to 4 |
| tests/ci_use/XPU_45T/run_w4a8.py | Adds GPU_ID-based dynamic port calculation for service HTTP port |
| tests/ci_use/XPU_45T/run_45vl.py | Adds GPU_ID-based dynamic port calculation for service HTTP port |
| tests/ci_use/XPU_45T/run_45T.py | Adds GPU_ID-based dynamic port calculation for service HTTP port and updates base response text |
| tests/ci_use/XPU_45T/run_ep.py | Adds GPU_ID-based dynamic port calculation for engine worker queue ports |
ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
lsof -t -i :8188 | xargs kill -9 || true
}
Copilot AI · Nov 7, 2025
The stop_processes function still searches for and kills processes using the hardcoded port 8188. When GPU_ID != 0, services will run on different ports (e.g., 8588 for GPU_ID=4), so these cleanup commands won't properly stop those processes.
Consider making the port pattern dynamic or using broader process matching:
function stop_processes() {
    ps -efww | grep -E 'cache_transfer_manager.py' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
    ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
    # Kill based on GPU_ID-specific port if set
    if [[ -n "$GPU_ID" ]]; then
        local port=$((8188 + GPU_ID * 100))
        ps -efww | grep -E "$port" | grep -v grep | awk '{print $2}' | xargs kill -9 || true
        lsof -t -i :$port | xargs kill -9 || true
    else
        ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
        lsof -t -i :8188 | xargs kill -9 || true
    fi
}

else
    gpu_id="0"
Copilot AI · Nov 7, 2025
The GPU_ID assignment logic seems inconsistent with the shell script expectations. The workflow sets gpu_id="4" when the last character of the runner name is "1", but in the shell script, GPU_ID is used in arithmetic operations expecting values like 0 or 4.
This means:
- When last_char == "1": GPU_ID=4, ports will be 8188 + 4*100 = 8588
- When last_char != "1": GPU_ID=0, ports will be 8188 + 0*100 = 8188
However, there's a potential issue: if the runner name ends with digits 2, 3, or other values, GPU_ID will always be 0, which could cause port conflicts if multiple runners execute simultaneously.
Consider documenting this behavior or adding validation to ensure only expected runner names are used.
Suggested change:
-else
-    gpu_id="0"
+elif [[ "$last_char" == "0" ]]; then
+    gpu_id="0"
+else
+    echo "Error: Unexpected runner name '$runner_name'. The last character must be '0' or '1'."
+    exit 1
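Putting the mapping and the suggested validation together, a minimal sketch of the whole workflow step might look like the following; RUNNER_NAME and the exact step layout in .github/workflows/ci_xpu.yml are assumptions, not the PR's actual code:

```bash
# Hypothetical workflow step: derive GPU_ID from the self-hosted runner's name.
runner_name="${RUNNER_NAME:?RUNNER_NAME is not set}"   # default GitHub Actions runner env var
last_char="${runner_name: -1}"
if [[ "$last_char" == "1" ]]; then
    gpu_id="4"   # runner ...1 -> GPU set 4-7, port base 8588
elif [[ "$last_char" == "0" ]]; then
    gpu_id="0"   # runner ...0 -> GPU set 0-3, port base 8188
else
    echo "Error: Unexpected runner name '$runner_name'. The last character must be '0' or '1'."
    exit 1
fi
echo "GPU_ID=$gpu_id" >> "$GITHUB_ENV"   # expose GPU_ID to later steps
```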
--port $port_num \
--engine-worker-queue-port $((port_num + 1)) \
--metrics-port $((port_num + 2)) \
--cache-queue-port $((port_num + 47873)) \
Copilot AI · Nov 7, 2025
The cache queue port calculation $((port_num + 47873)) could result in very large port numbers that might exceed the valid port range (0-65535).
For example:
- When GPU_ID=0: port_num=8188, cache-queue-port = 8188 + 47873 = 56061 (valid)
- When GPU_ID=4: port_num=8588, cache-queue-port = 8588 + 47873 = 56461 (valid)
While this works for the current GPU_ID values (0 and 4), if GPU_ID ever increases or the base port changes, this could exceed 65535. Consider using a different offset or documenting the maximum supported GPU_ID value.
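A minimal sketch of such a guard, assuming port_num is derived as described above (illustrative, not part of the PR):

```bash
# Hypothetical range check before launching the service.
port_num=$((8188 + ${GPU_ID:-0} * 100))   # same derivation as in the script under review
cache_queue_port=$((port_num + 47873))
if (( cache_queue_port > 65535 )); then
    echo "Error: cache-queue-port $cache_queue_port exceeds 65535; lower GPU_ID or the 47873 offset." >&2
    exit 1
fi
```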
else
    export XPU_VISIBLE_DEVICES="4,5,6,7"
fi
xpu-smi -r -i $XPU_VISIBLE_DEVICES
Copilot AI · Nov 7, 2025
The XPU reset command xpu-smi -r -i $XPU_VISIBLE_DEVICES uses the value of XPU_VISIBLE_DEVICES directly. According to the XPU-SMI documentation, the -i flag typically expects device indices (e.g., "0,1,2,3"), which matches the format of XPU_VISIBLE_DEVICES, so this should work correctly. Still, consider verifying that the reset succeeds before proceeding, as a failed reset could cause subsequent tests to fail in unexpected ways.
Consider adding error checking:
xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; }

Suggested change:
-xpu-smi -r -i $XPU_VISIBLE_DEVICES
+xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; }
data_parallel_size = 1

engine_worker_queue_port = [str(8023 + i * 10) for i in range(data_parallel_size)]
gpu_id = int(os.getenv("GPU_ID", "0"))
Copilot AI · Nov 7, 2025
The base_port calculation is consistent with the pattern used in other test files (8188 for GPU_ID=0). However, this differs from the base port 8023 used before the change. Consider documenting why different tests use different base ports:
- run_45T.py, run_w4a8.py, run_45vl.py: base 8188
- run_ep.py: base 8023
This could be intentional for avoiding port conflicts between different test types, but it should be documented to prevent confusion.
Suggested change:
 gpu_id = int(os.getenv("GPU_ID", "0"))
+# Note: This test uses base_port=8023 (vs. 8188 in run_45T.py, run_w4a8.py, run_45vl.py).
+# This is intentional to avoid port conflicts between different test types.
+# If you modify the base port here, ensure it does not overlap with other test scripts.
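If useful, a tiny illustrative check of the two port families; it assumes run_ep.py scales its 8023 base by GPU_ID the same way the other tests scale 8188, which should be verified against the actual test code:

```bash
# Illustrative only: print the assumed port bases per GPU_ID to eyeball for collisions.
for gpu_id in 0 4; do
    echo "GPU_ID=$gpu_id: run_45T/run_w4a8/run_45vl base $((8188 + gpu_id * 100)), run_ep base $((8023 + gpu_id * 100))"
done
```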
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
- Run pre-commit before commit.
- For PRs to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.