
Conversation

@plusNew001
Collaborator

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but their meaning must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests, or explain in this PR why no unit tests are included.
  • Provide accuracy results.
  • If this PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] tag in the PR title.

@paddle-bot

paddle-bot bot commented Nov 6, 2025

Thanks for your contribution!

Copilot AI review requested due to automatic review settings November 7, 2025 07:32
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements dynamic port allocation based on GPU_ID to support parallel CI test execution on different GPU sets. The changes enable running tests on either GPU set 0-3 (GPU_ID=0) or GPU set 4-7 (GPU_ID=4) with isolated port ranges to avoid conflicts.

  • Modified CI workflow to set GPU_ID based on runner name
  • Updated shell script to dynamically calculate ports based on GPU_ID
  • Modified Python test files to use GPU_ID environment variable for port calculation
  • Changed XPU device allocation from using all 8 GPUs to using 4 GPUs per test run

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Summary per file:

  • .github/workflows/ci_xpu.yml: Maps the runner name to GPU_ID (4 when the runner name ends in "1", otherwise 0) and passes it as an environment variable instead of individual port variables.
  • scripts/run_ci_xpu.sh: Implements dynamic port calculation (base_port = 8188 + GPU_ID * 100), configures XPU device visibility based on GPU_ID, and reduces tensor-parallel-size from 8 to 4; see the sketch below.
  • tests/ci_use/XPU_45T/run_w4a8.py: Adds GPU_ID-based dynamic port calculation for the service HTTP port.
  • tests/ci_use/XPU_45T/run_45vl.py: Adds GPU_ID-based dynamic port calculation for the service HTTP port.
  • tests/ci_use/XPU_45T/run_45T.py: Adds GPU_ID-based dynamic port calculation for the service HTTP port and updates the base response text.
  • tests/ci_use/XPU_45T/run_ep.py: Adds GPU_ID-based dynamic port calculation for the engine worker queue ports.
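
A minimal shell sketch of the port/device selection described above for scripts/run_ci_xpu.sh, assuming the GPU_ID and XPU_VISIBLE_DEVICES variables quoted in the diff hunks below; this is illustrative, not the PR's exact code:

# Illustrative sketch, not the exact PR script.
GPU_ID=${GPU_ID:-0}                    # 0 -> XPUs 0-3, 4 -> XPUs 4-7 (set by the CI workflow)
base_port=$((8188 + GPU_ID * 100))     # 8188 for GPU_ID=0, 8588 for GPU_ID=4
if [[ "$GPU_ID" == "0" ]]; then
    export XPU_VISIBLE_DEVICES="0,1,2,3"
else
    export XPU_VISIBLE_DEVICES="4,5,6,7"
fi
echo "API server port: $base_port, visible devices: $XPU_VISIBLE_DEVICES"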

ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
lsof -t -i :8188 | xargs kill -9 || true
}

Copilot AI Nov 7, 2025


The stop_processes function still searches for and kills processes using hardcoded port 8188. When GPU_ID != 0, services will run on different ports (e.g., 8288 for GPU_ID=4), so these cleanup commands won't properly stop those processes.

Consider making the port pattern dynamic or using broader process matching:

function stop_processes() {
    ps -efww | grep -E 'cache_transfer_manager.py' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
    ps -efww | grep -E 'api_server' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
    # Kill based on GPU_ID specific port if set
    if [[ -n "$GPU_ID" ]]; then
        local port=$((8188 + GPU_ID * 100))
        ps -efww | grep -E "$port" | grep -v grep | awk '{print $2}' | xargs kill -9 || true
        lsof -t -i :$port | xargs kill -9 || true
    else
        ps -efww | grep -E '8188' | grep -v grep | awk '{print $2}' | xargs kill -9 || true
        lsof -t -i :8188 | xargs kill -9 || true
    fi
}

Comment on lines 65 to 66
else
gpu_id="0"

Copilot AI Nov 7, 2025


The GPU_ID assignment depends on a runner naming convention that the shell script implicitly assumes: the workflow sets gpu_id="4" when the last character of the runner name is "1", and the shell script then uses GPU_ID in arithmetic that expects values like 0 or 4.

This means:

  • When last_char == "1": GPU_ID=4, ports will be 8188 + 4*100 = 8588
  • When last_char != "1": GPU_ID=0, ports will be 8188 + 0*100 = 8188

However, there's a potential issue: if the runner name ends with any character other than "1", GPU_ID will always be 0, which could cause port conflicts if multiple such runners execute simultaneously.

Consider documenting this behavior or adding validation to ensure only expected runner names are used.

Suggested change
-else
-    gpu_id="0"
+elif [[ "$last_char" == "0" ]]; then
+    gpu_id="0"
+else
+    echo "Error: Unexpected runner name '$runner_name'. The last character must be '0' or '1'."
+    exit 1
--port $port_num \
--engine-worker-queue-port $((port_num + 1)) \
--metrics-port $((port_num + 2)) \
--cache-queue-port $((port_num + 47873)) \

Copilot AI Nov 7, 2025


The cache queue port calculation $((port_num + 47873)) could result in very large port numbers that might exceed the valid port range (0-65535).

For example:

  • When GPU_ID=0: port_num=8188, cache-queue-port = 8188 + 47873 = 56061 (valid)
  • When GPU_ID=4: port_num=8588, cache-queue-port = 8588 + 47873 = 56461 (valid)

While this works for the current GPU_ID values (0 and 4), if GPU_ID ever increases or the base port changes, this could exceed 65535. Consider using a different offset or documenting the maximum supported GPU_ID value.
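
As a hedged sanity check of that headroom (the variable names base and offset below are illustrative, not from the PR), the largest GPU_ID this scheme supports before the cache-queue port exceeds 65535 is:

base=8188; offset=47873
# Require base + GPU_ID*100 + offset <= 65535, i.e. GPU_ID <= (65535 - base - offset) / 100.
max_gpu_id=$(( (65535 - base - offset) / 100 ))
echo "Maximum supported GPU_ID with this scheme: $max_gpu_id"   # prints 94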

else
export XPU_VISIBLE_DEVICES="4,5,6,7"
fi
xpu-smi -r -i $XPU_VISIBLE_DEVICES

Copilot AI Nov 7, 2025


The XPU reset command xpu-smi -r -i $XPU_VISIBLE_DEVICES uses the value of XPU_VISIBLE_DEVICES directly. According to the XPU-SMI documentation, the -i flag typically expects device indices (e.g., "0,1,2,3"), which matches the format of XPU_VISIBLE_DEVICES, so this should work correctly. However, consider verifying that the reset succeeds before proceeding, since a failed reset could cause subsequent tests to fail in unexpected ways.

Consider adding error checking:

xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; }
Suggested change
-xpu-smi -r -i $XPU_VISIBLE_DEVICES
+xpu-smi -r -i $XPU_VISIBLE_DEVICES || { echo "XPU reset failed"; exit 1; }

data_parallel_size = 1

engine_worker_queue_port = [str(8023 + i * 10) for i in range(data_parallel_size)]
gpu_id = int(os.getenv("GPU_ID", "0"))

Copilot AI Nov 7, 2025


The base_port calculation is consistent with the pattern used in other test files (8188 for GPU_ID=0). However, this differs from the base port 8023 used before the change. Consider documenting why different tests use different base ports:

  • run_45T.py, run_w4a8.py, run_45vl.py: base 8188
  • run_ep.py: base 8023

This could be intentional for avoiding port conflicts between different test types, but it should be documented to prevent confusion.

Suggested change
 gpu_id = int(os.getenv("GPU_ID", "0"))
+# Note: This test uses base_port=8023 (vs. 8188 in run_45T.py, run_w4a8.py, run_45vl.py).
+# This is intentional to avoid port conflicts between different test types.
+# If you modify the base port here, ensure it does not overlap with other test scripts.
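
For context, a hedged Python sketch of how run_ep.py's engine worker queue ports could incorporate GPU_ID while keeping the separate 8023 base; the GPU_ID offset used here mirrors the 8188 + GPU_ID * 100 pattern and is an assumption, not the PR's exact code:

import os

data_parallel_size = 1
gpu_id = int(os.getenv("GPU_ID", "0"))   # 0 or 4, exported by the CI workflow
base_port = 8023 + gpu_id * 100          # assumed to mirror the 8188 + GPU_ID * 100 scheme
# One engine worker queue port per data-parallel rank, spaced 10 apart.
engine_worker_queue_port = [str(base_port + i * 10) for i in range(data_parallel_size)]
print(engine_worker_queue_port)          # ['8023'] for GPU_ID=0, ['8423'] for GPU_ID=4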

@plusNew001 plusNew001 closed this Nov 7, 2025
@plusNew001 plusNew001 deleted the ci-change-multi branch November 7, 2025 09:42
@plusNew001 plusNew001 restored the ci-change-multi branch November 7, 2025 09:42
@plusNew001 plusNew001 reopened this Nov 7, 2025
@plusNew001 plusNew001 merged commit 3665c28 into PaddlePaddle:develop Nov 10, 2025
13 of 15 checks passed
@plusNew001 plusNew001 deleted the ci-change-multi branch November 13, 2025 11:57
