A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
If you use Semi-PD for your research, please cite our paper:
```bibtex
@misc{hong2025semipd,
      title={semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage},
      author={Ke Hong and Lufang Chen and Zhong Wang and Xiuhong Li and Qiuli Mao and Jianping Ma and Chao Xiong and Guanyu Wu and Buhe Han and Guohao Dai and Yun Liang and Yu Wang},
      year={2025},
      eprint={2504.19867},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
This repository originally started as a fork of the SGLang project. Semi-PD is a research prototype and does not have complete feature parity with open-source SGLang. We have retained only the most critical features and adapted the codebase for faster research iteration.
```bash
# set up the semi-pd conda environment
conda create -n semi_pd -y python=3.11
conda activate semi_pd

# use the latest release branch
git clone [email protected]:infinigence/Semi-PD.git
cd Semi-PD
pip install --upgrade pip

# build the IPC dependency
cd ./semi-pd-ipc/
pip install -e .

# build Semi-PD (NVIDIA GPUs)
cd ..
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
```

For AMD GPUs, build the ROCm kernels instead (starting from the repository root):

```bash
# build Semi-PD (AMD GPUs)
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
```
Alternatively, you can set up the base environment with the prebuilt Docker images below, or build from the Dockerfile.

For NVIDIA GPUs:

```bash
docker pull lmsysorg/sglang:v0.4.4.post1-cu124
docker run -it --gpus all -p 30000:30000 -v /your/path:/your/path --ipc=host --name semi_pd lmsysorg/sglang:v0.4.4.post1-cu124
docker exec -it semi_pd bash
```

For AMD GPUs:

```bash
docker pull lmsysorg/sglang:v0.4.4.post1-rocm630
docker run -it --device=/dev/kfd --device=/dev/dri --shm-size=32g -p 30000:30000 -v /your/path:/your/path --ipc=host --name semi_pd lmsysorg/sglang:v0.4.4.post1-rocm630
docker exec -it semi_pd bash
```
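If you prefer building the image yourself, a minimal sketch looks like this (the Dockerfile path is an assumption; check the repository for its actual location):

```bash
# build a local image from the repository's Dockerfile (path assumed)
docker build -t semi_pd:latest -f docker/Dockerfile .
```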
Then you can follow the Build && Install section to build Semi-PD inside the container.
The implementation of compute isolation is based on Multi-Process Service (MPS). On NVIDIA GPUs, the MPS service must be enabled manually, whereas on AMD GPUs it is enabled by default. To enable MPS on NVIDIA GPUs:
```bash
export CUDA_MPS_ENABLE_PER_CTX_DEVICE_MULTIPROCESSOR_PARTITIONING=1
nvidia-cuda-mps-control -d
```
You can disable the MPS service with:
```bash
echo quit | sudo nvidia-cuda-mps-control
```
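To check that the MPS control daemon is responding, you can query it for active MPS servers (the list may be empty until a client process starts):

```bash
# list the PIDs of active MPS servers; an empty result still confirms the daemon is up
echo get_server_list | nvidia-cuda-mps-control
```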
Semi-PD can be enabled with the `--enable-semi-pd` flag. Note that our implementation does not share activations between the prefill and decode phases, which may result in slightly higher memory usage than the original SGLang. If an out-of-memory error occurs, consider reducing the value of `--mem-fraction-static` to relieve memory pressure.
```bash
python3 -m sglang.launch_server \
    --model-path $MODEL_PATH --served-model-name $MODEL_NAME \
    --host 0.0.0.0 --port $SERVE_PORT --trust-remote-code --disable-radix-cache \
    --enable-semi-pd --mem-fraction-static 0.85 --tp $TP_SIZE
```
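Once the server is up, a quick smoke test (a sketch assuming the `$SERVE_PORT` and `$MODEL_NAME` values from the launch command above) is to send a request to the OpenAI-compatible endpoint:

```bash
# send a single chat completion request to the running Semi-PD server
curl http://localhost:$SERVE_PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": 32}"
```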
Please refer to the `evaluation` directory.
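As a starting point, SGLang's serving benchmark can also be run directly against the launched server (a sketch; flags assumed from upstream `sglang.bench_serving`):

```bash
# benchmark the running server with 200 prompts at 4 requests/s
python3 -m sglang.bench_serving --backend sglang --port $SERVE_PORT \
    --num-prompts 200 --request-rate 4
```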