feat: Low Precision Allreduce for PCIe based GPU #3851

kanghui0204 · 2025-04-25T04:16:55Z

Low Precision Allreduce for PCIe based GPU

[feat] Support low-precision allreduce for PCIe-based GPUs

Description

This PR adds a custom allreduce to TensorRT-LLM. It targets PCIe-based GPUs and quantizes the data to low precision before communication, which can speed up allreduce over PCIe.
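To illustrate the general idea of a quantized allreduce, here is a minimal CPU-side sketch in pure Python. It is not the PR's CUDA kernel: the symmetric int8-style format, the one-scale-per-buffer scheme, and the names `quantize`, `dequantize`, and `low_precision_allreduce` are all assumptions made for illustration. The point is only that each rank sends a compact quantized payload and the reduction is accumulated in full precision after dequantization.

```python
from typing import List, Tuple

QMAX = 127  # symmetric int8-style range; an assumption for this sketch


def quantize(block: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-QMAX, QMAX] plus one scale for the block."""
    amax = max((abs(x) for x in block), default=0.0)
    scale = amax / QMAX if amax > 0 else 1.0
    q = [max(-QMAX, min(QMAX, round(x / scale))) for x in block]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the quantized payload."""
    return [v * scale for v in q]


def low_precision_allreduce(rank_inputs: List[List[float]]) -> List[List[float]]:
    """Simulate a sum-allreduce in which every rank sends quantized data.

    Each rank quantizes its buffer, the payloads are "exchanged" (here just a
    Python list), and the sum is accumulated in full precision.
    """
    payloads = [quantize(buf) for buf in rank_inputs]  # what would cross the link
    reduced = [0.0] * len(rank_inputs[0])
    for q, scale in payloads:
        for i, v in enumerate(dequantize(q, scale)):
            reduced[i] += v
    # Every rank ends up with the same reduced buffer.
    return [list(reduced) for _ in rank_inputs]
```

In this sketch the per-rank quantization error is at most half a quantization step, so the summed result differs from the exact sum by a small bounded amount. That is the accuracy cost paid for moving fewer bytes over the link.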

Test Coverage

I added a unit test for this feature: tests/unittest/_torch/multi_gpu/test_lowprecision_allreduce.py. The test should run on L40 or L20 nodes; on other hardware, the low-precision allreduce falls back to NCCL, which makes the test meaningless and causes it to fail.

cpp/tensorrt_llm/thop/allreduceOp.cpp

tensorrt_llm/_torch/distributed/ops.py

cpp/tensorrt_llm/thop/allreduceOp.cpp

juney-nvidia · 2025-04-26T02:10:32Z

@dongxuy04 @yuxianq Hi Dongxu, Yuxian, can you help review this MR?

Thanks
June

cpp/tensorrt_llm/thop/thUtils.cpp

cpp/tensorrt_llm/kernels/customLowPrecisionAllReduceKernels.cu

cpp/tensorrt_llm/kernels/customLowPrecisionAllReduceKernels.h

tensorrt_llm/plugin/plugin.py

cpp/tensorrt_llm/thop/allreduceOp.cpp

docs/source/advanced/lowprecision-pcie-allreduce.md

tensorrt_llm/_torch/distributed/ops.py

cpp/tensorrt_llm/thop/allreduceOp.cpp

tensorrt_llm/_torch/distributed/ops.py

cpp/tensorrt_llm/thop/allreduceOp.cpp

cpp/tensorrt_llm/kernels/CMakeLists.txt

cpp/tensorrt_llm/kernels/customLowPrecisionAllReduceKernels.cu

tensorrt_llm/plugin/plugin.py

kanghui0204 · 2025-05-09T05:17:16Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt_llm/_torch/distributed/ops.py

cpp/tensorrt_llm/thop/allreduceOp.cpp

hyukn

Just some nits. LGTM.

kanghui0204 · 2025-05-09T07:00:33Z

/bot run --disable-fail-fast --add-multi-gpu-test

EmmaQiaoCh · 2025-05-09T07:38:57Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2025-05-09T07:44:20Z

PR_Github #4678 [ run ] triggered by Bot

EmmaQiaoCh · 2025-05-13T01:59:26Z

/bot run --stage-list "H100_PCIe-PyTorch-1"

tensorrt-cicd · 2025-05-13T02:05:26Z

PR_Github #4929 [ run ] triggered by Bot

EmmaQiaoCh · 2025-05-13T02:44:58Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2025-05-13T02:50:51Z

PR_Github #4939 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-13T02:50:53Z

PR_Github #4929 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-05-13T07:12:18Z

PR_Github #4939 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3582 completed with status: 'FAILURE'

hyukn · 2025-05-13T07:51:16Z

/bot run --disable-fail-fast --only-multi-gpu-test

tensorrt-cicd · 2025-05-13T07:57:02Z

PR_Github #4980 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-13T14:34:50Z

PR_Github #4980 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3617 (Partly Tested) completed with status: 'FAILURE'

hyukn · 2025-05-14T01:22:32Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2025-05-14T01:28:55Z

PR_Github #5072 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-14T05:10:36Z

PR_Github #5072 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3693 completed with status: 'FAILURE'

EmmaQiaoCh · 2025-05-14T05:13:16Z

/bot run --stage-list "B200_PCIe-PackageSanityCheck-DLFW"

tensorrt-cicd · 2025-05-14T05:24:18Z

PR_Github #5112 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-14T08:25:20Z

PR_Github #5112 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3724 (Partly Tested) completed with status: 'SUCCESS'

Signed-off-by: Hui Kang <[email protected]>

hyukn · 2025-05-14T08:28:39Z

/bot reuse-pipeline

tensorrt-cicd · 2025-05-14T08:34:58Z

PR_Github #5150 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-05-14T08:44:52Z

PR_Github #5150 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #5112 (Partly Tested) for commit 8614f2c

This reverts commit 5e634dd.

QiJune · 2025-05-15T01:35:51Z

Hi @kanghui0204 @hyukn, we found that the multi-GPU test pipelines were not triggered as expected in this PR because of /bot reuse-pipeline. Meanwhile, the post-merge multi-GPU test pipelines hang. We are therefore reverting this PR to fix the broken CI first. Feel free to submit a new PR and trigger the multi-GPU test pipelines with /bot run --only-multi-gpu-test

Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)" This reverts commit 5e634dd.

hyukn reviewed Apr 25, 2025

cpp/tensorrt_llm/thop/allreduceOp.cpp (resolved)

cpp/tensorrt_llm/thop/allreduceOp.cpp (outdated, resolved)

tensorrt_llm/_torch/distributed/ops.py (outdated, resolved)

cpp/tensorrt_llm/thop/allreduceOp.cpp (outdated, resolved)

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 3 times, most recently from 86140b5 to 4613735, April 25, 2025 08:51

juney-nvidia requested review from dongxuy04 and yuxianq April 26, 2025 02:09

juney-nvidia changed the title ~~FEATURE:Low Precision Allreduce for PCIe based GPU~~ feat:Low Precision Allreduce for PCIe based GPU Apr 26, 2025

dongxuy04 reviewed Apr 27, 2025

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 3 times, most recently from 3ab2d21 to ae1d6d4, April 28, 2025 10:06

hlu1 reviewed Apr 29, 2025

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from 38ddfa1 to dd8ee0d, May 3, 2025 14:56

hyukn requested a review from yizhang-nv May 6, 2025 03:14

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from c0f6e3a to 09bcfdc, May 6, 2025 14:44

hlu1 reviewed May 7, 2025

yizhang-nv reviewed May 7, 2025

cpp/tensorrt_llm/kernels/CMakeLists.txt (outdated, resolved)

cpp/tensorrt_llm/kernels/customLowPrecisionAllReduceKernels.cu (outdated, resolved)

tensorrt_llm/plugin/plugin.py (outdated, resolved)

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from a4de314 to 7816b41, May 9, 2025 04:26

hyukn reviewed May 9, 2025

tensorrt_llm/_torch/distributed/ops.py (resolved)

cpp/tensorrt_llm/thop/allreduceOp.cpp (outdated, resolved)

hyukn approved these changes May 9, 2025

hyukn changed the title ~~feat:Low Precision Allreduce for PCIe based GPU~~ feat: Low Precision Allreduce for PCIe based GPU May 9, 2025

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 7816b41 to 222c4a9, May 9, 2025 06:23

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 60d0eec to 9a1e741, May 13, 2025 01:20

kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 9a1e741 to 4f2c648, May 14, 2025 01:19

hyukn enabled auto-merge (squash) May 14, 2025 08:27

low precision allreduce

8614f2c

Signed-off-by: Hui Kang <[email protected]>

hyukn force-pushed the low_precision_allreduce_for_pcie branch from 4f2c648 to 8614f2c, May 14, 2025 08:27

hyukn disabled auto-merge May 14, 2025 08:31

hyukn merged commit 5e634dd into NVIDIA:main May 14, 2025
2 checks passed

QiJune added a commit that referenced this pull request May 15, 2025

Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)"

8a7b5ef

This reverts commit 5e634dd.

QiJune mentioned this pull request May 15, 2025

Revert "feat: Low Precision Allreduce for PCIe based GPU" #4340

Merged

QiJune added a commit that referenced this pull request May 15, 2025

Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340)

498ce8a

Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)" This reverts commit 5e634dd.

kanghui0204 mentioned this pull request May 15, 2025

feat: Low Precision Allreduce for PCIe based GPU #4344

Merged
