feat: Low Precision Allreduce for PCIe based GPU #3851

Merged on May 14, 2025 (1 commit)

kanghui0204
Collaborator

Low Precision Allreduce for PCIe based GPU

[feat] Support a new feature for low precision allreduce for PCIe based GPU

Description

This PR adds a custom allreduce to TensorRT-LLM for communication between PCIe-based GPUs. It quantizes the data to low precision before communication, which can accelerate allreduce over PCIe.
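
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of a quantized sum-allreduce. It is not the kernel in this PR: the int8 format, the single per-buffer scale, and the function names `quantize_int8`, `dequantize_int8`, and `low_precision_allreduce` are all assumptions chosen for illustration. It only shows why sending low-precision codes plus a scale shrinks the payload while keeping the reduced result close to the exact sum.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization to the int8 range [-127, 127] with one scale.

    Assumption: int8 with a per-buffer scale; the real kernel may use a
    different low-precision format and block size.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def low_precision_allreduce(rank_buffers: List[List[float]]) -> List[float]:
    """Simulate a sum-allreduce in which every rank sends quantized data."""
    # Each rank quantizes locally; only the codes and one scale cross the link.
    payloads = [quantize_int8(buf) for buf in rank_buffers]
    result = [0.0] * len(rank_buffers[0])
    # The receiver dequantizes and accumulates in full precision.
    for codes, scale in payloads:
        for i, value in enumerate(dequantize_int8(codes, scale)):
            result[i] += value
    return result


if __name__ == "__main__":
    buffers = [[0.5, -1.0, 2.0], [1.5, 0.25, -2.0]]
    exact = [sum(column) for column in zip(*buffers)]
    approx = low_precision_allreduce(buffers)
    print("exact :", exact)
    print("approx:", approx)
```

In this sketch each element's error is bounded by half a quantization step per rank, so the accuracy cost grows with the number of ranks and with the dynamic range of each buffer.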

Test Coverage

I added a unit test for this feature: tests/unittest/_torch/multi_gpu/test_lowprecision_allreduce.py. The test should run on L40 or L20 nodes; otherwise, the low-precision allreduce falls back to NCCL, which makes the test meaningless and causes it to fail.
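
One way to keep the test from failing on unsuitable nodes is to skip it explicitly. The sketch below is hypothetical and not part of this PR: `PCIE_TEST_TARGETS` and `skip_reason` are illustrative names, and matching on the device name is only a stand-in, since the actual fallback condition is not shown in this thread.

```python
from typing import Optional

# Assumption: the GPU families named above as suitable test targets.
PCIE_TEST_TARGETS = ("L40", "L20")


def skip_reason(device_name: str) -> Optional[str]:
    """Return a skip message, or None when the test is meaningful on this GPU."""
    if any(tag in device_name for tag in PCIE_TEST_TARGETS):
        return None
    return f"{device_name}: low-precision allreduce would fall back to NCCL"


if __name__ == "__main__":
    for name in ("NVIDIA L40S", "NVIDIA L20", "NVIDIA H100 PCIe"):
        print(name, "->", skip_reason(name))
```

Inside a test, the returned message could be passed to `pytest.skip(...)`, with the device name taken from something like `torch.cuda.get_device_name()`.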

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 3 times, most recently from 86140b5 to 4613735 Compare April 25, 2025 08:51
@juney-nvidia
Collaborator

@dongxuy04 @yuxianq Hi Dongxu, Yuxian, can you help review this MR?

Thanks
June

@juney-nvidia juney-nvidia changed the title FEATURE:Low Precision Allreduce for PCIe based GPU feat:Low Precision Allreduce for PCIe based GPU Apr 26, 2025
@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 3 times, most recently from 3ab2d21 to ae1d6d4 Compare April 28, 2025 10:06
@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from 38ddfa1 to dd8ee0d Compare May 3, 2025 14:56
@hyukn hyukn requested a review from yizhang-nv May 6, 2025 03:14
@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from c0f6e3a to 09bcfdc Compare May 6, 2025 14:44
@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from a4de314 to 7816b41 Compare May 9, 2025 04:26
@kanghui0204
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

Collaborator

@hyukn hyukn left a comment

Just some nits. LGTM.

@hyukn hyukn changed the title feat:Low Precision Allreduce for PCIe based GPU feat: Low Precision Allreduce for PCIe based GPU May 9, 2025
@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 7816b41 to 222c4a9 Compare May 9, 2025 06:23
@kanghui0204
Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

1 similar comment
@EmmaQiaoCh
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #4678 [ run ] triggered by Bot

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 60d0eec to 9a1e741 Compare May 13, 2025 01:20
@EmmaQiaoCh
Collaborator

/bot run --stage-list "H100_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #4929 [ run ] triggered by Bot

@EmmaQiaoCh
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #4939 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #4929 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #4939 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3582 completed with status: 'FAILURE'

@hyukn
Collaborator

hyukn commented May 13, 2025

/bot run --disable-fail-fast --only-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #4980 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #4980 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3617 (Partly Tested) completed with status: 'FAILURE'

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 9a1e741 to 4f2c648 Compare May 14, 2025 01:19
@hyukn
Collaborator

hyukn commented May 14, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #5072 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5072 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3693 completed with status: 'FAILURE'

@EmmaQiaoCh
Collaborator

/bot run --stage-list "B200_PCIe-PackageSanityCheck-DLFW"

@tensorrt-cicd
Collaborator

PR_Github #5112 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5112 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3724 (Partly Tested) completed with status: 'SUCCESS'

@hyukn hyukn enabled auto-merge (squash) May 14, 2025 08:27
@hyukn hyukn force-pushed the low_precision_allreduce_for_pcie branch from 4f2c648 to 8614f2c Compare May 14, 2025 08:27
@hyukn
Collaborator

hyukn commented May 14, 2025

/bot reuse-pipeline

@hyukn hyukn disabled auto-merge May 14, 2025 08:31
@tensorrt-cicd
Collaborator

PR_Github #5150 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5150 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #5112 (Partly Tested) for commit 8614f2c

@hyukn hyukn merged commit 5e634dd into NVIDIA:main May 14, 2025
2 checks passed
QiJune added a commit that referenced this pull request May 15, 2025
@QiJune
Collaborator

QiJune commented May 15, 2025

Hi @kanghui0204 @hyukn, we found that the multi-GPU test pipelines were not triggered as expected in this PR because of /bot reuse-pipeline. In addition, the post-merge multi-GPU test pipelines hang. We are therefore reverting this PR to fix the broken CI first. Feel free to submit a new PR and trigger the multi-GPU test pipelines with /bot run --only-multi-gpu-test

QiJune added a commit that referenced this pull request May 15, 2025
Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)"

This reverts commit 5e634dd.
10 participants