
Conversation

@kanghui0204 (Collaborator)

Last PR: #3851
Last revert PR: #4340

@kanghui0204 kanghui0204 requested review from QiJune and hyukn May 15, 2025 02:52
@hyukn (Collaborator) commented May 15, 2025

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

@tensorrt-cicd (Collaborator)

PR_Github #5257 [ run ] triggered by Bot

@hyukn (Collaborator) commented May 15, 2025

/bot kill

@tensorrt-cicd (Collaborator)

PR_Github #5313 [ kill ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5257 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #5313 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 8614f2c

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 8614f2c to fb827d8 Compare May 15, 2025 16:01
@hyukn (Collaborator) commented May 16, 2025

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

@tensorrt-cicd (Collaborator)

PR_Github #5420 [ run ] triggered by Bot

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from fb827d8 to b3e1189 Compare May 16, 2025 02:17
@EmmaQiaoCh (Collaborator)

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

@tensorrt-cicd (Collaborator)

PR_Github #5420 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3956 (Partly Tested) completed with status: 'SUCCESS'

@tensorrt-cicd (Collaborator)

PR_Github #5437 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5437 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3968 (Partly Tested) completed with status: 'SUCCESS'

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from b3e1189 to 60360e1 Compare May 16, 2025 05:45
@hyukn (Collaborator) commented May 16, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #5460 [ run ] triggered by Bot

@yizhang-nv (Member)

Does this set of kernels support CUDA graphs? If the barrier flag is captured, graph replay may cause issues, since we depend on the value of the barrier flag and the comm buffer to ensure that every GPU reaches the same barrier.

For the captured value, if the model has an odd number of allreduce ops, it may select the same peer comm buffer here:
https://github.com/NVIDIA/TensorRT-LLM/pull/4344/files#diff-fd189077a08106939fbdaf23180ba0bc7d81d76279c632db9097381e8440b2c9R1422
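The hazard described in the comment above can be sketched with a toy simulation (plain Python, not the actual TensorRT-LLM kernel code; the two-buffer ping-pong scheme, `flag % 2` buffer selection, and the function names are assumptions for illustration). With an eagerly incremented flag, consecutive allreduce ops alternate comm buffers; if the flag values are frozen at CUDA graph capture time and the graph contains an odd number of allreduce ops, the last op of one replay and the first op of the next replay land on the same peer comm buffer.

```python
# Toy model (NOT the actual kernel logic): buffer index = flag % NUM_BUFFERS.
NUM_BUFFERS = 2  # assumed ping-pong comm-buffer scheme


def eager_buffers(start_flag, num_ops, num_iters):
    """Flag increments live on every call: buffers keep alternating."""
    flag = start_flag
    seq = []
    for _ in range(num_iters):
        for _ in range(num_ops):
            seq.append(flag % NUM_BUFFERS)
            flag += 1
    return seq


def captured_buffers(start_flag, num_ops, num_iters):
    """Flag values frozen at capture: every replay reuses the same sequence."""
    captured = [(start_flag + i) % NUM_BUFFERS for i in range(num_ops)]
    return captured * num_iters


def has_back_to_back_reuse(seq):
    """Two consecutive ops on the same buffer = potential data hazard."""
    return any(a == b for a, b in zip(seq, seq[1:]))


ops_per_graph = 3  # odd number of allreduce ops per captured graph

print("eager   :", eager_buffers(0, ops_per_graph, 2))     # [0, 1, 0, 1, 0, 1]
print("captured:", captured_buffers(0, ops_per_graph, 2))  # [0, 1, 0, 0, 1, 0]
# With an odd op count, replay 2 starts on the buffer replay 1 just finished
# with, so a peer may still be reading it.
```

With an even number of allreduce ops per graph, the frozen sequence happens to keep alternating across replays, which is consistent with the observation that the problem appears specifically for an odd op count.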

@kanghui0204 (Collaborator, Author)

> Does this set of kernels support CUDA graphs? If the barrier flag is captured, graph replay may cause issues, since we depend on the value of the barrier flag and the comm buffer to ensure that every GPU reaches the same barrier.
>
> For the captured value, if the model has an odd number of allreduce ops, it may select the same peer comm buffer here: https://github.com/NVIDIA/TensorRT-LLM/pull/4344/files#diff-fd189077a08106939fbdaf23180ba0bc7d81d76279c632db9097381e8440b2c9R1422

Do all our current kernels need to support CUDA graphs? I haven't tested these kernels on CUDA graphs.

@hyukn (Collaborator) commented May 16, 2025

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #5503 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5460 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #5503 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4010 completed with status: 'FAILURE'

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 22e54f5 to 6ed9c56 Compare May 17, 2025 09:29
@kanghui0204 (Collaborator, Author)

/bot run --add-multi-gpu-test

@EmmaQiaoCh (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #5596 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5596 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4082 completed with status: 'FAILURE'

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch 2 times, most recently from bd92a55 to 7097900 Compare May 19, 2025 00:22
@hyukn (Collaborator) commented May 19, 2025

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #5644 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5644 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4124 completed with status: 'FAILURE'

@kanghui0204 kanghui0204 force-pushed the low_precision_allreduce_for_pcie branch from 7097900 to ec197d9 Compare May 19, 2025 04:21
@hyukn (Collaborator) commented May 19, 2025

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #5673 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #5673 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4144 completed with status: 'SUCCESS'

@hyukn (Collaborator) commented May 19, 2025

Pipeline passed. Merging this PR.

@hyukn hyukn merged commit 6f3922f into NVIDIA:main May 19, 2025
3 checks passed