feat: Low Precision Allreduce for PCIe based GPU #4344
Conversation
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5257 [ run ] triggered by Bot
/bot kill
PR_Github #5313 [ kill ] triggered by Bot
PR_Github #5257 [ run ] completed with state
PR_Github #5313 [ kill ] completed with state
Force-pushed from 8614f2c to fb827d8
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5420 [ run ] triggered by Bot
Force-pushed from fb827d8 to b3e1189
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #5420 [ run ] completed with state
PR_Github #5437 [ run ] triggered by Bot
PR_Github #5437 [ run ] completed with state
Force-pushed from b3e1189 to 60360e1
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #5460 [ run ] triggered by Bot
Does this set of kernels support CUDA graphs? If the barrier flag is captured, graph replay may cause issues, since we depend on the value of the barrier flag and the comm buffer to ensure every GPU reaches the same barrier. With the captured value, if the model has an odd number of allreduce ops, it may select the same peer comm buffer here:
Do all our current kernels need to support CUDA graphs? I haven't tested these kernels with CUDA graphs.
/bot run --add-multi-gpu-test
PR_Github #5503 [ run ] triggered by Bot
PR_Github #5460 [ run ] completed with state
PR_Github #5503 [ run ] completed with state
Force-pushed from 22e54f5 to 6ed9c56
/bot run --add-multi-gpu-test
/bot run
PR_Github #5596 [ run ] triggered by Bot
PR_Github #5596 [ run ] completed with state
Force-pushed from bd92a55 to 7097900
/bot run --add-multi-gpu-test
PR_Github #5644 [ run ] triggered by Bot
PR_Github #5644 [ run ] completed with state
Signed-off-by: Hui Kang <[email protected]>
Force-pushed from 7097900 to ec197d9
/bot run --add-multi-gpu-test
PR_Github #5673 [ run ] triggered by Bot
PR_Github #5673 [ run ] completed with state
Pipeline passed. Merge this PR.
last PR: #3851
last revert PR: #4340