Pull requests: NVIDIA/TransformerEngine
[PyTorch] Implement fp32 accumulation for attention with context parallelism in both the forward and backward passes
#821 opened Apr 28, 2024 by Yuxin-CV
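The general idea behind #821 can be illustrated outside of Transformer Engine: with context parallelism, attention is computed chunk by chunk and the per-chunk outputs are merged using their log-sum-exp statistics, and doing that merge in fp32 rather than bf16 avoids precision loss. The sketch below is a minimal illustration of that accumulation; merge_attn_chunks and its signature are hypothetical, not TE's API.

```python
import torch

def merge_attn_chunks(chunk_outs, chunk_lses):
    """Illustrative helper (not Transformer Engine's API).

    chunk_outs: list of [B, H, S, D] partial attention outputs (e.g. bf16)
    chunk_lses: list of [B, H, S] log-sum-exp of each chunk's scores
    """
    # Global log-sum-exp over all chunks, computed in fp32.
    lse = torch.logsumexp(torch.stack(chunk_lses).float(), dim=0)      # [B, H, S]
    acc = torch.zeros_like(chunk_outs[0], dtype=torch.float32)
    for out, chunk_lse in zip(chunk_outs, chunk_lses):
        # Weight of this chunk's local softmax relative to the global softmax.
        w = torch.exp(chunk_lse.float() - lse).unsqueeze(-1)           # [B, H, S, 1]
        acc += w * out.float()                                         # fp32 accumulation
    return acc.to(chunk_outs[0].dtype), lse
```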
Lower memory usage during AttnFuncWithCP.forward
#951 opened Jun 21, 2024 by i4never
[pre-commit.ci] pre-commit suggestions
wontfix (This will not be worked on)
#979 opened Jul 2, 2024 by pre-commit-ci[bot] • Draft
Add an efficient CUDA cross-entropy kernel to accelerate training and reduce cross-entropy memory usage
#995 opened Jul 8, 2024 by cb521
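The memory saving targeted by #995 comes from not materializing a full fp32 softmax over the vocabulary at once. The same effect can be approximated in plain PyTorch by evaluating the loss over row chunks; this is only a sketch of the idea, not the proposed CUDA kernel, and chunked_cross_entropy is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits, targets, chunk_rows=4096):
    """Mean cross-entropy over [N, vocab] logits without upcasting
    and log-softmaxing the whole tensor at once; only one chunk of
    rows is held in fp32 at a time."""
    total = logits.new_zeros((), dtype=torch.float32)
    n = logits.shape[0]
    for start in range(0, n, chunk_rows):
        chunk = logits[start:start + chunk_rows].float()
        total += F.cross_entropy(chunk, targets[start:start + chunk_rows],
                                 reduction="sum")
    return total / n
```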
Use pyproject.toml to specify build requirements
build (Build system)
#1061 opened Jul 30, 2024 by ksivaman
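For #1061, the standard PEP 517/518 mechanism is a [build-system] table in pyproject.toml. The snippet below is a generic example of that table; the listed requirements are assumptions for illustration, not the PR's actual contents.

```toml
# pyproject.toml (illustrative [build-system] table, not taken from the PR)
[build-system]
requires = ["setuptools>=61.0", "wheel"]   # assumed build dependencies
build-backend = "setuptools.build_meta"
```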
Add high_precision_init_val to model params when using fp8_model_init
#1121 opened Aug 19, 2024 by kunlunl
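The mechanism suggested by #1121's title, keeping a higher-precision copy of each parameter's initial value when parameters are created under fp8_model_init, could look roughly like the following. The attribute name mirrors the PR title, but the code is an assumption about the approach, not TE's implementation.

```python
import torch

def stash_high_precision_init(module: torch.nn.Module) -> None:
    # Assumed approach (not necessarily TE's): keep an fp32 CPU copy of each
    # parameter's initial value so it can be recovered later (e.g. for
    # optimizer master weights) even if the live parameter is stored in fp8.
    for param in module.parameters():
        param.high_precision_init_val = param.detach().float().cpu().clone()
```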
Fix param input order for cudagraph
bug (Something isn't working)
#1138 opened Aug 27, 2024 by yifeis-nv
[PyTorch] Avoid saving fp8_tensors in certain scenarios
#1143 opened Aug 28, 2024 by cyanguwa
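What "avoid saving tensors in certain scenarios" typically means for a custom autograd function is shown below with a plain fp32 linear op: when the weight needs no gradient, the input activation never has to be stashed for backward. This is a generic PyTorch sketch (LinearSkipSave is a hypothetical class), not the change made in #1143.

```python
import torch

class LinearSkipSave(torch.autograd.Function):
    """y = x @ W^T, saving the input for backward only when it is needed."""

    @staticmethod
    def forward(ctx, inp, weight):
        ctx.needs_weight_grad = weight.requires_grad
        if ctx.needs_weight_grad:
            ctx.save_for_backward(inp, weight)   # input needed for dW
        else:
            ctx.save_for_backward(weight)        # input can be dropped
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        if ctx.needs_weight_grad:
            inp, weight = ctx.saved_tensors
            return grad_out @ weight, grad_out.t() @ inp
        (weight,) = ctx.saved_tensors
        return grad_out @ weight, None

# Usage: out = LinearSkipSave.apply(x, w)  # x: [N, in], w: [out, in]
```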
Draft: Use fused push_send_recv kernel for TP AG and RS overlaps
#1200 opened Sep 24, 2024 by erhoo82
Save CUDA Graph memory by reusing input and output tensors
#1234 opened Oct 9, 2024 by buptzyb
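#1234 builds on the fact that tensors captured inside a CUDA graph stay allocated for the graph's lifetime, so reusing one set of static input/output buffers shrinks the footprint. Below is a generic capture-and-replay sketch with static buffers in plain PyTorch, not the PR's implementation.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().half().eval()
static_in = torch.zeros(8, 1024, device="cuda", dtype=torch.half)

# Warm up on a side stream before capture (recommended by PyTorch docs).
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(side)

# Capture once; static_out is allocated during capture and reused on every replay.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

def run(x: torch.Tensor) -> torch.Tensor:
    static_in.copy_(x)   # write new data into the captured input buffer
    graph.replay()       # recomputes into the same output buffer
    return static_out    # copy out if the result must outlive the next replay
```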
Draft: Reduce cudagraph memory via preallocations
#1253 opened Oct 15, 2024 by JimmyZhang12
Fill attention_mask positions with -inf in UnfusedDotProductAttention
#1268 opened Oct 18, 2024 by Agoniii
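The change in #1268 concerns how masked positions are excluded before the softmax in the unfused attention path. The generic pattern is sketched below; masked_softmax is an illustrative helper, and filling with -inf (rather than a large finite negative value) makes masked positions contribute exactly zero after softmax.

```python
import torch

def masked_softmax(scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """scores: [B, H, Sq, Sk] attention logits.
    attention_mask: broadcastable boolean mask, True where a position must
    be ignored. Assumes every query row has at least one unmasked key,
    otherwise the all -inf row produces NaNs."""
    scores = scores.masked_fill(attention_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)
```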