Pull requests: NVIDIA/TransformerEngine
[PyTorch] Implement fp32 accumulation for attention with context parallelism in both the forward and backward passes
#821 opened Apr 28, 2024 by Yuxin-CV
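The general idea behind #821 can be illustrated outside of Transformer Engine: with context parallelism, attention is computed chunk by chunk and the per-chunk outputs are merged using their log-sum-exp statistics, and doing that merge in fp32 rather than bf16 avoids precision loss. The sketch below is a minimal illustration of that accumulation; merge_attn_chunks and its signature are hypothetical, not TE's API.

```python
import torch

def merge_attn_chunks(chunk_outs, chunk_lses):
    """Illustrative helper (not Transformer Engine's API).

    chunk_outs: list of [B, H, S, D] partial attention outputs (e.g. bf16)
    chunk_lses: list of [B, H, S] log-sum-exp of each chunk's scores
    """
    # Global log-sum-exp over all chunks, computed in fp32.
    lse = torch.logsumexp(torch.stack(chunk_lses).float(), dim=0)      # [B, H, S]
    acc = torch.zeros_like(chunk_outs[0], dtype=torch.float32)
    for out, chunk_lse in zip(chunk_outs, chunk_lses):
        # Weight of this chunk's local softmax relative to the global softmax.
        w = torch.exp(chunk_lse.float() - lse).unsqueeze(-1)           # [B, H, S, 1]
        acc += w * out.float()                                         # fp32 accumulation
    return acc.to(chunk_outs[0].dtype), lse
```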
Lower memory usage during AttnFuncWithCP.forward
#951 opened Jun 21, 2024 by i4never
[pre-commit.ci] pre-commit suggestions
wontfix (This will not be worked on)
#979 opened Jul 2, 2024 by pre-commit-ci[bot] • Draft
Add an efficient CUDA cross-entropy kernel to accelerate training and reduce cross-entropy memory usage
#995 opened Jul 8, 2024 by cb521
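The memory saving targeted by #995 comes from not materializing a full fp32 softmax over the vocabulary at once. The same effect can be approximated in plain PyTorch by evaluating the loss over row chunks; this is only a sketch of the idea, not the proposed CUDA kernel, and chunked_cross_entropy is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits, targets, chunk_rows=4096):
    """Mean cross-entropy over [N, vocab] logits without upcasting
    and log-softmaxing the whole tensor at once; only one chunk of
    rows is held in fp32 at a time."""
    total = logits.new_zeros((), dtype=torch.float32)
    n = logits.shape[0]
    for start in range(0, n, chunk_rows):
        chunk = logits[start:start + chunk_rows].float()
        total += F.cross_entropy(chunk, targets[start:start + chunk_rows],
                                 reduction="sum")
    return total / n
```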
Use pyproject.toml to specify build requirements
build (Build system)
#1061 opened Jul 30, 2024 by ksivaman
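For #1061, the standard PEP 517/518 mechanism is a [build-system] table in pyproject.toml. The snippet below is a generic example of that table; the listed requirements are assumptions for illustration, not the PR's actual contents.

```toml
# pyproject.toml (illustrative [build-system] table, not taken from the PR)
[build-system]
requires = ["setuptools>=61.0", "wheel"]   # assumed build dependencies
build-backend = "setuptools.build_meta"
```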
Add high_precision_init_val to model params when using fp8_model_init
#1121 opened Aug 19, 2024 by kunlunl
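The mechanism suggested by #1121's title, keeping a higher-precision copy of each parameter's initial value when parameters are created under fp8_model_init, could look roughly like the following. The attribute name mirrors the PR title, but the code is an assumption about the approach, not TE's implementation.

```python
import torch

def stash_high_precision_init(module: torch.nn.Module) -> None:
    # Assumed approach (not necessarily TE's): keep an fp32 CPU copy of each
    # parameter's initial value so it can be recovered later (e.g. for
    # optimizer master weights) even if the live parameter is stored in fp8.
    for param in module.parameters():
        param.high_precision_init_val = param.detach().float().cpu().clone()
```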
Fix param input order for cudagraph
bug (Something isn't working)
#1138 opened Aug 27, 2024 by yifeis-nv
[PyTorch] Avoid saving fp8_tensors in certain scenarios
#1143 opened Aug 28, 2024 by cyanguwa
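What "avoid saving tensors in certain scenarios" typically means for a custom autograd function is shown below with a plain fp32 linear op: when the weight needs no gradient, the input activation never has to be stashed for backward. This is a generic PyTorch sketch (LinearSkipSave is a hypothetical class), not the change made in #1143.

```python
import torch

class LinearSkipSave(torch.autograd.Function):
    """y = x @ W^T, saving the input for backward only when it is needed."""

    @staticmethod
    def forward(ctx, inp, weight):
        ctx.needs_weight_grad = weight.requires_grad
        if ctx.needs_weight_grad:
            ctx.save_for_backward(inp, weight)   # input needed for dW
        else:
            ctx.save_for_backward(weight)        # input can be dropped
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        if ctx.needs_weight_grad:
            inp, weight = ctx.saved_tensors
            return grad_out @ weight, grad_out.t() @ inp
        (weight,) = ctx.saved_tensors
        return grad_out @ weight, None

# Usage: out = LinearSkipSave.apply(x, w)  # x: [N, in], w: [out, in]
```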
Draft: Use fused push_send_recv kernel for TP AG and RS overlaps
#1200 opened Sep 24, 2024 by erhoo82
Save CUDA Graph memory by reusing input and output tensors
#1234 opened Oct 9, 2024 by buptzyb
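#1234 builds on the fact that tensors captured inside a CUDA graph stay allocated for the graph's lifetime, so reusing one set of static input/output buffers shrinks the footprint. Below is a generic capture-and-replay sketch with static buffers in plain PyTorch, not the PR's implementation.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().half().eval()
static_in = torch.zeros(8, 1024, device="cuda", dtype=torch.half)

# Warm up on a side stream before capture (recommended by PyTorch docs).
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(side)

# Capture once; static_out is allocated during capture and reused on every replay.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

def run(x: torch.Tensor) -> torch.Tensor:
    static_in.copy_(x)   # write new data into the captured input buffer
    graph.replay()       # recomputes into the same output buffer
    return static_out    # copy out if the result must outlive the next replay
```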
Draft: Reduce cudagraph memory via preallocations
#1253 opened Oct 15, 2024 by JimmyZhang12
Fill attention_mask positions with -inf in UnfusedDotProductAttention
#1268 opened Oct 18, 2024 by Agoniii
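The change in #1268 concerns how masked positions are excluded before the softmax in the unfused attention path. The generic pattern is sketched below; masked_softmax is an illustrative helper, and filling with -inf (rather than a large finite negative value) makes masked positions contribute exactly zero after softmax.

```python
import torch

def masked_softmax(scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """scores: [B, H, Sq, Sk] attention logits.
    attention_mask: broadcastable boolean mask, True where a position must
    be ignored. Assumes every query row has at least one unmasked key,
    otherwise the all -inf row produces NaNs."""
    scores = scores.masked_fill(attention_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)
```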