
Paged attention changes to THD attention #3

Draft · wants to merge 124 commits into base: te_gemma_generation_support

Conversation

sudhakarsingh27 (Owner)

Description

Checking how difficult it is to merge the Paged Attention changes into the THD Attention changes.

jennifgcrl and others added 30 commits November 12, 2024 20:30
fix an int conversion error

Signed-off-by: Jennifer Zhou <[email protected]>
Debug ONNX export with te.Sequential

ONNX export assumes that all state dict objects are tensors, even extra state.

Signed-off-by: Tim Moon <[email protected]>
…tructure (NVIDIA#1326)

* Remove manual FP8 scale update for FP8 params

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
…#1334)

* Limit to one call of ctx.saved_tensors per autograd bwd

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
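The "Limit to one call of ctx.saved_tensors per autograd bwd" commit above reflects a general PyTorch pattern: in a custom autograd function's backward, the saved tensors should be unpacked once into locals rather than re-read on every use, since each access of the context property does real work. A minimal stand-alone sketch of that pattern follows; `CountingCtx` is a hypothetical stand-in for the real torch autograd context, used here only so the access count is observable without torch installed.

```python
# Illustrative sketch (not the actual PR code): why backward() should read
# ctx.saved_tensors exactly once. CountingCtx is a hypothetical stand-in for
# torch.autograd's function context; in real PyTorch each property access
# re-unpacks (and sanity-checks) the saved tensors.

class CountingCtx:
    """Mimics an autograd function context; counts saved_tensors accesses."""

    def __init__(self, *tensors):
        self._saved = tensors
        self.access_count = 0

    @property
    def saved_tensors(self):
        self.access_count += 1  # each access does real work in PyTorch
        return self._saved


def backward_bad(ctx):
    # Anti-pattern: every reference re-triggers the property.
    return ctx.saved_tensors[0] + ctx.saved_tensors[1]


def backward_good(ctx):
    # Pattern from the commit: unpack once, then use the locals.
    a, b = ctx.saved_tensors
    return a + b


ctx = CountingCtx(1, 2)
backward_bad(ctx)
assert ctx.access_count == 2

ctx = CountingCtx(1, 2)
backward_good(ctx)
assert ctx.access_count == 1
```

In the real module the same change is purely mechanical: replace repeated `ctx.saved_tensors[...]` references with a single tuple unpack at the top of `backward`.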
* Add activation ops

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix lint warnings

Signed-off-by: Tim Moon <[email protected]>

* Fix linter warning

Signed-off-by: Tim Moon <[email protected]>

* Update to use QuantizedTensor

Signed-off-by: Tim Moon <[email protected]>

* Respect PyTorch autograd dtype

Signed-off-by: Tim Moon <[email protected]>

* Rename CastFloat8 op to Quantize

Signed-off-by: Tim Moon <[email protected]>

* Add support for fused dSwiGLU-cast-transpose

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Przemek Tredak <[email protected]>
…1333)

use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR

Signed-off-by: Kenichi Maehashi <[email protected]>
* fix GQA error message

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Handle deprecated `hidden_size` arg in norm modules

Signed-off-by: Tim Moon <[email protected]>

* Support initializing norm ops on CPU

Signed-off-by: Tim Moon <[email protected]>

* Add integration test for Megatron-LM

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename Mcore integration test

Signed-off-by: Tim Moon <[email protected]>

* Handle case in RMSNorm where hidden dim is not provided

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add helper function to convert C++ container to string

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Align RNG tracker with megatron

Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>

* Fix module_params order and warmup bug in cudagraph

Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>

* Add fp8_group argument and fix fp8 accuracy issue for cudagraph

Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>

* Add TE modules and weights filters to support MoE models

Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>

* Revert self.fp8

Signed-off-by: Robin Zhang <[email protected]>

* Use hooks to filter module params

Signed-off-by: Robin Zhang <[email protected]>

* Filter all TE modules in hooks

Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: Yifei Song <[email protected]>

* Format code

Signed-off-by: Robin Zhang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update graph.py

Signed-off-by: Xin Yao <[email protected]>

* Revert CudaRNGStatesTracker

Signed-off-by: Robin Zhang <[email protected]>

* Format Update

Signed-off-by: Yifei Song <[email protected]>

* Revert "Use hooks to filter module params"

This reverts commit 73a22e2.

Signed-off-by: Yifei Song <[email protected]>

* Remove filtering module params

Signed-off-by: Robin Zhang <[email protected]>

---------

Signed-off-by: Robin Zhang <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Signed-off-by: Yifei Song <[email protected]>
Co-authored-by: Yifei Song <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Moved framework agnostic THD kernels to common.

---------

Signed-off-by: Michael Goldfarb <[email protected]>
* retain_graph=True for grouped gemm

Signed-off-by: Xiaowei Ren <[email protected]>

* remove an unnecessary retain_graph=True

Signed-off-by: Xiaowei Ren <[email protected]>

* make retain_graph in graph capture configurable

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
* Update list of CI users

Signed-off-by: Tim Moon <[email protected]>

* Update list of CI users

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
…age (NVIDIA#1308)

* draft implementation

Signed-off-by: Youngeun Kwon <[email protected]>

* compile error fix

Signed-off-by: Youngeun Kwon <[email protected]>

* fix compile error

Signed-off-by: Youngeun Kwon <[email protected]>

* remove print

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Edit comments

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* edit the bulk-overlap test case

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add version guard

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add runtime version guard

Signed-off-by: Youngeun Kwon <[email protected]>

* fix the version guard

Signed-off-by: Youngeun Kwon <[email protected]>

---------

Signed-off-by: Youngeun Kwon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
…1347)

Scale sequence length in CP tests to avoid tiny sizes.

Signed-off-by: Michael Goldfarb <[email protected]>
Debug jobs to deploy nightly docs

Signed-off-by: Tim Moon <[email protected]>
Store module extra state in tensor

Signed-off-by: Tim Moon <[email protected]>
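The "Store module extra state in tensor" commit above pairs with the earlier ONNX-export note: if export assumes every state dict entry is a tensor, non-tensor extra state must be serialized into a flat buffer first. A hedged sketch of that round-trip is below; it uses stdlib `bytes` in place of the `torch.uint8` tensor the actual change would use, and both helper names and the example dict contents are illustrative, not taken from the PR.

```python
# Hypothetical sketch of the "extra state as a tensor" idea: serialize the
# module's non-tensor extra state into a flat byte buffer so that every
# state dict entry is tensor-like. Stdlib bytes stand in for a torch.uint8
# tensor; helper names and the example dict are illustrative only.

import pickle


def extra_state_to_bytes(state: dict) -> bytes:
    """Pack arbitrary picklable extra state into a flat byte buffer."""
    return pickle.dumps(state)


def bytes_to_extra_state(buf: bytes) -> dict:
    """Recover the original extra state from the byte buffer."""
    return pickle.loads(buf)


extra = {"recipe": "delayed_scaling", "history_len": 16}  # made-up contents
buf = extra_state_to_bytes(extra)  # real code: wrap buf in a uint8 tensor
assert bytes_to_extra_state(buf) == extra
```

The design point is that serialization happens in `get_extra_state` and deserialization in `set_extra_state`, so checkpointing and export only ever see the buffer.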
* always have padding mask type for both flash and fused attentions

Signed-off-by: Xiaowei Ren <[email protected]>

* remove a redundant assert

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Debug Mcore integration test

Avoid FP8 on Ampere and older. Generate synthetic data instead of depending on external data.

Signed-off-by: Tim Moon <[email protected]>
timmoon10 and others added 3 commits February 19, 2025 02:40
Fix typo

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fix fuse_wgrad_accumulation for GroupedLinear

Signed-off-by: Xin Yao <[email protected]>

* fix fuse_wgrad_accumulation for GroupedLinear

Signed-off-by: Xin Yao <[email protected]>

* update tests

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
ksivaman and others added 3 commits February 19, 2025 23:30
* Fix te sequential for older pytorch versions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* commit some debug code

Signed-off-by: Xiaowei Ren <[email protected]>

* add more debug info

Signed-off-by: Xiaowei Ren <[email protected]>

* debug code commit and typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* a typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* remove debug info

Signed-off-by: Xiaowei Ren <[email protected]>

* do not return lse

Signed-off-by: Xiaowei Ren <[email protected]>

* add amax_per_step for quantizers of CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix FP8 + CP

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* dtype fix

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xiaowei Ren <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
@cyanguwa force-pushed the paged_attention branch 2 times, most recently from 6fcad33 to f5b91c6 on February 21, 2025 23:33
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
pre-commit-ci bot and others added 6 commits February 22, 2025 01:41
…NVIDIA#1466)

Use same API in optimizer zero_grad as PyT optimizers

Signed-off-by: Tim Moon <[email protected]>
…1498)

* Remove dependency on transformer_engine::Tensor in attention.cu

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Templatize thd_partition_indices_kernel and thd_read_half_tensor_kernel kernels ONLY for invoking recompilation and not directly using the pre-compiled symbols in libtransformer.so

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Modify attention.cu for thd templatized kernels. Remove dependency on common.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Move thd structs from libtransformer.so to framework extensions include header

Code cleanup

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Consolidate and move thd_utils from common to framework extensions

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Remove template decorators around thd_partition_indices_kernel and thd_read_half_tensor_kernel

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <[email protected]>
pre-commit-ci bot and others added 8 commits February 23, 2025 18:02
Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* reshape inp

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
* non-exit tests

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <[email protected]>