Massively reduce LayerNorm/RMSNorm training memory usage by sharing saved tensor with other parts of the networks #430
Saving the output of the normalization instead of the input reduces the memory cost in modern networks, where the output is going to be saved anyway (e.g., by a following Linear layer) and the input is otherwise only needed here.
We do this by changing the norm backward kernels to load from the output instead of the input, and to compute the normalized tensor from the output instead of the input. To stabilize the gradients, we also clamp gamma by magnitude before dividing by it.
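As a rough illustration of the idea, here is a PyTorch sketch (not the actual CUDA kernel; the function name, argument layout, and clamp value are made up for this example):

```python
import torch

def layernorm_bwd_from_output(dy, y, gamma, beta, rstd, clamp=1e-4):
    # Illustrative Python reference for the kernel change (the real change is
    # in the CUDA backward kernels). Shapes: dy, y are [rows, hidden];
    # gamma, beta are [hidden]; rstd is the saved 1/sqrt(var + eps) per row.
    # Clamp gamma by magnitude so that dividing by it stays stable.
    gamma_safe = torch.where(gamma >= 0,
                             gamma.clamp(min=clamp),
                             gamma.clamp(max=-clamp))
    # Recover the normalized tensor from the saved output:
    #   y = x_hat * gamma + beta  =>  x_hat = (y - beta) / gamma
    x_hat = (y - beta) / gamma_safe

    # Standard LayerNorm backward, driven by x_hat instead of the input x.
    dgamma = (dy * x_hat).sum(dim=0)
    dbeta = dy.sum(dim=0)
    dxhat = dy * gamma
    dx = rstd.unsqueeze(-1) * (
        dxhat
        - dxhat.mean(dim=-1, keepdim=True)
        - x_hat * (dxhat * x_hat).mean(dim=-1, keepdim=True)
    )
    return dx, dgamma, dbeta
```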
However, for now this seems to come at the price of somewhat lower numerical precision. To be investigated further.
For now, we pass the operator tests if we increase the tolerances for layer normalization's dgamma and dx.
Effect
From the sequence parallel paper, layer norm costs 4sbh in memory per layer during training. However, if we save the output of the layer instead of the input, we can free the inputs, as they are not needed anywhere else. This amounts to ~1/6 of the total activation cost of a transformer model. Note that this may result in slightly higher numerical errors, because we use the output for the gradients, and by that point rounding errors may have propagated.
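For a sense of scale, a quick back-of-the-envelope calculation with hypothetical shapes (not numbers from this PR):

```python
# Hypothetical shapes, only to put a number on the 4*s*b*h figure
# (assuming 16-bit activations, so 4*s*b*h is a byte count).
s, b, h, layers = 2048, 8, 4096, 32      # sequence, micro-batch, hidden, layers
bytes_per_layer = 4 * s * b * h          # two norms, each saving s*b*h fp16 values
print(bytes_per_layer / 2**20, "MiB per layer")       # 256.0 MiB per layer
print(layers * bytes_per_layer / 2**30, "GiB total")  # 8.0 GiB across 32 layers
```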
TODO
For now we are passing the operator tests and the tests in
test_numerics.py
(with added LayerNorm and gradient tests); a sketch of the gradient comparison follows below. We will very soon add more features, including support for fp8 and integration with Python frameworks. However, we note that this might require rethinking the dynamics of LayerNorm+Linear/MLP in the presence of fp8. Maybe eventually we will just make the normalization layers output half/single-precision data and save those instead of the inputs.
Also
See NVIDIA/apex#1715