Add efficient Cross-Entropy by cuda kernel to accelerate training speed and reduce cross-entropy memory usage during training. #995

Open
wants to merge 18 commits into base: main

Conversation


@cb521 cb521 commented Jul 8, 2024

Description

  Hi, we found that in the cross-entropy implementation of Megatron-LM, the input tensor must be converted to float for subsequent calculations, which results in redundant GPU memory usage and extra time. We used CUDA kernels to fuse the torch ops for the two forward steps and the one backward step of the cross-entropy computation: the float type conversion is performed inside the kernels, and several independent calculation steps are fused into a single one.

  To achieve this optimization, we developed three kernels in TransformerEngine and made corresponding changes in Megatron-LM. To evaluate the performance of the CUDA kernels, we also implemented an OpenAI Triton version of the kernel in Megatron-LM and compared it with the CUDA version.
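
As a rough illustration of what this fusion looks like, here is a minimal CUDA sketch (not the PR's actual code; kernel and variable names are simplified) of the forward "sum & exp" step: one thread block per row, bfloat16 logits converted to float in registers, and a block reduction of exp(x - row_max), so no float copy of the logits is ever materialized.

#include <cub/cub.cuh>
#include <cuda_bf16.h>

// Simplified sketch: one block per row; the real kernels additionally use
// vectorized (int4) loads and handle 16-byte-alignment edge cases.
template <int BlockSize>
__global__ void fused_sum_exp_sketch(const __nv_bfloat16* logits, const float* row_max,
                                     float* exp_sum, int n_dim) {
  using BlockReduceT = cub::BlockReduce<float, BlockSize>;
  __shared__ typename BlockReduceT::TempStorage temp_storage;

  const size_t row = blockIdx.x;
  const float cur_row_max = row_max[row];
  float thread_sum = 0.f;

  for (int i = threadIdx.x; i < n_dim; i += BlockSize) {
    // Convert to float inside the kernel instead of allocating a float copy.
    float val = static_cast<float>(logits[row * n_dim + i]);
    thread_sum += exp(val - cur_row_max);
  }

  float row_sum = BlockReduceT(temp_storage).Sum(thread_sum);
  if (threadIdx.x == 0) exp_sum[row] = row_sum;
}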

  Finally, after experimental verification, we found that using the CUDA kernels to optimize the current cross-entropy implementation effectively improves training speed and reduces GPU memory usage. The test results are shown below:

(Figure: cross-entropy-perf, performance and memory comparison)

As shown, there are currently two cross-entropy implementations in Megatron-LM: the original one ("original") and a second one called "fused". The "fused" implementation is faster than "original", but it consumes the most GPU memory. The two new implementations added in this PR, "Triton" and "CUDA Kernel", train faster and save more GPU memory than the existing Megatron-LM implementations.

We also compared the convergence of the four implementations and found that the loss curves were basically the same, indicating that there was no problem with the calculation accuracy.
(Figure: cross-entropy-loss, loss-curve comparison)

Notes:

  1. We trained on 8×H100 GPUs for 4 hours to run the performance and accuracy tests mentioned above.

  2. Wandb testing link: https://wandb.ai/megatron-core-moe-dev/binc-efficient-cross-entropy-no-moe?nw=nwuserbinc521

  3. Wandb report link: https://api.wandb.ai/links/megatron-core-moe-dev/m1qovycf

  4. Megatron-LM MR: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1846

Type of change

  • New feature (non-breaking change which adds functionality)

Changes

Please list the changes introduced in this PR:

  • Change A: For TE, we added three kernels: "CrossEntropyFwdSumExpKernel", "CrossEntropyFwdMeanLogKernel", and "CrossEntropyBwdKernel".
  • Change B: For Mcore, we added a new cross-entropy implementation and some conditional logic.

@cb521 cb521 changed the title Add efficient cross entropy by cuda kernel. Add efficient Cross-Entropy by cuda kernel to accelerate training speed and reduce cross-entropy memory usage during training. Jul 30, 2024
Comment on lines 13 to 18
// Efficeint memory softmax cross entropy
m.def("cross_entropy_fwd_sum_exp_cuda", &cross_entropy_forward_sum_exp,
"Softmax Cross_entropy Forward Sum & Exp");
m.def("cross_entropy_fwd_mean_log_cuda", &cross_entropy_fwd_mean_log,
"Softmax Cross_entropy Forward Mean & Log");
m.def("cross_entropy_bwd_cuda", &cross_entropy_bwd, "Softmax Cross_entropy Backward");
Member

Suggested change
// Efficeint memory softmax cross entropy
m.def("cross_entropy_fwd_sum_exp_cuda", &cross_entropy_forward_sum_exp,
"Softmax Cross_entropy Forward Sum & Exp");
m.def("cross_entropy_fwd_mean_log_cuda", &cross_entropy_fwd_mean_log,
"Softmax Cross_entropy Forward Mean & Log");
m.def("cross_entropy_bwd_cuda", &cross_entropy_bwd, "Softmax Cross_entropy Backward");
// Efficeint memory softmax cross entropy
m.def("cross_entropy_fwd_sum_exp_cuda", &cross_entropy_forward_sum_exp,
"Softmax Cross_entropy Forward Sum & Exp",
py::call_guard<py::gil_scoped_release>());
m.def("cross_entropy_fwd_mean_log_cuda", &cross_entropy_fwd_mean_log,
"Softmax Cross_entropy Forward Mean & Log",
py::call_guard<py::gil_scoped_release>());
m.def("cross_entropy_bwd_cuda", &cross_entropy_bwd, "Softmax Cross_entropy Backward",
py::call_guard<py::gil_scoped_release>());

Author

Hi @ksivaman, thank you very much for taking the time to review the code!

I have made the modifications.

@@ -0,0 +1,135 @@
import torch
Member

Suggested change
import torch
# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
import torch

@@ -0,0 +1,135 @@
import torch
Member

This should be converted to use pytest similar to the remaining testing files. We would also need to add this to qa/L0_pytorch_unittest/test.sh to run this test in the CI.

Author

Pytest has been added now, thanks.

&vocab_parallel_logits_ptr[cur_vocab_parallel_logits_ptr_begin + i]);
dtype* bf_16_p = reinterpret_cast<dtype*>(&int4_arr);
#pragma unroll
for (int k = 0; k < 8; k++) {
Member

You assume here that dtype is 2B long - could we generalize this, e.g. with

Suggested change
for (int k = 0; k < 8; k++) {
for (int k = 0; k < sizeof(int4)/sizeof(dtype); k++) {

Author
@cb521 cb521 Aug 8, 2024

Hi @ptrendx, in the logic of the corresponding Megatron-LM code, the input can only be bfloat16, and we use int4 to vectorize the reads, so we can assume that the loop bound here is 8.

Member

Well, sure, but Transformer Engine is used not only with Megatron-LM, so we should not make unnecessary assumptions about the usage (especially since it does not actually cost us anything to be more general, as the logic change is very simple). Also, such hardcoding could lead to hard-to-track errors if something actually changes in the datatypes used.
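
For reference, a minimal sketch of the generalization being suggested (the helper name and arguments are illustrative, not taken from the PR):

// Number of dtype elements packed into one 16-byte (int4) vectorized load;
// this evaluates to 8 for bfloat16 but stays correct if the dtype changes.
template <typename dtype>
__device__ inline float sum_exp_packed(const int4& packed, float row_max) {
  constexpr int kElemsPerLoad = sizeof(int4) / sizeof(dtype);
  const dtype* elems = reinterpret_cast<const dtype*>(&packed);
  float acc = 0.f;
#pragma unroll
  for (int k = 0; k < kElemsPerLoad; k++) {
    acc += exp(static_cast<float>(elems[k]) - row_max);
  }
  return acc;
}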

Comment on lines +259 to +265
1024 | 1
7 | 1016 | 2
6 | 1016 | 3

For example:
1024 | 1 -> [0,1023] [1024]
7 | 1016 | 2 -> [0, 6], [7,1022], [1023,1024]
Member

Could we have some text here in addition to the numbers? I do not quite understand what they represent.

Author

Sure, let me describe this process. @ptrendx

Because the upcoming calculations use vectorized reads, loading 8 bfloat16 values at once, I need to ensure that the data being read is 16-byte aligned. In the cross-entropy computing scenario, ndim is usually a multiple of 8.

But to also handle cases where ndim is not a multiple of 8, I compute for each row the start and end offsets of the 16-byte-aligned region, and handle each row's misaligned parts with a simple scalar loop at the end of the kernel.

In our current threading model, one block is responsible for computing one row. When ndim is not an integer multiple of 8, each row is divided into up to three parts: the first and last parts are not 16-byte aligned, while the middle part is. This way, we can read the middle part with vectorized loads and process the beginning and end separately.

For example, if ndim is 1025, which is not a multiple of 8, the 1025 elements must be split into parts. For the first block (responsible for the first row) there are only two parts: the elements at positions [0, 1023] are 16-byte aligned, and the last element must be processed separately. For the second block (responsible for the second row) there are three parts: the elements at positions [7, 1022] are 16-byte aligned, while [0, 6] and [1023, 1024] must be processed separately.

The calculation of these indexes and values can be followed alongside the code. As shown in the following layout,
1024 | 1
7 | 1016 | 2
6 | 1016 | 3

if ndim = 1025 and we have three rows, then for the first row cur_vocab_parallel_logits_ptr_begin = 0, cur_vocab_parallel_logits_ptr_end = 1025, end_mol_num = 1, begin_mol_num = 1024, begin_offset = 0, end_offset = 1023.
We define the 16-byte-aligned region as [begin_offset, end_offset]; for the first row this region is [0, 1023], i.e. the first 1024 elements can be read with vectorized loads.

Similarly, for the second row, cur_vocab_parallel_logits_ptr_begin = 1025, cur_vocab_parallel_logits_ptr_end = 2050, end_mol_num = 2, begin_mol_num = 1023, begin_offset = 7, end_offset = 1022.
For the second row this region is [7, 1022], i.e. the 1016 elements in the middle. The first 7 and last 2 elements of this row must be processed separately because they are not 16-byte aligned.

The calculation logic for the third row is the same.
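
Below is a simplified host-side sketch of this offset computation (variable names differ from the kernel; it assumes n_dim >= 8 and that the logits base pointer itself is 16-byte aligned):

#include <cstddef>
#include <cstdio>

// begin_offset/end_offset are within-row indices bounding the 16-byte-aligned
// middle region that can be read with vectorized int4 loads (8 bf16 each).
void compute_aligned_region(size_t row_idx, size_t n_dim,
                            size_t* begin_offset, size_t* end_offset) {
  const size_t elems_per_load = 8;
  size_t row_begin = row_idx * n_dim;   // first global index of the row
  size_t row_end = row_begin + n_dim;   // one past the last index of the row

  // First global index in the row that is 16-byte aligned.
  size_t aligned_begin = (row_begin + elems_per_load - 1) / elems_per_load * elems_per_load;
  // Last global index covered by a full vectorized load that stays inside the row.
  size_t aligned_end = row_end / elems_per_load * elems_per_load - 1;

  *begin_offset = aligned_begin - row_begin;  // 0, 7, 6 for the three rows above
  *end_offset = aligned_end - row_begin;      // 1023, 1022, 1021
}

int main() {
  for (size_t row = 0; row < 3; row++) {
    size_t b, e;
    compute_aligned_region(row, 1025, &b, &e);
    printf("row %zu: aligned region [%zu, %zu]\n", row, b, e);
  }
  return 0;
}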

Member

Hmm, I did not yet read the actual note (will do in a second), but such explanations should be recorded somewhere possibly easier to find than a comment in a PR. At the very least we could move them to the PR description and in the code make a comment with a link to that PR?

__shared__ typename BlockReduceT::TempStorage temp_storage;

#pragma unroll
for (size_t i = begin_offset + tid * 8; i <= end_offset - 7; i += 8 * BlockSize) {
Member

Same as below, you assume 2B per data element.

Author

Done, same as the explanation above.

size_t cur_vocab_parallel_logits_ptr_end =
rowIdx * n_dim + n_dim; //cur_vocab_parallel_logits_ptr_end = 1025, 2050

size_t end_mol_num = cur_vocab_parallel_logits_ptr_end % 8; //end_mol_num = 1, end_mol_num = 2
Member

What is end_mol_num (and similarly begin_mol_num)? Please name the variable in a way that tells the purpose of it. The comments as they are right now do not help at all and I do not understand them (why is end_mol_num 1 and then 2?).

Author

Yes, sorry, the previous comments were not easy to understand. I explained the process in my response above. @ptrendx

#pragma unroll
for (int k = 0; k < 8; k++) {
dtype data_bf16 = bf_16_p[k];
float data_fp32 = float(data_bf16); //convert to float
Member

Suggested change
float data_fp32 = float(data_bf16); //convert to float
float data_fp32 = static_cast<float>(data_bf16);

Author

Done

for (int k = 0; k < 8; k++) {
dtype data_bf16 = bf_16_p[k];
float data_fp32 = float(data_bf16); //convert to float
row_item = exp(data_fp32 - cur_row_max);
Member

Suggested change
row_item = exp(data_fp32 - cur_row_max);
row_item = expf(data_fp32 - cur_row_max);

Author

Hi @ptrendx, we need to use the non-fast version exp instead of the fast version expf here, because in our testing expf may affect the accuracy of certain calculation results. In actual Megatron-LM training tasks, we found that the loss with expf was slightly higher than our baseline.

#pragma unroll
for (size_t k = cur_vocab_parallel_logits_ptr_begin;
k < (cur_vocab_parallel_logits_ptr_begin + begin_offset); k++) {
float val = float(vocab_parallel_logits_ptr[k]);
Member

Suggested change
float val = float(vocab_parallel_logits_ptr[k]);
float val = static_cast<float>(vocab_parallel_logits_ptr[k]);

Author

Yes, @ptrendx, thank you for the reminder. I have changed all float type conversions in the code to static_cast<float>.

for (size_t k = cur_vocab_parallel_logits_ptr_begin;
k < (cur_vocab_parallel_logits_ptr_begin + begin_offset); k++) {
float val = float(vocab_parallel_logits_ptr[k]);
row_item = exp(val - cur_row_max);
Member

Suggested change
row_item = exp(val - cur_row_max);
row_item = expf(val - cur_row_max);

Author

Done

Comment on lines +323 to +324
float val = float(vocab_parallel_logits_ptr[k]);
row_item = exp(val - cur_row_max);
Member

Same as above

Author

Done


float row_sum = BlockReduceT(temp_storage).Sum(cur_thread_exp_sum);

if (threadIdx.x == 0) {
Member

Wouldn't it be better to use multiple threads to do it (and then use shfl to do reduction inside a warp)?

Author

Yes, good idea, @ptrendx. Using multiple threads should accelerate the computation here and avoid warp divergence. I think this can be treated as an independent modification once you have no issues with the other parts.

But I should add that, as mentioned before, ndim is currently known to be a multiple of 8, so generally speaking this loop logic is not entered at all.
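
For reference, a minimal sketch of the warp-level reduction being suggested (the helper and the usage comment are illustrative, not the PR's code):

// Each of the 32 threads in a warp contributes a partial sum; after the loop
// lane 0 holds the total.
__device__ inline float warp_reduce_sum(float val) {
#pragma unroll
  for (int offset = 16; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffff, val, offset);
  }
  return val;
}

// Instead of letting only threadIdx.x == 0 walk the misaligned tail, the first
// warp (threads with threadIdx.x < 32) could process it cooperatively, e.g.:
//   float partial = 0.f;
//   for (size_t k = tail_begin + threadIdx.x; k < tail_end; k += 32)
//     partial += exp(static_cast<float>(vocab_parallel_logits_ptr[k]) - cur_row_max);
//   partial = warp_reduce_sum(partial);
//   if (threadIdx.x == 0) row_sum += partial;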

Comment on lines +333 to +335
float row_item = exp(data_fp32 - cur_row_max);
row_item = row_item / cur_row_exp_sum; //compute softmax
row_item = log(row_item); //after softmax, compute log
Member

Suggested change
float row_item = exp(data_fp32 - cur_row_max);
row_item = row_item / cur_row_exp_sum; //compute softmax
row_item = log(row_item); //after softmax, compute log
float row_item = expf(data_fp32 - cur_row_max);
row_item = row_item / cur_row_exp_sum; //compute softmax
row_item = logf(row_item); //after softmax, compute log

Author

Done

}

template <typename dtype, int BlockSize>
void __global__ CrossEntropyFwdMeanLogKernel(float* mean_log_probs_ptr,
Member

I have basically the same requests as for the previous kernel.

Also, would it be possible to merge those 2 kernels via templating for example in order to not duplicate code? They seem mostly the same?

Author

Hi @ptrendx, from the perspective of the overall computational logic, these are all elementwise operations. In fact, apart from the specific elementwise op, these two kernels are indeed very similar.

But the usage scenario for these kernels is Megatron-LM, where they replace some torch ops, and the kernels are aimed at ordinary Python users. To keep the function naming clear, I suggest implementing the kernels separately. What do you think?
(screenshot: the separate Python-side wrapper functions)

Member

Right, on the Python side those would be exposed as separate functions like you have in this screenshot, no argument here. But internally the kernel code itself could be shared. Basically a structure like this:

template <typename Func>
__global__ void kernel(/* ... */) {
  // ...
  // use Func to do the per-element computation
  // ...
}

at::Tensor cross_entropy_fwd_sum_exp(/* ... */) {
  // ...
  kernel<SumExp><<<grid, block>>>(/* ... */);
  // ...
}

at::Tensor cross_entropy_fwd_mean_log(/* ... */) {
  // ...
  kernel<MeanLog><<<grid, block>>>(/* ... */);
  // ...
}
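
For illustration, one possible way to express the two per-element operations as functors plugged into such a shared kernel (names and signatures are hypothetical, not from the PR):

// Hypothetical per-element operations for the shared kernel template.
struct SumExp {
  // Contribution to the block-reduced sum of exp(x - row_max).
  __device__ static float apply(float x, float row_max, float /*row_exp_sum*/) {
    return exp(x - row_max);
  }
};

struct MeanLog {
  // Contribution to the block-reduced sum of log-softmax values
  // (the kernel would divide the reduced sum by n_dim afterwards).
  __device__ static float apply(float x, float row_max, float row_exp_sum) {
    return log(exp(x - row_max) / row_exp_sum);
  }
};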

Comment on lines +406 to +408
row = exp(row - logits_max);
row /= sum_exp_logits;
if (i == (size_t)masked_target_1d) { // i == masked_target_1d
Member

Suggested change
row = exp(row - logits_max);
row /= sum_exp_logits;
if (i == (size_t)masked_target_1d) { // i == masked_target_1d
row = expf(row - logits_max);
row /= sum_exp_logits;
if (i == static_cast<size_t>(masked_target_1d)) {

Author

Done

if (i == (size_t)masked_target_1d) { // i == masked_target_1d
row = row - softmax_update;
}
if (label_smoothing > 0) {
Member

Why not just use smoothing here - you waste a register by holding onto this value.

Author

I'm sorry, I didn't understand what you meant. Could you explain it more specifically? Thanks @ptrendx

Member

You pass both label_smoothing and smoothing to this function, whereas the smoothing value seems to be directly tied to the label_smoothing value via

smoothing = label_smoothing * vocab_size / (vocab_size - 1);

this formula. The condition label_smoothing > 0 is equivalent to smoothing > 0, right? So you could save a register by simply forgetting the label_smoothing value after the initial setting of smoothing (or even just pass the value of smoothing to the kernel in the first place, so this computation is not performed by every thread) and use smoothing instead:

if (smoothing > 0) {
  row -= smoothing * average_grad;
}

(Considering that smoothing can only be either 0 or > 0, you could even remove the conditional completely and always do row -= smoothing * average_grad;.)
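
A minimal sketch of the host-side change being suggested (the helper name is illustrative, not the PR's actual API):

// Computed once on the host so the division is not repeated by every thread,
// and the kernel only needs the single 'smoothing' value.
inline float compute_smoothing(float label_smoothing, int vocab_size) {
  return label_smoothing > 0.f ? label_smoothing * vocab_size / (vocab_size - 1) : 0.f;
}

// Launch (illustrative):
//   CrossEntropyBwdKernel<<<grid, block>>>(..., compute_smoothing(label_smoothing, vocab_size), ...);
// Inside the kernel the update can then be unconditional:
//   row -= smoothing * average_grad;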
