The debug_api can log FP8-specific statistics through FP8TensorStats. However, we noticed that when we pad the end of a sequence (by 32 for MXFP8 or 16 for NVFP4) to make its length divisible by the block size, scale_inv_min is always zero.
This is because the padding creates a consecutive block of 32 zeros, so one of the scale_inv values is 0, and therefore scale_inv_min is always zero.
See this slide
Thus, the padded batch will always contain zeros by construction, which makes scale_inv_min an uninformative metric for us.
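A minimal NumPy sketch of the effect, assuming a simplified per-block scaling scheme (block amax divided by the FP8 E4M3 max of 448; the function name `block_scale_inv` and the exact scale formula are illustrative, not Transformer Engine's actual implementation):

```python
import numpy as np

def block_scale_inv(x, block=32):
    """Toy per-block dequantization scales: amax of each block divided by
    the FP8 E4M3 max (448). An all-zero block has amax == 0, so its
    scale_inv is 0 regardless of the rest of the tensor."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1)
    return amax / 448.0

seq = np.arange(1, 97, dtype=np.float32)          # length 96, not divisible-safe for this demo
padded = np.concatenate([seq, np.zeros(32, dtype=np.float32)])  # pad to a multiple of 32

print(block_scale_inv(seq).min())     # > 0: every real block has nonzero amax
print(block_scale_inv(padded).min())  # 0.0: the all-zero padding block forces the minimum to zero
```

This is why scale_inv_min reflects the padding rather than the data whenever any padded block is entirely zero.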
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
- Transformer Engine version
- CUDA version
- cuDNN version
Device details
Additional context
Add any other context about the problem here.