The debug_api can log FP8-specific statistics through FP8TensorStats. However, we noticed that when we pad the end of a sequence (by 32 for MXFP8 or 16 for NVFP4) to make its length divisible by the block size, scale_inv_min is always zero.
This is because the padding creates a consecutive block of 32 zeros, so one of the scale_inv values is 0, and therefore scale_inv_min is always zero.
See this slide
Thus, the padded batch will always contain zeros by construction, which makes scale_inv_min an uninformative metric for us.
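A minimal NumPy sketch of the effect, assuming a simplified per-block scaling scheme (block amax divided by the FP8 E4M3 max of 448; the function name `block_scale_inv` and the exact scale formula are illustrative, not Transformer Engine's actual implementation):

```python
import numpy as np

def block_scale_inv(x, block=32):
    """Toy per-block dequantization scales: amax of each block divided by
    the FP8 E4M3 max (448). An all-zero block has amax == 0, so its
    scale_inv is 0 regardless of the rest of the tensor."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1)
    return amax / 448.0

seq = np.arange(1, 97, dtype=np.float32)          # length 96, not divisible-safe for this demo
padded = np.concatenate([seq, np.zeros(32, dtype=np.float32)])  # pad to a multiple of 32

print(block_scale_inv(seq).min())     # > 0: every real block has nonzero amax
print(block_scale_inv(padded).min())  # 0.0: the all-zero padding block forces the minimum to zero
```

This is why scale_inv_min reflects the padding rather than the data whenever any padded block is entirely zero.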
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
- Transformer Engine version
- CUDA version
- cuDNN version
Device details
Additional context
Add any other context about the problem here.