Conversation

@nikita-savelyevv (Collaborator)

Changes

Reason for changes

UX improvement. For phi3-mini-4k-instruct in FP8_E4M3 mode, I get about a 17x time reduction: 191 sec -> 11 sec.

Related tickets

Tests

@github-actions bot added the NNCF OpenVINO label on Nov 20, 2025
_f16_to_f8e4m3_bits_vec = np.vectorize(_f16_to_f8e4m3_bits_scalar, otypes=[np.uint8])


def fp32_to_fp8e4m3_values(x: np.ndarray) -> np.ndarray:
@nikita-savelyevv (Collaborator, Author):
The reference conversion implementation unfortunately looks a bit ugly. It also can't be done within our nncf.Tensor framework yet, because _f16_to_f8e4m3_bits_scalar has to be vectorized with NumPy.

The reason behind all this is that the fp32 -> f8e4m3 conversion is done through fp16, i.e. fp32 -> fp16 -> f8e4m3. It is implemented this way on the OpenVINO (and oneDNN) side because it is more hardware-friendly. Please see the following links; an illustrative sketch follows the list below:

  1. OpenVINO implementation: https://github.com/openvinotoolkit/openvino/blame/master/src/core/src/type/float8_e4m3.cpp
  2. PR that added the 2-step conversion: "Implement 2-step conversion from fp32 to fp8" (openvino#28501)
