
Conversation

@quic-morteza
Contributor

This fix is needed and has been tested successfully; it allows the replicate_kv_heads.py script to operate on FP8-Dynamic models.

 def mutate(cls, original_module, parent_module):
     # -- de-quantizing the weights --
-    dequant_weights = original_module.weight.to(torch.float32) * original_module.weight_scale
+    dequant_weights = original_module.weight.to(torch.float32)  # * original_module.weight_scale
Contributor

Why is this removed?

 # Only inference supported
 with torch.no_grad():
-    dequantized_weights = self.weight.to(torch.float32) * self.weight_scale
+    dequantized_weights = self.weight.to(torch.float32)  # * self.weight_scale
Contributor

Why is this removed?
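
For context on these two questions: in per-tensor FP8 schemes such as FP8-Dynamic, the stored float8 weights are the original values divided by weight_scale, so casting to float32 alone does not recover them; the multiply is what restores the real magnitudes. A minimal sketch of this, with an assumed float8 dtype and a made-up scale (not values from the actual model):

import torch

# Assumed per-tensor FP8 layout: a float8 weight plus one scale factor.
weight_fp32 = torch.randn(8, 16)
weight_scale = torch.tensor(0.05)  # made-up scale for illustration

# Quantize: store weight / scale in float8.
weight_fp8 = (weight_fp32 / weight_scale).to(torch.float8_e4m3fn)

# Dequantize: the cast alone only widens the dtype; multiplying by
# weight_scale is what recovers the original weights (up to rounding).
dequant_weights = weight_fp8.to(torch.float32) * weight_scale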

Comment on lines +56 to +59
if layer.bias is not None:
    layer.bias.data = torch.repeat_interleave(layer.bias.data.view(orig_kv_heads, head_dim), repeat, 0).view(
        new_kv_heads * head_dim
    )
Contributor

LGTM
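
For reference, a toy run of the bias replication quoted above, with made-up sizes (orig_kv_heads = 2, new_kv_heads = 4, head_dim = 3); each head's bias slice is repeated in place rather than tiled at the end:

import torch

orig_kv_heads, new_kv_heads, head_dim = 2, 4, 3  # made-up sizes
repeat = new_kv_heads // orig_kv_heads

# Bias laid out per head: [head0 | head1] = [0, 1, 2 | 3, 4, 5].
bias = torch.arange(orig_kv_heads * head_dim, dtype=torch.float32)

replicated = torch.repeat_interleave(bias.view(orig_kv_heads, head_dim), repeat, 0).view(new_kv_heads * head_dim)
print(replicated)  # tensor([0., 1., 2., 0., 1., 2., 3., 4., 5., 3., 4., 5.])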

@quic-morteza
Contributor Author

I investigated my code changes further and noticed that the output results are inconsistent with the ground truth. Therefore, my changes are invalid, and I will close this PR.
