
Conversation

@helena-intel
Collaborator

This is #1297, updated to the latest main branch.

Currently, inference on Phi-3-mini and Phi-4-mini returns bad outputs (random characters) when the context gets larger than about 2000 tokens. This PR, contributed by @eaidova, fixes that. This is not my code; the original PR is no longer being updated, so I'm opening this new PR to make it easier to discuss and add updates.

I saw no negative impact on inference speed. I see slightly different outputs with shorter contexts on SPR (on inference with the model exported with the PR vs the model exported with main). Any suggestions to fix that would be much appreciated.
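
For anyone who wants to reproduce the issue or check the fix, a minimal sketch (the checkpoint and prompt construction here are assumptions; any prompt over roughly 2000 tokens should trigger the bad outputs without this PR):

```python
# Minimal reproduction sketch; model id and prompt length are assumptions.
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

# Build a prompt well over ~2000 tokens to cross the problematic threshold.
prompt = "The quick brown fox jumps over the lazy dog. " * 300
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```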

Draft PR for now, awaiting some feedback and testing, but I hope we can merge this soon.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@helena-intel helena-intel added the openvino-slow Runs OpenVINO slow tests with different versions of transformers label Oct 30, 2025
return attn_output, None, past_key_value


# @torch.jit.script
Collaborator

I think it makes sense to add a test with a long prompt. Can this issue be reproduced on the tiny model?
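
For example, such a test could look roughly like this (the test name, checkpoint, prompt length, and assertion are assumptions; a tiny random Phi-3 checkpoint would only help if the issue reproduces there):

```python
# Hypothetical test sketch; names, model id, and thresholds are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.intel import OVModelForCausalLM


def test_phi3_long_prompt_matches_reference():
    model_id = "microsoft/Phi-3-mini-128k-instruct"  # or a tiny Phi-3 checkpoint, if the issue reproduces there
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("word " * 2500, return_tensors="pt")  # well over ~2000 tokens

    ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)
    ref_model = AutoModelForCausalLM.from_pretrained(model_id)

    ov_out = ov_model.generate(**inputs, max_new_tokens=20, do_sample=False)
    ref_out = ref_model.generate(**inputs, max_new_tokens=20, do_sample=False)
    assert tokenizer.decode(ov_out[0]) == tokenizer.decode(ref_out[0])
```

The exact-match assertion may need to be relaxed given the small SPR differences mentioned above.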

@nikita-savelyevv
Collaborator

> I see slightly different outputs with shorter contexts on SPR (on inference with the model exported with the PR vs the model exported with main).

I believe minor differences are expected on SPR. But if possible, WWB similarity should be run to see if the difference is significant or not.

@helena-intel helena-intel marked this pull request as ready for review October 31, 2025 10:24
return attn_output, None, past_key_value


# @torch.jit.script
Collaborator

Please remove unneeded comments and commented-out code.

):
self._model.config.max_position_embeddings = self._model.config.original_max_position_embeddings

# currently, long RoPE can not be traced for long context support, disable it to avoid potential accuracy issues
Collaborator

Now I think we don't need this comment, since the problem is solved by this PR.

if hasattr(self, "max_position_embeddings")
else self.config.max_position_embeddings
)
inv_freq = select_ext_factor(seq_len, original_max_position_embeddings, self.inv_freq, self.long_inv_freq)
@rkazants (Collaborator) Nov 3, 2025

Let us add a comment:

Slow down all frequencies by a scale factor for long prompts; this makes attention more stable and preserves model accuracy.
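
Roughly where it could go (a sketch; the signature matches the helper shown in this diff, the comment wording follows the suggestion above):

```python
import torch


def select_ext_factor(seq_len, max_pos_embeddings, short_factor, long_factor):
    # Slow down all frequencies by the scale factor for long prompts: switching to
    # the long inverse frequencies makes attention more stable and preserves accuracy.
    return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)
```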

elif self._model.config.max_position_embeddings != getattr(
self._model.config, "original_max_position_embeddings", self._model.config.max_position_embeddings
):
self._model.config.max_position_embeddings = self._model.config.original_max_position_embeddings
Collaborator

Shall we save the original value of max_position_embeddings and restore it in the __exit__ method?
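
A sketch of that suggestion (attribute name and exact placement are assumptions; this would sit inside the existing patcher class mentioned later in this thread):

```python
# Hypothetical sketch: remember the original value in __enter__ and restore it in __exit__.
class Phi3ModelPatcher(OVDecoderModelPatcher):
    def __enter__(self):
        super().__enter__()
        config = self._model.config
        self._orig_max_position_embeddings = config.max_position_embeddings
        if getattr(config, "original_max_position_embeddings", None) is not None:
            config.max_position_embeddings = config.original_max_position_embeddings

    def __exit__(self, exc_type, exc_value, traceback):
        super().__exit__(exc_type, exc_value, traceback)
        self._model.config.max_position_embeddings = self._orig_max_position_embeddings
```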

logits_to_keep=None,
**kwargs,
):
# Overwritten -- this model may need to switch between short and long rope, invalidating the cache in the
Collaborator

Am I correct that we have a problem when there are short and long prompts in consecutive generate calls? We can't re-initialize inv_freqs from long_inv_freqs to short_inv_freqs and vice versa, so how is this problem solved?

self._model.model._orig_forward = self._model.model.forward
self._model.model.forward = types.MethodType(phi3_442_forward, self._model.model)

# init inv_freq for torchscript tracing for PhiMoE
Collaborator

This comment about torchscript tracing seems out of place here. Please revise.

@echarlaix (Collaborator) left a comment

Thanks a lot @helena-intel !!



class OVPhi3ForCausalLM(OVModelForCausalLM):
def prepare_inputs_for_generation(

Comment on lines -1593 to +1648
super().__enter__()
# Call OVDecoderModelPatcher.__enter__() directly to skip Phi3ModelPatcher's longrope logic
# PhiMoE has a different rotary embedding structure, longrope is not yet supported
Collaborator

Why do we need to add all these modifications to PhiMoEModelPatcher? If longrope is not yet supported, then self._model.model.rotary_emb will never be set to "longrope". If we want to make sure, we can raise an error in case that ever happens.

Collaborator Author

Initially, tests failed for phi_moe, see https://github.com/huggingface/optimum-intel/actions/runs/18952102871/job/54119192964 . We should have longrope support for the MoE model too, but not in this PR. I would be happy with a simpler solution that does not enable longrope for the MoE model (but keeps it working as it does now).
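
One simpler option in that spirit might be a guard instead of the extra patching (a sketch; the exact config fields checked are assumptions):

```python
# Hypothetical sketch: leave PhiMoE on the existing path and fail loudly if a
# longrope-scaled checkpoint ever shows up, rather than patching its rotary embedding.
rope_scaling = getattr(self._model.config, "rope_scaling", None) or {}
if rope_scaling.get("rope_type", rope_scaling.get("type")) == "longrope":
    raise NotImplementedError("longrope is not yet supported for PhiMoE export")
```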

return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)


def long_rope(self, x, position_ids, seq_len=None):

scaling_factor = 1.0
else:
scaling_factor = math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))
cos = emb.cos() * scaling_factor
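
For context, with the values from the public Phi-3-mini 128k config (original_max_position_embeddings = 4096, max_position_embeddings = 131072, so scale = 32; these numbers are assumptions about the checkpoint, not taken from this diff), the attention scaling factor comes out close to 1.19:

```python
import math

original_max_position_embeddings = 4096   # assumed Phi-3-mini 128k value
max_position_embeddings = 131072          # assumed Phi-3-mini 128k value
scale = max_position_embeddings / original_max_position_embeddings  # 32.0

scaling_factor = math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))
print(scaling_factor)  # ~1.19; cos/sin are scaled up slightly for long contexts
```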

Comment on lines +1519 to +1522
# Force float32 since bfloat16 loses precision on long contexts
# See https://github.com/huggingface/transformers/pull/29285
device_type = x.device.type
device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
Collaborator

device_type is not used here. Also, should we ensure fp32 dtype?
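
For reference, in the upstream transformers rotary embedding (after the PR linked above), device_type is used to disable autocast so the cos/sin are computed in float32. A self-contained sketch of that pattern (function name and shapes are assumptions, adapted from transformers rather than from this diff):

```python
import torch


def rope_cos_sin(x, inv_freq, position_ids, scaling_factor=1.0):
    # (batch, dim/2, 1) @ (batch, 1, seq) -> per-position rotation angles.
    inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
    position_ids_expanded = position_ids[:, None, :].float()
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    # Disable autocast so cos/sin stay in float32 even when the model runs in bf16/fp16.
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos() * scaling_factor
        sin = emb.sin() * scaling_factor
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
```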
