Fix Phi long context issue #1504
Conversation
    return attn_output, None, past_key_value
    ...
    # @torch.jit.script
I think it makes sense to add a test with a long prompt. Is this issue reproduced on a tiny model?
I believe minor differences are expected on SPR. But if possible, WWB similarity should be run to see whether the difference is significant or not.
    return attn_output, None, past_key_value
    ...
    # @torch.jit.script
Please remove the unneeded comments and commented-out code.
    ):
    self._model.config.max_position_embeddings = self._model.config.original_max_position_embeddings
    ...
    # currently, long RoPE can not be traced for long context support, disable it to avoid potential accuracy issues
Now I think we don't need this comment; the problem is solved by this PR.
    if hasattr(self, "max_position_embeddings")
    else self.config.max_position_embeddings
    )
    inv_freq = select_ext_factor(seq_len, original_max_position_embeddings, self.inv_freq, self.long_inv_freq)
Let's add a comment:
slow down all frequencies by a scale factor for long prompts; this makes attention more stable, i.e. it preserves model accuracy.
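For reference, a minimal sketch of what select_ext_factor appears to do, inferred from the quoted diff (the real helper lives in this PR; treat this as illustration only):

```python
import torch

def select_ext_factor(seq_len, max_pos_embeddings, short_factor, long_factor):
    # Pick the short- or long-context inverse frequencies based on the current
    # sequence length; torch.where keeps the selection inside the traced graph,
    # so both branches survive export.
    return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)
```

Here seq_len is expected to be a tensor so the comparison stays traceable.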
    elif self._model.config.max_position_embeddings != getattr(
        self._model.config, "original_max_position_embeddings", self._model.config.max_position_embeddings
    ):
        self._model.config.max_position_embeddings = self._model.config.original_max_position_embeddings
Shall we save the original value of max_position_embeddings and restore it in the __exit__ method?
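A minimal sketch of that suggestion, with illustrative names rather than the PR's actual patcher class:

```python
class LongRopeConfigPatcherSketch:
    """Illustrative only: restore config.max_position_embeddings when patching ends."""

    def __init__(self, model):
        self._model = model

    def __enter__(self):
        config = self._model.config
        # remember the value that was set before patching
        self._saved_max_position_embeddings = config.max_position_embeddings
        if hasattr(config, "original_max_position_embeddings"):
            config.max_position_embeddings = config.original_max_position_embeddings
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # put the saved value back so the config is unchanged after export
        self._model.config.max_position_embeddings = self._saved_max_position_embeddings
```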
    logits_to_keep=None,
    **kwargs,
    ):
    # Overwritten -- this model may need to switch between short and long rope, invalidating the cache in the
Am I correct that we have a problem when we have short and long prompts in consecutive generate calls? We can't re-initialize inv_freqs from long_inv_freqs to short_inv_freqs and vice versa? How is this problem solved?
    self._model.model._orig_forward = self._model.model.forward
    self._model.model.forward = types.MethodType(phi3_442_forward, self._model.model)
    ...
    # init inv_freq for torchscript tracing for PhiMoE
The comment about torchscript tracing looks out of place. Please revise it.
echarlaix left a comment:
Thanks a lot @helena-intel !!
    class OVPhi3ForCausalLM(OVModelForCausalLM):
        def prepare_inputs_for_generation(
Would you mind adding a link to the original code?
https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/phi3/modeling_phi3.py#L493
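For readers following the thread: the upstream override appears to exist because crossing the original context window invalidates a cache built with the short-rope frequencies. A rough sketch of that idea (illustrative only, not the exact transformers or PR code):

```python
def crosses_long_rope_threshold(past_length, new_tokens, original_max_position_embeddings):
    # The cached keys/values were computed with the short-rope frequencies; once the
    # running length passes the original window, they must be dropped and rebuilt
    # with the long-rope frequencies.
    return (
        past_length <= original_max_position_embeddings
        and past_length + new_tokens > original_max_position_embeddings
    )
```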
    super().__enter__()
    # Call OVDecoderModelPatcher.__enter__() directly to skip Phi3ModelPatcher's longrope logic
    # PhiMoE has a different rotary embedding structure, longrope is not yet supported
Why do we need to add all these modifications to PhiMoEModelPatcher? If longrope is not yet supported, then self._model.model.rotary_emb will never be set to "longrope". If we want to make sure, we can raise an error in case that ever happens.
Initially, tests failed for phi_moe, see https://github.com/huggingface/optimum-intel/actions/runs/18952102871/job/54119192964. We should have longrope support for the MoE model too, but not in this PR. I would be happy with a simpler solution that does not enable longrope for the MoE model (but keeps it working as it does now).
    return torch.where(seq_len <= max_pos_embeddings, short_factor, long_factor)
    ...
    def long_rope(self, x, position_ids, seq_len=None):
Would you mind adding a link to the original code? (https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/models/phi3/modeling_phi3.py#L324)
        scaling_factor = 1.0
    else:
        scaling_factor = math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))
    cos = emb.cos() * scaling_factor
Can't we use self.attention_scaling here? https://github.com/huggingface/transformers/blob/63fbd50fb4ff7b586ab1b59b67f7464e62f9df69/src/transformers/modeling_rope_utils.py#L519
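For comparison, the factor computed in the quoted lines matches the longrope attention factor in transformers (a sketch; scale = max_position_embeddings / original_max_position_embeddings is assumed, as in the upstream code):

```python
import math

def longrope_attention_scaling(max_position_embeddings, original_max_position_embeddings):
    # scale > 1 means the context window was extended beyond the original one
    scale = max_position_embeddings / original_max_position_embeddings
    if scale <= 1.0:
        return 1.0
    return math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))
```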
    # Force float32 since bfloat16 loses precision on long contexts
    # See https://github.com/huggingface/transformers/pull/29285
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
This variable is not used; also, should we ensure fp32 dtype here?
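For context, device_type is normally consumed a few lines later in the upstream rotary embedding; a sketch of that pattern (based on the transformers implementation referenced above, not the PR's exact code):

```python
import torch

def rope_cos_sin_fp32(inv_freq, position_ids, x):
    # expand inv_freq to (batch, dim/2, 1) and position_ids to (batch, 1, seq)
    inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
    position_ids_expanded = position_ids[:, None, :].float()
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    # disable autocast so the matmul and trig run in float32 even under bf16 autocast
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos, sin = emb.cos(), emb.sin()
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
```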
This is #1297 updated to the latest main branch.
Currently, inference on Phi-3-mini and Phi-4-mini returns bad outputs (random characters) when the context grows larger than about 2000 tokens. This PR, contributed by @eaidova, fixes that. This is not my code; the original PR is no longer being updated, so I'm opening this new PR to make it easier to discuss and add updates.
I saw no negative impact on inference speed. I do see slightly different outputs with shorter contexts on SPR (inference with the model exported with this PR vs. the model exported with main). Any suggestions to fix that would be much appreciated.
This is a draft PR for now, awaiting some feedback and testing, but I hope we can merge it soon.