Add Phi-4 Backbone #2272
base: master
Conversation
I uploaded the output from the KerasHub model. Could someone upload an HF version that I can compare against and add to the Colab?
P.S. A lot of this code is based on the existing code for Phi-3 (the technical report states it mostly follows the Phi-3-Medium architecture; I simply made the changes from the report and the reference implementation). Should I refactor it to inherit from
Thanks for the PR! I've added my review comments.
```python
'`rope_scaling_type` must be `None` or `"su"`.'
"if `None` is choosed, `RotaryEmbedding` will be used."
'if `"su"` is choosed, `Phi4SuScaledRotaryEmbedding` will be '
"used."
```
Maybe change this to --> "rope_scaling_type must be None or su. If None, RotaryEmbedding will be used. If su, Phi4SuScaledRotaryEmbedding will be used."
Add backticks wherever necessary.
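For concreteness, a sketch of how the reworded message could read at the point where the value is validated (the surrounding check is illustrative, not the PR's exact code):

```python
# Hypothetical validation check; only the message wording follows the suggestion above.
if rope_scaling_type not in (None, "su"):
    raise ValueError(
        '`rope_scaling_type` must be `None` or `"su"`. '
        "If `None`, `RotaryEmbedding` will be used. "
        'If `"su"`, `Phi4SuScaledRotaryEmbedding` will be used. '
        f"Received: rope_scaling_type={rope_scaling_type}"
    )
```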
```python
vocabulary_size (int): The size of the token vocabulary. Defaults to
    `100_352`.
```
Change this to --> vocabulary_size: int. The size of the token vocabulary. Defaults to `100_352`.
Follow the above arg pattern for the other args as well. I know this follows the same style as Phi-3, but this will be consistent with the majority of our models.
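A sketch of that suggested arg style applied to a couple of the backbone's args (values taken from the diff in this PR; the descriptions are illustrative):

```python
class Phi4Backbone:
    """Phi-4 core network with hyperparameters.

    Args:
        vocabulary_size: int. The size of the token vocabulary. Defaults to
            `100_352`.
        hidden_dim: int. The size of the transformer hidden state. Defaults
            to `5120`.
        intermediate_dim: int. The output dimension of the feedforward
            network's first dense layer. Defaults to `17_920`.
    """
```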
```python
@pytest.mark.extra_large
def test_all_presets(self):
    for preset in Phi4Backbone.presets:
        self.run_preset_test(
            cls=Phi4Backbone,
            preset=preset,
            input_data=self.input_data,
        )
```
Roughly how big are these models, and how many presets are we testing here?
```python
hidden_dim=5120,
intermediate_dim=17_920,
num_query_heads=40,
num_key_value_heads=10,
activation="silu",
layer_norm_epsilon=1e-5,
kernel_initializer="glorot_uniform",
dropout=0,
max_sequence_length=16_384,
pretraining_sequence_length=16_384,
rope_max_wavelength=250_000,
```
Are these default values mostly common across all the Phi-4 presets? If not, maybe we can remove the defaults.
```python
# TODO: Deprecate this in favor of
# `keras.layers.LayerNormalization(rms_scaling=True)` once Keras 2 support is
# removed.
```
We don't support Keras 2 anymore, so either update the code or remove/update this comment.
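For reference, a minimal sketch of what "update the code" could mean here, assuming the custom layer is only ever used for RMS-style scaling (the epsilon value mirrors the backbone's `layer_norm_epsilon` default):

```python
import keras

# Keras 3's LayerNormalization supports RMS-style scaling directly,
# so the custom RMS-norm layer could potentially be replaced with:
layer_norm = keras.layers.LayerNormalization(
    epsilon=1e-5,
    rms_scaling=True,  # skip mean-centering, i.e. RMSNorm behavior
)
```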
How similar is Phi-4 to Phi-3? Approximately what percentage of the code can we reuse?
I guess the Tokenizer, the CausalLM, and the preset file with the necessary test files still need to be added?
Actually, I think we might get away with directly subclassing.
The only differences are the model name and the defaults; initially I did this copy anticipating architectural changes, but it seems the only ones are in the attention. From the paper's Section 3:
I could not find this sliding window attention in the code, however, so that also remained unchanged, and the tokenizer would be part of the third PR (based on the contributing guidelines). Do you think it would be better if I just did that instead?
Yeah, this is an interesting question if the only thing that is really changing is the tokenizer. Would it work to just subclass all the Phi-3 classes as stubs that only update the classes? Like this:

```python
@keras_hub_export("keras_hub.models.Phi4CausalLM")
class Phi4CausalLM(Phi3CausalLM):
    backbone_cls = Phi4Backbone
    preprocessor_cls = Phi4CausalLMPreprocessor
```

And then define a new tokenizer? I don't think we need to worry about switching the defaults; we can just reflect those in the preset configs we upload to Kaggle. Might be worth trying all of that in a single PR and trying to convert a model to see if everything works. Since we'd mostly be dealing with stub classes, it shouldn't be too much code.
Yes, that makes sense to me. I'll try getting it out over the weekend.
@yrahul3910 - are you still working on this?
Yes, so sorry for the delay! I'll push some commits soon.
/gemini review
Code Review
This pull request introduces the Phi-4 backbone and its associated components. The implementation is solid and well-structured. I've identified a few minor areas for improvement, mainly related to typos and redundant code. Addressing these will improve the clarity and maintainability of the new model.
```python
rope_scaling_short_factor List[float]: List of factors used to adjust
    rope frequencies when the `rope_scaling_type` is `"su"`. List must
    be of length `hidden_dim//num_query_heads//2`. It is used when
    `sequence_length` is smaller than `original_max_sequence_length`.
```
In the docstring for `rope_scaling_short_factor`, `original_max_sequence_length` is used. This should be `pretraining_sequence_length` to align with the parameter name and its usage in the rotary embedding implementation. Apply this same correction to the docstring for `rope_scaling_long_factor` on line 65.
Suggested change:

```diff
-`sequence_length` is smaller than `original_max_sequence_length`.
+`sequence_length` is smaller than `pretraining_sequence_length`.
```
```python
self.max_sequence_length = max_sequence_length
self.pretraining_sequence_length = pretraining_sequence_length
self.rope_max_wavelength = rope_max_wavelength
self.rope_scaling_type = rope_scaling_type
```
```python
inverese_freq_short_factor=self.rope_scaling_short_factor,
inverese_freq_long_factor=self.rope_scaling_long_factor,
```
Typo in the keyword arguments `inverese_freq_short_factor` and `inverese_freq_long_factor`. They should be `inverse_freq_short_factor` and `inverse_freq_long_factor`, respectively. This change is dependent on correcting the corresponding typo in the `Phi4SuScaledRotaryEmbedding` class definition first.
Suggested change:

```diff
-inverese_freq_short_factor=self.rope_scaling_short_factor,
-inverese_freq_long_factor=self.rope_scaling_long_factor,
+inverse_freq_short_factor=self.rope_scaling_short_factor,
+inverse_freq_long_factor=self.rope_scaling_long_factor,
```
```python
inverese_freq_short_factor List[float]: List of factors used to adjust
    rope frequencies when the `rope_scaling_type` is `"su"`. List must
    be of length `hidden_dim//num_query_heads//2`. It is used when
    `sequence_length` is smaller than `original_max_sequence_length`.
```
In the docstring, `original_max_sequence_length` is mentioned. This should be `pretraining_sequence_length` to match the class's parameter and its usage. This correction should also be applied on line 19.
Suggested change:

```diff
-`sequence_length` is smaller than `original_max_sequence_length`.
+`sequence_length` is smaller than `pretraining_sequence_length`.
```
```python
inverese_freq_short_factor,
inverese_freq_long_factor,
```
Typo in the parameter names: `inverese` should be `inverse`. This affects `inverese_freq_short_factor` and `inverese_freq_long_factor`.
Correct this here and propagate throughout the class in the following places:
- The `__init__` method body (lines 64-65)
- The `_compute_cos_sin_embedding` method (lines 78 and 83)
- The `get_config` method (lines 120-121)
- The docstring (lines 12 and 16)

Fixing this will improve API clarity and code maintainability.
Suggested change:

```diff
-inverese_freq_short_factor,
-inverese_freq_long_factor,
+inverse_freq_short_factor,
+inverse_freq_long_factor,
```
Description of the change
This is the first PR toward contributing the Phi-4 model to KerasHub; it includes the backbone and its test file.
Reference
Colab Notebook
I've had some trouble getting this part to work, so I need some help. This is my Colab notebook, but the HF model has been pretty annoying to run. On CPU machines, it seems to constantly allocate all available memory (I gave up after giving it 280 GB), and on an H200 on Modal, I couldn't get an output after 15 minutes. In the notebook, the line at the bottom is the one I have trouble with.
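For reference, a minimal sketch of how the HF side of that comparison might be run (assuming the `microsoft/phi-4` checkpoint and sufficient memory; this is not the notebook's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"  # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("What is Keras?", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# The last hidden states are what the KerasHub backbone output would be compared against.
print(outputs.hidden_states[-1].shape)
```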
Checklist