Add Phi-4 Backbone #2272

yrahul3910 · 2025-05-27T04:35:06Z

Description of the change

This is the first PR in contributing the Phi-4 model to KerasHub, and includes the backbone and its test file.

Reference

Colab Notebook

I've had some trouble getting this part to work, so I need some help. This is my Colab notebook, but the HF model has been pretty annoying to run. On CPU machines, it seems to constantly allocate all available memory (I gave up after giving it 280GB), and on an H200 on Modal, I couldn't get an output after 15 minutes. In the notebook, this line:

hf_output = pt_model(**hf_sample_input)

at the bottom is the one I have trouble with.

Checklist

I have added all the necessary unit tests for my change.
I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
My PR is based on the latest changes of the main branch (if unsure, rebase the code).
I have followed the Keras Hub Model contribution guidelines in making these changes.
I have followed the Keras Hub API design guidelines in making these changes.
I have signed the Contributor License Agreement.

yrahul3910 · 2025-05-27T04:39:06Z

I uploaded the output from the KerasHub model. Could someone upload an HF version that I can compare against and add to the Colab?

yrahul3910 · 2025-05-27T04:41:44Z

P.S. A lot of this code is based on the existing code for Phi-3 (the technical report states it mostly follows the Phi-3-Medium architecture; I simply made the changes from the report and the reference implementation). Should I refactor it to inherit from Phi3Backbone?

sachinprasadhs

Thanks for the PR! I've added my review comments.

sachinprasadhs · 2025-05-30T18:00:35Z

keras_hub/src/models/phi4/phi4_attention.py

+                '`rope_scaling_type` must be `None` or `"su"`.'
+                "if `None` is choosed, `RotaryEmbedding` will be used."
+                'if `"su"` is choosed, `Phi4SuScaledRotaryEmbedding` will be '
+                "used."


Maybe change this to: "rope_scaling_type must be None or su. If None, RotaryEmbedding will be used. If su, Phi4SuScaledRotaryEmbedding will be used."
Add backticks wherever necessary.
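
A minimal sketch of how the reworded message could read inside a validation helper. The function name `validate_rope_scaling_type` and the trailing "Received" clause are hypothetical and not part of the PR; only the allowed values (`None`, `"su"`) and the two class names come from the diff above.

```python
def validate_rope_scaling_type(rope_scaling_type):
    """Raise a ValueError for unsupported values (hypothetical helper)."""
    if rope_scaling_type not in (None, "su"):
        raise ValueError(
            '`rope_scaling_type` must be `None` or `"su"`. '
            "If `None`, `RotaryEmbedding` will be used. "
            'If `"su"`, `Phi4SuScaledRotaryEmbedding` will be used. '
            f"Received: rope_scaling_type={rope_scaling_type!r}"
        )
    return rope_scaling_type


# Supported values pass through unchanged.
validate_rope_scaling_type(None)
validate_rope_scaling_type("su")
```

Each sentence of the message ends with a period and a space, so the concatenated string literals do not run together the way the original fragments do.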

sachinprasadhs · 2025-05-30T18:48:47Z

keras_hub/src/models/phi4/phi4_backbone.py

+        vocabulary_size (int): The size of the token vocabulary. Defaults to
+            `100_352`.


Change this to: vocabulary_size: int. The size of the token vocabulary. Defaults to `100_352`.

Follow the above arg pattern for the others as well. I know this follows the same pattern as Phi-3, but this will be consistent with the majority of our models.
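
For illustration only, the requested `name: type. Description.` pattern applied to two arguments might look like the sketch below. The function `describe_backbone` is a toy stand-in, and the wording of the `hidden_dim` description is an assumption; the default values are taken from the diffs in this PR.

```python
def describe_backbone(vocabulary_size=100_352, hidden_dim=5120):
    """Toy function showing the `name: type. Description.` arg style.

    Args:
        vocabulary_size: int. The size of the token vocabulary. Defaults to
            `100_352`.
        hidden_dim: int. The size of the hidden state (illustrative
            wording). Defaults to `5120`.
    """
    return {"vocabulary_size": vocabulary_size, "hidden_dim": hidden_dim}
```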

sachinprasadhs · 2025-05-30T20:01:52Z

keras_hub/src/models/phi4/phi4_backbone_test.py

+    @pytest.mark.extra_large
+    def test_all_presets(self):
+        for preset in Phi4Backbone.presets:
+            self.run_preset_test(
+                cls=Phi4Backbone,
+                preset=preset,
+                input_data=self.input_data,
+            )


How big are these models typically, and how many presets are we testing here?

sachinprasadhs · 2025-05-30T20:03:54Z

keras_hub/src/models/phi4/phi4_decoder.py

+        hidden_dim=5120,
+        intermediate_dim=17_920,
+        num_query_heads=40,
+        num_key_value_heads=10,
+        activation="silu",
+        layer_norm_epsilon=1e-5,
+        kernel_initializer="glorot_uniform",
+        dropout=0,
+        max_sequence_length=16_384,
+        pretraining_sequence_length=16_384,
+        rope_max_wavelength=250_000,


Are these default values common to most of the Phi-4 presets? If not, maybe we can remove the default values.

sachinprasadhs · 2025-05-30T20:06:46Z

keras_hub/src/models/phi4/phi4_layernorm.py

+# TODO: Deprecate this in favor of
+# `keras.layers.LayerNormalization(rms_scaling=True)` once Keras 2 support is
+# removed.


We don't have Keras 2 support anymore, so either update the code or remove/update this comment.

sachinprasadhs · 2025-05-30T20:11:56Z

P.S. A lot of this code is based on the existing code for Phi-3 (the technical report states it mostly follows the Phi-3-Medium architecture; I simply made the changes from the report and the reference implementation). Should I refactor it to inherit from Phi3Backbone?

How similar is Phi-4 to Phi-3? Approximately what percentage of the code can we reuse?

sachinprasadhs · 2025-05-30T20:55:30Z

I guess the Tokenizer, CausalLM, and preset file, along with the necessary test files, still need to be added?

yrahul3910 · 2025-06-01T19:38:56Z

Actually, I think we might get away with directly subclassing Phi3Backbone and changing the defaults. If you do

diff src/models/phi3/phi3_backbone.py src/models/phi4/phi4_backbone.py

the only differences are the model name and the defaults. I initially made this copy anticipating architectural changes, but it seems the only ones are in the attention. From Section 3 of the paper:

The architecture closely follows phi-3-medium, except that we now use the tiktoken tokenizer (for better
multilingual support) with a padded vocabulary size of 100,352 (including unused tokens) and we use
full attention over the 4K context length, rather than a 2K sliding window used in phi-3-medium.

However, I could not find this sliding-window attention in the code, so that part also remained unchanged, and the tokenizer would be part of the third PR (based on the contributing guidelines). Do you think it would be better if I just subclassed Phi3Backbone instead?
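
To make the distinction in the quoted passage concrete, here is a small illustrative sketch in plain Python (not KerasHub code) contrasting a causal full-attention mask with a causal sliding-window mask. The helper names and the tiny sizes are assumptions chosen for readability.

```python
def causal_mask(seq_len):
    """Full causal attention: position i may attend to every j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal sliding window: position i attends only to the most recent
    `window` positions, i.e. those j with i - window < j <= i."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


full = causal_mask(4)
windowed = sliding_window_mask(4, window=2)
# Last row: full attention sees all 4 positions, the window sees only 2.
print(sum(full[3]), sum(windowed[3]))  # 4 2
```

If an implementation never applies the second kind of mask, then moving from a sliding window to full attention requires no code change, which is consistent with the observation above.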

mattdangerw · 2025-06-13T19:35:51Z

Yeah, this is an interesting question if the only thing that is really changing is the tokenizer. Would it work to just subclass all the Phi-3 classes as stubs that only update the class references? Like this:

@keras_hub_export("keras_hub.models.Phi4CausalLM")
class Phi4CausalLM(Phi3CausalLM):
    backbone_cls = Phi4Backbone
    preprocessor_cls = Phi4CausalLMPreprocessor

And then define a new tokenizer? I don't think we need to worry about switching the defaults; we can just reflect those in the preset configs we upload to Kaggle.

Might be worth trying all of that on a single PR and trying to convert a model to see if everything works. Since we'd mostly be dealing with stub classes it shouldn't be too much code.
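
A self-contained sketch of the stub pattern suggested above. The classes here are empty stand-ins rather than the real KerasHub classes, and the `describe` classmethod is hypothetical; it exists only to show that inherited behavior picks up the overridden class attributes.

```python
class Phi3Backbone:
    """Stand-in for the real Phi-3 backbone."""


class Phi3CausalLMPreprocessor:
    """Stand-in for the real Phi-3 preprocessor."""


class Phi3CausalLM:
    backbone_cls = Phi3Backbone
    preprocessor_cls = Phi3CausalLMPreprocessor

    @classmethod
    def describe(cls):
        # Hypothetical helper: reads the class attributes via `cls`, so
        # subclasses that override them change the result.
        return (
            f"{cls.__name__}({cls.backbone_cls.__name__}, "
            f"{cls.preprocessor_cls.__name__})"
        )


# The Phi-4 stubs add no logic; they only re-point the class references.
class Phi4Backbone(Phi3Backbone):
    pass


class Phi4CausalLMPreprocessor(Phi3CausalLMPreprocessor):
    pass


class Phi4CausalLM(Phi3CausalLM):
    backbone_cls = Phi4Backbone
    preprocessor_cls = Phi4CausalLMPreprocessor


print(Phi4CausalLM.describe())
# Phi4CausalLM(Phi4Backbone, Phi4CausalLMPreprocessor)
```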

yrahul3910 · 2025-06-13T20:36:31Z

Yes, that makes sense to me. I'll try getting it out over the weekend.

abheesht17 · 2025-07-10T19:07:01Z

@yrahul3910 - are you still working on this?

yrahul3910 · 2025-07-10T20:12:49Z

Yes, so sorry for the delay! I'll push some commits soon.

divyashreepathihalli · 2025-07-10T23:59:53Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the Phi-4 backbone and its associated components. The implementation is solid and well-structured. I've identified a few minor areas for improvement, mainly related to typos and redundant code. Addressing these will improve the clarity and maintainability of the new model.

gemini-code-assist · 2025-07-11T00:01:24Z

keras_hub/src/models/phi4/phi4_backbone.py

+        rope_scaling_short_factor List[float]: List of factors used to adjust
+            rope frequencies when the `rope_scaling_type` is `"su"`. List must
+            be of length `hidden_dim//num_query_heads//2`. It is used when
+            `sequence_length` is smaller than `original_max_sequence_length`.


In the docstring for rope_scaling_short_factor, original_max_sequence_length is used. This should be pretraining_sequence_length to align with the parameter name and its usage in the rotary embedding implementation. Apply this same correction to the docstring for rope_scaling_long_factor on line 65.

Suggested change

-            `sequence_length` is smaller than `original_max_sequence_length`.
+            `sequence_length` is smaller than `pretraining_sequence_length`.
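
As a rough sketch of what the docstring describes, the snippet below computes the required factor-list length and picks a factor list by comparing `sequence_length` with `pretraining_sequence_length`. The helper names are hypothetical, and using the long factors in every other case (including equality) is an assumption; the actual rotary embedding implementation may differ.

```python
def expected_factor_length(hidden_dim, num_query_heads):
    """Length required by the docstring: hidden_dim // num_query_heads // 2."""
    return hidden_dim // num_query_heads // 2


def select_rope_factors(
    sequence_length, pretraining_sequence_length, short_factor, long_factor
):
    """Pick the short factors for sequences shorter than the pretraining
    length. Assumption: the long factors apply in all other cases."""
    if sequence_length < pretraining_sequence_length:
        return short_factor
    return long_factor


# Using the decoder defaults quoted earlier in this PR.
n = expected_factor_length(hidden_dim=5120, num_query_heads=40)
short = [1.0] * n  # placeholder values, not real Phi-4 factors
long = [2.0] * n   # placeholder values, not real Phi-4 factors
print(n)  # 64
```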

gemini-code-assist · 2025-07-11T00:01:25Z

keras_hub/src/models/phi4/phi4_backbone.py

+        self.max_sequence_length = max_sequence_length
+        self.pretraining_sequence_length = pretraining_sequence_length
+        self.rope_max_wavelength = rope_max_wavelength
+        self.rope_scaling_type = rope_scaling_type


The attribute self.rope_scaling_type is assigned twice in the __init__ method. This line is a duplicate of the assignment on line 187 and can be removed.

gemini-code-assist · 2025-07-11T00:01:25Z

keras_hub/src/models/phi4/phi4_attention.py

+                inverese_freq_short_factor=self.rope_scaling_short_factor,
+                inverese_freq_long_factor=self.rope_scaling_long_factor,


Typo in the keyword arguments inverese_freq_short_factor and inverese_freq_long_factor. They should be inverse_freq_short_factor and inverse_freq_long_factor respectively. This change is dependent on correcting the corresponding typo in the Phi4SuScaledRotaryEmbedding class definition first.

Suggested change

-                inverese_freq_short_factor=self.rope_scaling_short_factor,
-                inverese_freq_long_factor=self.rope_scaling_long_factor,
+                inverse_freq_short_factor=self.rope_scaling_short_factor,
+                inverse_freq_long_factor=self.rope_scaling_long_factor,

gemini-code-assist · 2025-07-11T00:01:25Z

keras_hub/src/models/phi4/phi4_rotary_embedding.py

+        inverese_freq_short_factor List[float]: List of factors used to adjust
+            rope frequencies when the `rope_scaling_type` is `"su"`. List must
+            be of length `hidden_dim//num_query_heads//2`. It is used when
+            `sequence_length` is smaller than `original_max_sequence_length`.


In the docstring, original_max_sequence_length is mentioned. This should be pretraining_sequence_length to match the class's parameter and its usage. This correction should also be applied on line 19.

Suggested change

-            `sequence_length` is smaller than `original_max_sequence_length`.
+            `sequence_length` is smaller than `pretraining_sequence_length`.

gemini-code-assist · 2025-07-11T00:01:25Z

keras_hub/src/models/phi4/phi4_rotary_embedding.py

+        inverese_freq_short_factor,
+        inverese_freq_long_factor,


Typo in the parameter names: inverese should be inverse. This affects inverese_freq_short_factor and inverese_freq_long_factor.

Correct this here and propagate throughout the class in the following places:

The __init__ method body (lines 64-65)

The _compute_cos_sin_embedding method (lines 78 and 83)

The get_config method (lines 120-121)

The docstring (lines 12 and 16)

Fixing this will improve API clarity and code maintainability.

Suggested change

-        inverese_freq_short_factor,
-        inverese_freq_long_factor,
+        inverse_freq_short_factor,
+        inverse_freq_long_factor,

yrahul3910 and others added 3 commits May 7, 2025 21:54

feat(phi4): add phi4_backbone

4a8566b

docs(phi4): update defaults in docstring

69f66ff

Merge branch 'keras-team:master' into master

1bfe756

divyashreepathihalli requested a review from sachinprasadhs May 28, 2025 02:47

sachinprasadhs reviewed May 30, 2025

sachinprasadhs added the stat:awaiting response from contributor label May 30, 2025

sachinprasadhs removed the stat:awaiting response from contributor label Jun 16, 2025

divyashreepathihalli added this to KerasHub Jul 10, 2025

divyashreepathihalli moved this to In Progress in KerasHub Jul 10, 2025

gemini-code-assist bot reviewed Jul 11, 2025
