Add Qwen3 Moe #2260

Open · kanpuriyanawab wants to merge 9 commits into master

Conversation

kanpuriyanawab (Collaborator)

No description provided.

@kanpuriyanawab kanpuriyanawab self-assigned this May 19, 2025
@mattdangerw (Member) left a comment

Thanks! Took an initial pass. Let's try to clean up the config and state passing.

No passing an index down the layer stack, and no data structures that apply to the whole layer stack.

self,
num_query_heads,
num_key_value_heads,
layer_index,
Member

This layer index is gross, let's remove it. Handle the args properly in the backbone and pass the correct sliding_window_size to this layer and the decoder layer above it.
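For reference, a minimal sketch of the kind of resolution this asks for, done once in the backbone so the attention layer never needs its own index. The layer_index >= max_window_layers cutoff below is an assumed placeholder rule, not the checkpoint's verified convention:

    # Hypothetical sketch: the backbone resolves each layer's sliding window up
    # front and passes a plain value down instead of a layer_index.
    def resolve_sliding_window(
        layer_index, sliding_window_size, max_window_layers, use_sliding_window
    ):
        """Return the sliding window size for one layer, or None for full attention."""
        if use_sliding_window and layer_index >= max_window_layers:
            return sliding_window_size
        return None

    # Example: each decoder layer then receives only its own resolved value.
    per_layer_windows = [resolve_sliding_window(i, 4096, 28, True) for i in range(48)]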

Collaborator Author

Since it's an MoE, the layer index is not just used for the sliding window but also for the experts.

Collaborator Author

I replaced this passing of layer_index, decoder_sparse_step and mlp_only_layers with a single boolean switch:

https://github.com/kanpuriyanawab/keras-nlp/blob/730a9c41e95a74906a041d8933b8d7738391b438/keras_hub/src/models/qwen3_moe/qwen3_moe_backbone.py#L129-L156
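For readers following along, here is a runnable sketch of the kind of per-layer boolean the linked backbone code computes. It mirrors the Hugging Face selection rule for Qwen MoE layers; treat the exact condition as an assumption rather than a quote of the PR:

    def is_sparse_mlp_layer(
        layer_index, num_experts, decoder_sparse_step, mlp_only_layers
    ):
        """Return True if this layer should use the sparse MoE MLP block."""
        return (
            layer_index not in mlp_only_layers
            and num_experts > 0
            and (layer_index + 1) % decoder_sparse_step == 0
        )

    # Example: 4 experts, a sparse step of 2, and no dense-only layers ->
    # odd-indexed layers get the MoE block.
    print([is_sparse_mlp_layer(i, 4, 2, []) for i in range(4)])
    # [False, True, False, True]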

model(input_data)
"""

def __init__(
Member

In general, let's make sure we prune this list down just to the config options we need.

sliding_window_size=32768,
output_router_logits=False,
router_aux_loss_coefficient=0.001,
mlp_only_layers=[],
Member

Fine to have something like this for the top level, but let's pass something more direct to each decoder layer (so we don't need to pass the index down). Make sure to document it if we keep it.

Collaborator Author

Regarding "but let's pass something more direct to each decoder layer": what do you suggest?

@divyashreepathihalli (Collaborator)

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for the Qwen3 MoE model. The implementation looks solid, covering the backbone, attention, decoder, tokenizer, and conversion scripts. I've identified several high-severity issues related to incomplete get_config methods in various new layers, which will prevent model serialization from working correctly. There are also medium-severity issues, such as unused parameters, and a critical issue in the checkpoint conversion test script, where an incorrect preprocessor is used. I've provided suggestions to fix these issues. Once addressed, the PR should be in great shape.
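Since several of the findings below concern get_config methods that drop __init__ arguments, a quick heuristic check can surface them before save/load breaks. A minimal sketch, not part of this PR; base-layer arguments such as name and dtype are skipped on purpose:

    import inspect

    def report_unserialized_init_args(layer):
        """Print __init__ arguments missing from layer.get_config().
        Heuristic only: args handled by the base Layer class are ignored."""
        init_args = set(inspect.signature(type(layer).__init__).parameters)
        init_args -= {"self", "args", "kwargs", "name", "dtype", "trainable"}
        missing = init_args - set(layer.get_config())
        if missing:
            print(f"{type(layer).__name__} does not serialize: {sorted(missing)}")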

Comment on lines +47 to +49
keras_hub_preprocessor = keras_hub.models.QwenCausalLMPreprocessor(
    keras_hub_tokenizer
)


critical

The test is using keras_hub.models.QwenCausalLMPreprocessor, which is for the Qwen2 model. This test should use the newly added Qwen3MoeCausalLMPreprocessor to correctly test the Qwen3 MoE model components.

keras_hub_preprocessor = keras_hub.models.Qwen3MoeCausalLMPreprocessor(
    keras_hub_tokenizer
)

Comment on lines +77 to +79
keras_hub_preprocessor = keras_hub.models.QwenCausalLMPreprocessor(
    keras_hub_tokenizer
)


critical

Similar to the test_model function, test_tokenizer is using the incorrect preprocessor QwenCausalLMPreprocessor. It should be Qwen3MoeCausalLMPreprocessor.

keras_hub_preprocessor = keras_hub.models.Qwen3MoeCausalLMPreprocessor(
    keras_hub_tokenizer
)
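With both call sites switched over, a quick end-to-end sanity check could look like the following. This is a sketch assuming the standard keras_hub CausalLMPreprocessor interface; keras_hub_tokenizer comes from the conversion script above:

    import keras_hub

    # Preprocess a toy batch with the newly added preprocessor; the outputs
    # should have matching token_ids/padding_mask shapes.
    keras_hub_preprocessor = keras_hub.models.Qwen3MoeCausalLMPreprocessor(
        keras_hub_tokenizer
    )
    x, y, sample_weight = keras_hub_preprocessor(["What is Keras?"])
    print(x["token_ids"].shape, x["padding_mask"].shape)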

Comment on lines +346 to +361
def get_config(self):
    config = super().get_config()
    config.update(
        {
            "num_query_heads": self.num_query_heads,
            "num_key_value_heads": self.num_key_value_heads,
            "rope_max_wavelength": self.rope_max_wavelength,
            "rope_scaling_factor": self.rope_scaling_factor,
            "kernel_initializer": keras.initializers.serialize(
                self.kernel_initializer
            ),
            "dropout": self.dropout,
            "sliding_window_size": self.sliding_window_size,
        }
    )
    return config


high

The get_config method is missing several parameters that are defined in __init__ and used in the layer: layer_index, head_dim, and layer_norm_epsilon. Without these, the layer cannot be correctly serialized and deserialized, which will break model saving and loading.

Suggested change

def get_config(self):
    config = super().get_config()
    config.update(
        {
            "num_query_heads": self.num_query_heads,
            "num_key_value_heads": self.num_key_value_heads,
            "rope_max_wavelength": self.rope_max_wavelength,
            "rope_scaling_factor": self.rope_scaling_factor,
            "kernel_initializer": keras.initializers.serialize(
                self.kernel_initializer
            ),
            "dropout": self.dropout,
            "sliding_window_size": self.sliding_window_size,
            "layer_index": self.layer_index,
            "head_dim": self.head_dim,
            "layer_norm_epsilon": self.layer_norm_epsilon,
        }
    )
    return config

Comment on lines +230 to +231
"mlp_only_layers": self.mlp_only_layers,
}


high

The get_config method is missing router_aux_loss_coefficient. This parameter is used to initialize Qwen3MoeTransformerDecoder and is essential for model serialization and deserialization.

                "mlp_only_layers": self.mlp_only_layers,
                "router_aux_loss_coefficient": self.router_aux_loss_coefficient,
            }

Comment on lines +559 to +577
config = super().get_config()
config.update(
    {
        "num_query_heads": self.num_query_heads,
        "intermediate_dim": self.intermediate_dim,
        "moe_intermediate_dim": self.moe_intermediate_dim,
        "rope_max_wavelength": self.rope_max_wavelength,
        "num_key_value_heads": self.num_key_value_heads,
        "rope_scaling_factor": self.rope_scaling_factor,
        "layer_norm_epsilon": self.layer_norm_epsilon,
        "dropout": self.dropout,
        "sliding_window_size": self.sliding_window_size,
        "num_experts": self.num_experts,
        "top_k": self.top_k,
        "norm_top_k_prob": self.norm_top_k_prob,
        "router_aux_loss_coefficient": self.router_aux_loss_coefficient,
    }
)
return config


high

The get_config method is missing several parameters from __init__: head_dim, is_sparse_mlp, activation, kernel_initializer, and layer_index. These are all necessary to correctly serialize and deserialize the layer.

config = super().get_config()
config.update(
    {
        "num_query_heads": self.num_query_heads,
        "intermediate_dim": self.intermediate_dim,
        "moe_intermediate_dim": self.moe_intermediate_dim,
        "rope_max_wavelength": self.rope_max_wavelength,
        "num_key_value_heads": self.num_key_value_heads,
        "rope_scaling_factor": self.rope_scaling_factor,
        "layer_norm_epsilon": self.layer_norm_epsilon,
        "dropout": self.dropout,
        "sliding_window_size": self.sliding_window_size,
        "num_experts": self.num_experts,
        "top_k": self.top_k,
        "norm_top_k_prob": self.norm_top_k_prob,
        "router_aux_loss_coefficient": self.router_aux_loss_coefficient,
        "head_dim": self.head_dim,
        "is_sparse_mlp": self.is_sparse_mlp,
        "activation": keras.activations.serialize(self.activation),
        "kernel_initializer": keras.initializers.serialize(
            self.kernel_initializer
        ),
        "layer_index": self.layer_index,
    }
)
return config

Comment on lines +9 to +31
def convert_backbone_config(transformers_config):
    return {
        "vocabulary_size": transformers_config["vocab_size"],
        "hidden_dim": transformers_config["hidden_size"],
        "head_dim": transformers_config["head_dim"],
        "num_layers": transformers_config["num_hidden_layers"],
        "num_query_heads": transformers_config["num_attention_heads"],
        "num_key_value_heads": transformers_config["num_key_value_heads"],
        "intermediate_dim": transformers_config["intermediate_size"],
        "moe_intermediate_dim": transformers_config["moe_intermediate_size"],
        "num_experts": transformers_config["num_experts"],
        "top_k": transformers_config["num_experts_per_tok"],
        "norm_top_k_prob": transformers_config["norm_topk_prob"],
        "decoder_sparse_step": transformers_config["decoder_sparse_step"],
        "layer_norm_epsilon": transformers_config["rms_norm_eps"],
        "rope_max_wavelength": transformers_config["rope_theta"],
        "use_sliding_window": transformers_config["use_sliding_window"],
        "sliding_window_size": transformers_config["sliding_window"],
        "output_router_logits": transformers_config["output_router_logits"],
        "router_aux_loss_coefficient": transformers_config[
            "router_aux_loss_coef"
        ],
    }


high

The convert_backbone_config function has a couple of issues:

  1. It extracts use_sliding_window and output_router_logits from the Hugging Face config, but these are not used by Qwen3MoeBackbone.
  2. It's missing tie_word_embeddings, which is crucial for correct weight loading. The convert_weights function depends on this value.

These can be fixed by updating the returned dictionary.

def convert_backbone_config(transformers_config):
    return {
        "vocabulary_size": transformers_config["vocab_size"],
        "hidden_dim": transformers_config["hidden_size"],
        "head_dim": transformers_config["head_dim"],
        "num_layers": transformers_config["num_hidden_layers"],
        "num_query_heads": transformers_config["num_attention_heads"],
        "num_key_value_heads": transformers_config["num_key_value_heads"],
        "intermediate_dim": transformers_config["intermediate_size"],
        "moe_intermediate_dim": transformers_config["moe_intermediate_size"],
        "num_experts": transformers_config["num_experts"],
        "top_k": transformers_config["num_experts_per_tok"],
        "norm_top_k_prob": transformers_config["norm_topk_prob"],
        "decoder_sparse_step": transformers_config["decoder_sparse_step"],
        "layer_norm_epsilon": transformers_config["rms_norm_eps"],
        "sliding_window_size": transformers_config["sliding_window"],
        "router_aux_loss_coefficient": transformers_config[
            "router_aux_loss_coef"
        ],
        "tie_word_embeddings": transformers_config.get("tie_word_embeddings", False),
    }
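On point 2, the weight converter would then branch on that flag when porting the output projection. A hypothetical sketch of the usual pattern in keras_hub transformer converters; the helper name and exact loader API here are assumptions, not the PR's actual convert_weights:

    import numpy as np

    def convert_lm_head(backbone, loader, tie_word_embeddings):
        # When embeddings are tied there is no separate lm_head weight to load;
        # the output projection reuses the token embedding matrix.
        if tie_word_embeddings:
            return
        loader.port_weight(
            keras_variable=backbone.token_embedding.reverse_embeddings,
            hf_weight_key="lm_head.weight",
            hook_fn=lambda hf_tensor, _: np.transpose(hf_tensor),
        )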

dropout=0,
layer_norm_epsilon=1e-5,
sliding_window_size=4096,
max_window_layers=28,


medium

The __init__ method defines a parameter max_window_layers which is not used anywhere in the class. To improve code clarity and maintainability, it's best to remove unused parameters.


def get_config(self):
    config = super().get_config()
    config.update({"epsilon": self.epsilon})


medium

The get_config method should also include hidden_dim. While it can be inferred during build, explicitly saving it makes deserialization more robust and clear.

config.update({"epsilon": self.epsilon, "hidden_dim": self.hidden_dim})
