Conversation
```python
skip_logits = self.training and (labels is not None or shift_labels is not None)

if skip_logits:
    loss = LigerForCausalLMLoss(
```
Kindly have a look at the other model examples and adapt to the new API that returns the metric.
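For context, a minimal sketch of the fused-loss branch that the other Liger model forwards follow, adapted to a loss helper that also returns a metric. The `(loss, metric)` return shape, the keyword names, and the `_loss_branch` helper are illustrative assumptions, not the confirmed Liger API:

```python
# Sketch only: hypothetical helper showing the skip_logits pattern.
# The (loss, metric) return and the keyword names below are assumptions.
def _loss_branch(self, hidden_states, labels=None, shift_labels=None, **loss_kwargs):
    skip_logits = self.training and (labels is not None or shift_labels is not None)
    logits, loss, metric = None, None, None
    if skip_logits:
        # Fused linear + cross-entropy path: the full logits tensor is never materialized.
        loss, metric = LigerForCausalLMLoss(
            hidden_states=hidden_states,
            lm_head_weight=self.lm_head.weight,
            labels=labels,
            shift_labels=shift_labels,
            hidden_size=self.config.hidden_size,
            **loss_kwargs,
        )
    else:
        # Eval / generation path: materialize logits and fall back to the stock loss.
        logits = self.lm_head(hidden_states)
        if labels is not None or shift_labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **loss_kwargs)
    return logits, loss, metric
```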
Fixed in 5af9d16
```python
if swiglu:
    _patch_swiglu_module(decoder_layer.mlp, LigerSwiGLUMLP)
if rms_norm:
    _patch_rms_norm_module(decoder_layer.input_layernorm)
    _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
# patch MOE layers
if isinstance(decoder_layer.mlp, Glm4MoeMoE):
    experts = decoder_layer.mlp.experts
    if experts is not None:
        for expert in experts:
            _patch_swiglu_module(expert, LigerSwiGLUMLP)

    shared_experts = decoder_layer.mlp.shared_experts
    if shared_experts is not None:
        _patch_swiglu_module(shared_experts, LigerSwiGLUMLP)
```
Suggested change:

```diff
-if swiglu:
-    _patch_swiglu_module(decoder_layer.mlp, LigerSwiGLUMLP)
-if rms_norm:
-    _patch_rms_norm_module(decoder_layer.input_layernorm)
-    _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
-# patch MOE layers
-if isinstance(decoder_layer.mlp, Glm4MoeMoE):
-    experts = decoder_layer.mlp.experts
-    if experts is not None:
-        for expert in experts:
-            _patch_swiglu_module(expert, LigerSwiGLUMLP)
-    shared_experts = decoder_layer.mlp.shared_experts
-    if shared_experts is not None:
-        _patch_swiglu_module(shared_experts, LigerSwiGLUMLP)
+if swiglu:
+    if isinstance(decoder_layer.mlp, Glm4MoeMoE):
+        experts = decoder_layer.mlp.experts
+        if experts is not None:
+            for expert in experts:
+                _patch_swiglu_module(expert, LigerSwiGLUMLP)
+        shared_experts = decoder_layer.mlp.shared_experts
+        if shared_experts is not None:
+            _patch_swiglu_module(shared_experts, LigerSwiGLUMLP)
+    elif isinstance(decoder_layer.mlp, Glm4MoeMLP):
+        _patch_swiglu_module(decoder_layer.mlp, LigerSwiGLUMLP)
+if rms_norm:
+    _patch_rms_norm_module(decoder_layer.input_layernorm)
+    _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
```
You need to check whether this layer is an MLP or an MoE layer before patching, and the MoE patching should be under the `if swiglu:` scope.
```python
assert inspect.getsource(dummy_model_instance.forward) != inspect.getsource(glm4_moe_lce_forward)
assert inspect.getsource(dummy_model_instance.model.norm.forward) != inspect.getsource(LigerRMSNorm.forward)
for decoder_layer in dummy_model_instance.model.layers:
    assert inspect.getsource(decoder_layer.mlp.forward) != inspect.getsource(LigerSwiGLUMLP.forward)
```
Same as above: `mlp.forward` shouldn't be patched if this is an MoE layer. Need to check `isinstance()` first.
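For illustration, a rough sketch of an `isinstance()`-aware version of this pre-patch check, assuming the usual transformers module path for the GLM-4 MoE classes and that `inspect`, `dummy_model_instance`, and `LigerSwiGLUMLP` are in scope as in the surrounding test (not the final test code):

```python
import inspect

from transformers.models.glm4_moe.modeling_glm4_moe import Glm4MoeMLP, Glm4MoeMoE

for decoder_layer in dummy_model_instance.model.layers:
    if isinstance(decoder_layer.mlp, Glm4MoeMoE):
        # MoE layer: the routed experts and the shared expert must still be unpatched.
        for expert in decoder_layer.mlp.experts:
            assert inspect.getsource(expert.forward) != inspect.getsource(LigerSwiGLUMLP.forward)
        assert inspect.getsource(decoder_layer.mlp.shared_experts.forward) != inspect.getsource(
            LigerSwiGLUMLP.forward
        )
    elif isinstance(decoder_layer.mlp, Glm4MoeMLP):
        # Dense MLP layer: the MLP itself must still be unpatched.
        assert inspect.getsource(decoder_layer.mlp.forward) != inspect.getsource(LigerSwiGLUMLP.forward)
```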
```python
assert inspect.getsource(dummy_model_instance.forward) == inspect.getsource(glm4_moe_lce_forward)
assert inspect.getsource(dummy_model_instance.model.norm.forward) == inspect.getsource(LigerRMSNorm.forward)
for decoder_layer in dummy_model_instance.base_model.layers:
    if decoder_layer.mlp is not None:
```
Check `isinstance(decoder_layer.mlp, Glm4MoeMLP)` instead of just `is not None`.
```python
assert inspect.getsource(decoder_layer.post_attention_layernorm.forward) == inspect.getsource(
    LigerRMSNormForGlm4.forward
)
assert inspect.getsource(decoder_layer.input_layernorm.forward) == inspect.getsource(
    LigerRMSNormForGlm4.forward
)
```
Wrong scope: the RMSNorm patches should be checked for both MLP layers and MoE layers.
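Putting the comments above together, a rough sketch of what the post-patch assertions could look like, reusing the same names as in the sketch further up (again, not the final test code):

```python
for decoder_layer in dummy_model_instance.base_model.layers:
    if isinstance(decoder_layer.mlp, Glm4MoeMoE):
        # MoE layer: every routed expert and the shared expert should now be Liger SwiGLU.
        for expert in decoder_layer.mlp.experts:
            assert inspect.getsource(expert.forward) == inspect.getsource(LigerSwiGLUMLP.forward)
        assert inspect.getsource(decoder_layer.mlp.shared_experts.forward) == inspect.getsource(
            LigerSwiGLUMLP.forward
        )
    elif isinstance(decoder_layer.mlp, Glm4MoeMLP):
        # Dense MLP layer: the MLP itself should now be Liger SwiGLU.
        assert inspect.getsource(decoder_layer.mlp.forward) == inspect.getsource(LigerSwiGLUMLP.forward)

    # RMSNorm is patched on every decoder layer, so check it outside the MLP/MoE branch.
    assert inspect.getsource(decoder_layer.input_layernorm.forward) == inspect.getsource(
        LigerRMSNormForGlm4.forward
    )
    assert inspect.getsource(decoder_layer.post_attention_layernorm.forward) == inspect.getsource(
        LigerRMSNormForGlm4.forward
    )
```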
Force-pushed from 79c22df to 5cf027e.
```python
if isinstance(decoder_layer.mlp, Glm4MoeMoE):
    experts = decoder_layer.mlp.experts
    if experts is not None:
        for expert in experts:
            _patch_swiglu_module(expert, LigerSwiGLUMLP)

    shared_experts = decoder_layer.mlp.shared_experts
    if shared_experts is not None:
        _patch_swiglu_module(shared_experts, LigerSwiGLUMLP)
```
It's not under the `if swiglu:` scope.
Thank you, I'll fix it
Tcc0403 left a comment:
Patching should work, but we need to tune down the MoE-related numbers to speed up the convergence tests.
```python
mini_model_config=Glm4MoeConfig(
    bos_token_id=1,  # None
    eos_token_id=2,  # 151329, 151336, 151338
    pad_token_id=2,  # 151329
    partial_rotary_factor=0.5,
    cross_attention_layers=None,
    dropout=0,
    hidden_act="silu",
    hidden_size=1024,  # 6144
    initializer_range=0.02,
    intermediate_size=2048,  # 14336
    max_position_embeddings=4096,  # 32768
    num_attention_heads=8,  # 48
    num_hidden_layers=4,  # 61
    num_key_value_heads=2,
    rms_norm_eps=1e-5,
    rope_scaling=None,
    rope_theta=500_000,
    tie_word_embeddings=False,
    use_cache=True,
    vocab_size=32000,  # 151552
    attention_bias=True,
    attn_implementation="sdpa",  # default value, pytorch native attention
),
```
Set smaller expert-related numbers as well:
"moe_intermediate_size": 1408,
"num_experts_per_tok": 2,
"n_shared_experts": 1,
"n_routed_experts": 8,
"routed_scaling_factor": 1.0,
"n_group": 1,
"topk_group": 1,
"first_k_dense_replace": 1,
The same hunk appears three times in the diff:

```python
liger_kernel_patch_func=apply_liger_kernel_to_glm4_moe,
liger_kernel_patch_revert_func=revert_liger_kernel_to_glm4_moe,
model_class=Glm4MoeForCausalLM,
mini_model_config=Glm4MoeConfig(
```
Tcc0403 left a comment:
Although the current mini model configuration can verify the correctness of the patching, the training loss remains unchanged (>10) across all iterations (the loss curve does not decrease). I suggest commenting out this test case for now, adding a TODO note, and opening an issue to track and investigate this problem.
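A minimal sketch of how the case could be parked with a TODO; the test name and helper below are placeholders, and in the actual parametrized convergence suite the same mark can be attached via `pytest.param(..., marks=pytest.mark.skip(...))`:

```python
import pytest


# TODO: re-enable once the GLM-4 MoE mini-model convergence issue is resolved
# (loss stays > 10 across all iterations); track it in a follow-up issue.
@pytest.mark.skip(reason="GLM-4 MoE mini-model loss does not decrease; see tracking issue")
def test_mini_model_glm4_moe_convergence():
    run_mini_model_convergence("mini_glm4_moe")  # hypothetical helper, for illustration only
```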
Summary
This PR adds support for GLM-4.5 (GLM-4 MoE) models to the Liger Kernel (#951). GLM-4.5 (https://huggingface.co/zai-org/GLM-4.5) shares the same structure as GLM-4.6.
Testing Done
For the fp32 convergence test, the model size can easily lead to OOM. I initially ran the tests on a 4090, but only fp32 hit OOM, so I moved to an L40S to finish all the tests.
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence