**Describe the bug**

I am trying to quantize GLM 4.7 to NVFP4 using ModelOpt, but (1) the MTP layer does not seem to be quantized, and (2) there are issues serving the resulting checkpoint with SGLang when MTP is enabled. This report is specific to (1).

**Steps/Code to reproduce bug**

```bash
git clone git@github.com:NVIDIA/Model-Optimizer.git
cd Model-Optimizer
pip install -e .[dev]
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path /models/glm-fp16 \
    --qformat nvfp4_mlp_only \
    --export_path /models/glm-nvfp4_mlp_only \
    --kv_cache_qformat none \
    --calib_size 64 \
    --trust_remote_code \
    --dataset cnn_dailymail
```

The generated `hf_quant_config.json` does not exclude the MTP layer (`model.layers.92*`). I also tried manually adding `"*mtp*": {"enable": False}` to the quantization config `mtq.NVFP4_MLP_ONLY_CFG`, but the MTP layer is still not excluded from quantization.
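For reference, the manual override attempt looked roughly like this (a minimal sketch, assuming the ModelOpt convention of mapping wildcard patterns in `quant_cfg` to `{"enable": False}`; the `model.layers.92*` pattern is an extra variant shown for illustration, not something tried in the original report):

```python
import copy

import modelopt.torch.quantization as mtq

# Start from the MLP-only NVFP4 recipe and try to disable quantization
# for anything matching the MTP layer (layer 92 in this checkpoint).
cfg = copy.deepcopy(mtq.NVFP4_MLP_ONLY_CFG)
cfg["quant_cfg"]["*mtp*"] = {"enable": False}             # pattern actually tried
cfg["quant_cfg"]["model.layers.92*"] = {"enable": False}  # illustrative variant
```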
Also, the `mtp.safetensors` file does not seem to be carried over to the quantized checkpoint. When running the ModelOpt script I see the following message, which suggests the MTP layer is not being loaded at all:
```
Some weights of the model checkpoint at /home/prithudasgupta_google_com/models/glm-fp16 were not used when initializing Glm4MoeForCausalLM: ['model.layers.92.eh_proj.weight', 'model.layers.92.embed_tokens.weight', 'model.layers.92.enorm.weight', 'model.layers.92.hnorm.weight', 'model.layers.92.input_layernorm.weight', 'model.layers.92.mlp.experts.0.down_proj.weight', 'model.layers.92.mlp.experts.0.gate_proj.weight', 'model.layers.92.mlp.experts.0.up_proj.weight', 'model.layers.92.mlp.experts.1.down_proj.weight', 'model.layers.92.mlp.experts.1.gate_proj.weight', 'model.layers.92.mlp.experts.1.up_proj.weight', 'model.layers.92.mlp.experts.10.down_proj.weight', 'model.layers.92.mlp.experts.10.gate_proj.weight', 'model.layers.92.mlp.experts.10.up_proj.weight', 'model.layers.92.mlp.experts.100.down_proj.weight', 'model.layers.92.mlp.experts.100.gate_proj.weight',
```
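To confirm that the unused tensors are exactly the MTP weights, the source checkpoint's safetensors index can be inspected (a minimal sketch, assuming the standard Hugging Face `model.safetensors.index.json` layout):

```python
import json

# Standard HF index file mapping tensor names to their safetensors shards.
with open("/models/glm-fp16/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Layer 92 is the MTP layer in this checkpoint.
mtp_keys = [k for k in weight_map if k.startswith("model.layers.92.")]
print(f"{len(mtp_keys)} MTP tensors, stored in: {sorted({weight_map[k] for k in mtp_keys})}")
```

Running the same check against the export directory shows whether the MTP tensors survive the export.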
The exported `hf_quant_config.json`:

```json
{
  "producer": {
    "name": "modelopt",
    "version": "0.41.0.dev45+g381ac9dc5"
  },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": [
      "lm_head",
      "model.layers.0.self_attn*",
      "model.layers.1.self_attn*",
      "model.layers.10.self_attn*",
      "model.layers.11.self_attn*",
      "model.layers.12.self_attn*",
      "model.layers.13.self_attn*",
      "model.layers.14.self_attn*",
      "model.layers.15.self_attn*",
      "model.layers.16.self_attn*",
      "model.layers.17.self_attn*",
      "model.layers.18.self_attn*",
      "model.layers.19.self_attn*",
      "model.layers.2.self_attn*",
      "model.layers.20.self_attn*",
      "model.layers.21.self_attn*",
      "model.layers.22.self_attn*",
      "model.layers.23.self_attn*",
      "model.layers.24.self_attn*",
      "model.layers.25.self_attn*",
      "model.layers.26.self_attn*",
      "model.layers.27.self_attn*",
      "model.layers.28.self_attn*",
      "model.layers.29.self_attn*",
      "model.layers.3.self_attn*",
      "model.layers.30.self_attn*",
      "model.layers.31.self_attn*",
      "model.layers.32.self_attn*",
      "model.layers.33.self_attn*",
      "model.layers.34.self_attn*",
      "model.layers.35.self_attn*",
      "model.layers.36.self_attn*",
      "model.layers.37.self_attn*",
      "model.layers.38.self_attn*",
      "model.layers.39.self_attn*",
      "model.layers.4.self_attn*",
      "model.layers.40.self_attn*",
      "model.layers.41.self_attn*",
      "model.layers.42.self_attn*",
      "model.layers.43.self_attn*",
      "model.layers.44.self_attn*",
      "model.layers.45.self_attn*",
      "model.layers.46.self_attn*",
      "model.layers.47.self_attn*",
      "model.layers.48.self_attn*",
      "model.layers.49.self_attn*",
      "model.layers.5.self_attn*",
      "model.layers.50.self_attn*",
      "model.layers.51.self_attn*",
      "model.layers.52.self_attn*",
      "model.layers.53.self_attn*",
      "model.layers.54.self_attn*",
      "model.layers.55.self_attn*",
      "model.layers.56.self_attn*",
      "model.layers.57.self_attn*",
      "model.layers.58.self_attn*",
      "model.layers.59.self_attn*",
      "model.layers.6.self_attn*",
      "model.layers.60.self_attn*",
      "model.layers.61.self_attn*",
      "model.layers.62.self_attn*",
      "model.layers.63.self_attn*",
      "model.layers.64.self_attn*",
      "model.layers.65.self_attn*",
      "model.layers.66.self_attn*",
      "model.layers.67.self_attn*",
      "model.layers.68.self_attn*",
      "model.layers.69.self_attn*",
      "model.layers.7.self_attn*",
      "model.layers.70.self_attn*",
      "model.layers.71.self_attn*",
      "model.layers.72.self_attn*",
      "model.layers.73.self_attn*",
      "model.layers.74.self_attn*",
      "model.layers.75.self_attn*",
      "model.layers.76.self_attn*",
      "model.layers.77.self_attn*",
      "model.layers.78.self_attn*",
      "model.layers.79.self_attn*",
      "model.layers.8.self_attn*",
      "model.layers.80.self_attn*",
      "model.layers.81.self_attn*",
      "model.layers.82.self_attn*",
      "model.layers.83.self_attn*",
      "model.layers.84.self_attn*",
      "model.layers.85.self_attn*",
      "model.layers.86.self_attn*",
      "model.layers.87.self_attn*",
      "model.layers.88.self_attn*",
      "model.layers.89.self_attn*",
      "model.layers.9.self_attn*",
      "model.layers.90.self_attn*",
      "model.layers.91.self_attn*"
    ]
  }
}
```
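Note that `exclude_modules` stops at layer 91; no pattern covers the MTP layer. A quick programmatic check (a sketch using the export path from the repro command above):

```python
import json

with open("/models/glm-nvfp4_mlp_only/hf_quant_config.json") as f:
    excluded = json.load(f)["quantization"]["exclude_modules"]

# No pattern covering the MTP layer (model.layers.92) is present.
print(any("layers.92" in p or "mtp" in p for p in excluded))  # -> False
```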
**Expected behavior**

The MTP layer (`model.layers.92*`) should be listed in `exclude_modules` in the exported `hf_quant_config.json`, and the MTP weights (`mtp.safetensors`) should be loaded and carried over to the quantized checkpoint.
**System information**

- ModelOpt version: `0.41.0.dev45+g381ac9dc5` (from the `producer` field above)