[NVIDIA] Fix Llama4 Scout FP4 functionality issues #21499

Merged: 1 commit merged into vllm-project:main on Jul 30, 2025

Conversation

@nvpohanh (Contributor) commented on Jul 24, 2025:

Fix the weight loading issues and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model.

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Fix the weight loading issues and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model.

Test Plan

Run Scout FP4/FP8 accuracy tests on TP2.

Test Result

Scout FP4 TP2:

vllm (pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,quantization=modelopt_fp4,tensor_parallel_size=2,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.912|±  |0.0127|
|     |       |strict-match    |     5|exact_match|↑  |0.900|±  |0.0134|

Scout FP8 TP2:

vllm (pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8,quantization=modelopt,tensor_parallel_size=2,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.928|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.894|±  |0.0138|

(Optional) Documentation Update

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the llama (Related to Llama models) and v1 labels on Jul 24, 2025
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request addresses weight loading and accuracy issues in the NVIDIA ModelOpt Llama4 Scout FP4 model. The changes include updates to the FlashInfer attention backend, a workaround in the CUTLASS MoE kernel, and corrections to weight/scale loading logic for quantized Llama4 models. A potential inconsistency in MoE scale loading in vllm/model_executor/models/llama4.py has been identified and flagged as high severity.
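
For context on the scale handling mentioned above: NVFP4 stores weights as 4-bit E2M1 values packed two per uint8 byte, grouped into 16-element blocks that each carry their own block scale, plus one global scale per tensor. The snippet below is only an illustration of that layout (the E2M1 code table and nibble order are assumptions, and this is not vLLM's kernel path or the loader changed in this PR):

```python
import torch

# Illustration of the NVFP4 layout only: assumed E2M1 code table and nibble
# order; this is not vLLM's kernel path or the loader touched by this PR.
BLOCK = 16
E2M1_LUT = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                         -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_nvfp4(packed: torch.Tensor,        # uint8, two 4-bit codes per byte
                  block_scales: torch.Tensor,  # one scale per 16-element block
                  global_scale: torch.Tensor   # one scale per tensor
                  ) -> torch.Tensor:
    lo = packed & 0x0F          # low nibble (assumed to come first)
    hi = packed >> 4            # high nibble
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)
    vals = E2M1_LUT[codes.long()]
    # Apply the per-block scale, then the per-tensor global scale.
    vals = vals.reshape(*vals.shape[:-1], -1, BLOCK)
    vals = vals * block_scales.float().unsqueeze(-1)
    return vals.flatten(-2) * global_scale.float()
```

The relevant point is simply that a packed uint8 weight has to be matched up with scale tensors of two different granularities at load time, which is the kind of bookkeeping the weight/scale loading corrections described above deal with.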

@nvpohanh (Contributor, Author):

DO NOT MERGE yet since this depends on #21485 and #21465

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from 97acab5 to 78aa123 on July 24, 2025 09:09
@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch 2 times, most recently from 765aaff to afaf28d on July 25, 2025 08:18
@nvpohanh marked this pull request as ready for review on July 25, 2025 08:19
@nvpohanh (Contributor, Author):

This PR is ready for review. Thanks @jingyu-ml for helping.

@nvpohanh (Contributor, Author):

The fastcheck failure doesn't seem to be caused by my change?

https://buildkite.com/vllm/fastcheck/builds/32228/steps/canvas?jid=019840ab-8eea-4798-aaea-ad2e2c6773ea

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch 2 times, most recently from f75578d to 91ec86d on July 25, 2025 12:51
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jul 25, 2025
@mgoin (Member) left a comment:

LGTM. It would be nicer if we had an attribute registered to the parameter to know if fp4. Currently the uint8 logic could affect future formats

@mgoin (Member) commented on Jul 26, 2025:

@nvpohanh please merge with main and fix the pre-commit errors to resolve the test failures

@nvpohanh (Contributor, Author):

LGTM. It would be nicer if we had an attribute registered to the parameter to know if fp4. Currently the uint8 logic could affect future formats

Agreed. @jingyu-ml for vis
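
A minimal sketch of the attribute-based approach suggested above (the attribute names used here, quant_format and packing_factor, are hypothetical and not part of vLLM's current API): the quantization method would tag the parameter explicitly, rather than the loader inferring "packed FP4" from the uint8 dtype.

```python
import torch
from torch.nn import Parameter

# Hypothetical sketch of the suggestion above; the attribute names are made up
# for illustration and are not part of vLLM's current API.
weight = Parameter(torch.empty(128, 64, dtype=torch.uint8), requires_grad=False)
weight.quant_format = "nvfp4"   # explicit marker set by the quantization method
weight.packing_factor = 2       # two 4-bit values per uint8 byte

def needs_fp4_unpacking(param: torch.nn.Parameter) -> bool:
    # Check the explicit marker instead of sniffing dtype == torch.uint8, so a
    # future uint8-backed format does not accidentally take the FP4 path.
    return getattr(param, "quant_format", None) == "nvfp4"

assert needs_fp4_unpacking(weight)
```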

@nvpohanh (Contributor, Author):

Run pre-commit run --show-diff-on-failure --color=always --all-files --hook-stage manual
yapf................................................................................................Failed
- hook id: yapf
- exit code: 1

Traceback (most recent call last):
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/bin/yapf", line 5, in <module>
    from yapf import run_main
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf/__init__.py", line 40, in <module>
    from yapf.yapflib import yapf_api
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf/yapflib/yapf_api.py", line 38, in <module>
    from yapf.pyparser import pyparser
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf/pyparser/pyparser.py", line 44, in <module>
    from yapf.yapflib import format_token
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf/yapflib/format_token.py", line 23, in <module>
    from yapf.pytree import pytree_utils
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf/pytree/pytree_utils.py", line 30, in <module>
    from yapf_third_party._ylib2to3 import pygram
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pygram.py", line 39, in <module>
    pattern_grammar = driver.load_grammar(_PATTERN_GRAMMAR_FILE)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/driver.py", line 248, in load_grammar
    g.load(gp)
  File "/home/runner/.cache/pre-commit/repo20fqe0ai/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/grammar.py", line [128](https://github.com/vllm-project/vllm/actions/runs/16522417154/job/46727126025?pr=21499#step:6:133), in load
    d = pickle.load(f)
        ^^^^^^^^^^^^^^
EOFError: Ran out of input

The pre-commit failure doesn't seem to be caused by my change... let me try again.

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from 91ec86d to 6ecb2bc on July 28, 2025 01:46
@nvpohanh (Contributor, Author):

The buildkite/ci/pr/distributed-tests-2-gpus failures do not seem to be caused by my change...

@nvpohanh (Contributor, Author):

Okay, I see that the test failures are indeed caused by my change:

[2025-07-28T06:14:57Z] FAILED models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-True-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2] - AssertionError: function <function test_dummy_maverick at 0x7f0004fd4d60> failed when called with args () and kwargs {'monkeypatch': <_pytest.monkeypatch.MonkeyPatch object at 0x7f000b1545f0>, 'original_model_name': 'meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', 'text_layers': 4, 'num_experts': 4, 'vision_layers': 2, 'enforce_eager': True, 'tp': 2, 'ep': True}
[2025-07-28T06:14:57Z] FAILED models/multimodal/generation/test_maverick.py::test_dummy_maverick[2-True-False-meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8-4-4-2] - AssertionError: function <function test_dummy_maverick at 0x7f0004fd4d60> failed when called with args () and kwargs {'monkeypatch': <_pytest.monkeypatch.MonkeyPatch object at 0x7efffcfa8ef0>, 'original_model_name': 'meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', 'text_layers': 4, 'num_experts': 4, 'vision_layers': 2, 'enforce_eager': False, 'tp': 2, 'ep': True}

I will debug this.

@nvpohanh (Contributor, Author):

We have confirmed that the breakage is real and are fixing it. Please do not merge this PR for now.

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from 118cc65 to 707db88 on July 29, 2025 10:39
@nvpohanh (Contributor, Author):

Pushed a fix for the pipeline failure.

@nvpohanh (Contributor, Author):

Verified accuracy:

vllm (pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8,quantization=modelopt,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.932|±  |0.0113|
|     |       |strict-match    |     5|exact_match|↑  |0.914|±  |0.0126|

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from 1c2a127 to 23bd139 on July 29, 2025 13:27
@nvpohanh (Contributor, Author):

I found that my previous accuracy check actually used FP8... this time it is FP4 for real:

vllm (pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,quantization=modelopt_fp4,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.910|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  |0.896|±  |0.0137|

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from 23bd139 to c2f113a on July 29, 2025 13:42
@mgoin (Member) left a comment:

I found this breaks Llama4 NVFP4 with compressed-tensors:

lm_eval --model vllm --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 475, in load_weights
    moe_loaded = self.load_moe_expert_weights(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 402, in load_moe_expert_weights
    weight_loader(param,
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1202, in weight_loader
    self._load_model_weight_or_group_weight_scale(
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 904, in _load_model_weight_or_group_weight_scale
    self._load_w2(shard_dim=shard_dim,
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 971, in _load_w2
    expert_data.copy_(loaded_weight)
RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 0

On main I'm able to run the eval correctly:

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9090|±  |0.0079|
|     |       |strict-match    |     5|exact_match|↑  |0.8992|±  |0.0083|
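
For readers following along: packed FP4 weights store two values per uint8 byte, so shard sizes and offsets computed in logical element counts are off by the packing factor when applied to the stored tensor. The toy example below only illustrates that arithmetic; it is not vLLM's loader and not necessarily the exact cause of the 5120-vs-4096 mismatch in the traceback above:

```python
import torch

# Toy illustration of packed-FP4 shard arithmetic; not vLLM's loader and not
# necessarily the exact cause of the size mismatch reported above.
PACKING_FACTOR = 2          # two 4-bit values per uint8 byte
logical_cols = 8192         # logical (unpacked) width of the weight
tp_size = 2

# The checkpoint materializes the weight packed: half as many uint8 columns.
stored = torch.empty(5120, logical_cols // PACKING_FACTOR, dtype=torch.uint8)

# A shard width computed in logical elements overshoots the stored tensor,
# while dividing by the packing factor matches what is actually stored.
logical_shard = logical_cols // tp_size                    # 4096
packed_shard = logical_cols // tp_size // PACKING_FACTOR   # 2048

assert packed_shard == stored.shape[1] // tp_size
assert logical_shard != stored.shape[1] // tp_size
```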

@nvpohanh (Contributor, Author):

@mgoin I will debug this today

@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from a46f5df to bb1e7e0 on July 30, 2025 08:19
@nvpohanh (Contributor, Author):

Pushed a new fix and added a bunch of comments to explain what's going on.

Accuracy tests:

ModelOpt Scout FP8:

lm_eval --model vllm --model_args pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8,quantization=modelopt,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200

vllm (pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8,quantization=modelopt,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.932|±  |0.0113|
|     |       |strict-match    |     5|exact_match|↑  |0.912|±  |0.0127|

ModelOpt Scout FP4:

lm_eval --model vllm --model_args pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,quantization=modelopt_fp4,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200

vllm (pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,quantization=modelopt_fp4,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.91|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  | 0.90|±  |0.0134|

RedHat Scout NVFP4:

lm_eval --model vllm --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,kv_cache_dtype=auto,trust_remote_code=True), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.916|±  |0.0124|
|     |       |strict-match    |     5|exact_match|↑  |0.900|±  |0.0134|

I also verified with the reduced Maverick model (used in the pipeline) and it worked.

I only ran TP1 and didn't have the chance to run TP2. However, I think my latest change is not related to the sharding logic, so it should be okay.

Fix the weight loading issues and accuracy issues when using the NVIDIA
ModelOpt Llama4 Scout FP4 model.

Signed-off-by: Po-Han Huang <[email protected]>
@nvpohanh force-pushed the dev/nvpohanh/scout-nvfp4-fix branch from bb1e7e0 to edfd4f9 on July 30, 2025 09:16
@nvpohanh (Contributor, Author):

[2025-07-30T09:57:30Z] ERROR v1/test_external_lb_dp.py::test_external_lb_single_completion[4-ibm-research/PowerMoE-3b] - Exception: Servers failed to start
[2025-07-30T09:57:30Z] ERROR v1/test_external_lb_dp.py::test_external_lb_completion_streaming[4-ibm-research/PowerMoE-3b] - Exception: Servers failed to start

Need further debugging...

@nvpohanh (Contributor, Author):

I see the same tests also failed in #21921, so they are probably not caused by my change...

@nvpohanh (Contributor, Author):

I saw errors like this in the pipeline logs:

[2025-07-30T10:52:26Z] E       Please pass the argument `trust_remote_code=True` to allow custom code to be run. [type=value_error, input_value=ArgsKwargs(('Skywork/Skyw...se, 'hf_overrides': {}}), input_type=ArgsKwargs]

But is that caused by my change?

@mgoin (Member) left a comment:

Looks to be in a good state to me now, thanks for the hard work.

Validated that existing FP8, INT4, and FP4 models are unaffected:

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9037|±  |0.0081|
|     |       |strict-match    |     5|exact_match|↑  |0.8901|±  |0.0086|

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9151|±  |0.0077|
|     |       |strict-match    |     5|exact_match|↑  |0.8961|±  |0.0084|

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
|     |       |strict-match    |     5|exact_match|↑  |0.8992|±  |0.0083|

@vllm-bot merged commit ff08e51 into vllm-project:main on Jul 30, 2025 (72 of 77 checks passed)
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
juuice-lee pushed a commit to juuice-lee/vllm-moe.code that referenced this pull request Jul 31, 2025
Labels: llama (Related to Llama models), ready (ONLY add when PR is ready to merge/full CI is needed), v1