[NVIDIA] Fix Llama4 Scout FP4 functionality issues #21499
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request addresses weight loading and accuracy issues in the NVIDIA ModelOpt Llama4 Scout FP4 model. The changes include updates to the FlashInfer attention backend, a workaround in the CUTLASS MoE kernel, and corrections to weight/scale loading logic for quantized Llama4 models. A potential inconsistency in MoE scale loading in vllm/model_executor/models/llama4.py has been identified and flagged as high severity.
Force-pushed from 97acab5 to 78aa123.
Force-pushed from 765aaff to afaf28d.
This PR is ready for review. Thanks @jingyu-ml for helping.
The fastcheck failure doesn't seem to be caused by my change?
Force-pushed from f75578d to 91ec86d.
LGTM. It would be nicer if we had an attribute registered on the parameter to indicate whether it is FP4; the current uint8-dtype logic could affect future formats.
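For illustration, here is a minimal sketch (not vLLM's actual code) contrasting the two detection approaches discussed in this review comment; the `is_fp4_packed` attribute name is hypothetical:

```python
import torch
from torch.nn import Parameter

# Heuristic the review note refers to: infer "packed FP4" from the uint8
# dtype. Any future format that also stores its payload as uint8 would match.
def is_fp4_by_dtype(param: torch.Tensor) -> bool:
    return param.dtype == torch.uint8

# Suggested alternative: tag the parameter when the quantization method
# creates it, so the weight loader can branch on the tag instead of the dtype.
def make_packed_fp4_weight(packed: torch.Tensor) -> Parameter:
    param = Parameter(packed, requires_grad=False)
    param.is_fp4_packed = True  # hypothetical attribute name
    return param

def is_fp4_by_attr(param: torch.Tensor) -> bool:
    return getattr(param, "is_fp4_packed", False)

w = make_packed_fp4_weight(torch.zeros(16, 8, dtype=torch.uint8))
print(is_fp4_by_dtype(w), is_fp4_by_attr(w))  # True True
```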
@nvpohanh please merge with main and fix the pre-commit errors to resolve the test failures.
Agreed. @jingyu-ml for visibility.
The pre-commit failure doesn't seem to be caused by my change... let me try again.
Force-pushed from 91ec86d to 6ecb2bc.
The buildkite/ci/pr/distributed-tests-2-gpus failures do not seem to be caused by my change...
Force-pushed from 6ecb2bc to 118cc65.
Okay, I see that the test failures are indeed caused by my change:
I will debug this.
We have confirmed that the breakage is real and are fixing it. Please do not merge this PR for now.
Force-pushed from 118cc65 to 707db88.
Pushed a fix for the pipeline failure.
Verified accuracy:
Force-pushed from 1c2a127 to 23bd139.
I found that my previous accuracy check was actually run on FP8... this time it is FP4 for real:
Force-pushed from 23bd139 to c2f113a.
I found this breaks Llama4 NVFP4 with compressed-tensors:
lm_eval --model vllm --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 475, in load_weights
moe_loaded = self.load_moe_expert_weights(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 402, in load_moe_expert_weights
weight_loader(param,
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1202, in weight_loader
self._load_model_weight_or_group_weight_scale(
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 904, in _load_model_weight_or_group_weight_scale
self._load_w2(shard_dim=shard_dim,
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 971, in _load_w2
expert_data.copy_(loaded_weight)
RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 0
On main I'm able to run the eval correctly
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9090|± |0.0079|
| | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
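Background on why FP4 checkpoints need format-aware loading (and why a uint8-dtype heuristic can behave differently across checkpoint producers): NVFP4 packs two 4-bit codes into each byte, so one dimension of the stored tensor is half its logical size, and different exporters may pack or lay out weights differently. The snippet below is a self-contained sketch of that packing arithmetic, not vLLM's implementation, and makes no claim about the exact root cause of the shape mismatch above:

```python
import torch

def pack_fp4(codes: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit codes (one per uint8) into single bytes along the last dim."""
    assert codes.dtype == torch.uint8 and codes.shape[-1] % 2 == 0
    lo = codes[..., 0::2] & 0x0F
    hi = codes[..., 1::2] & 0x0F
    return lo | (hi << 4)

def unpack_fp4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_fp4: recover one uint8 per 4-bit code, doubling the last dim."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack((lo, hi), dim=-1).flatten(-2)

# A logical [rows, cols] weight is stored as [rows, cols // 2] bytes once packed,
# which is why the loader must know whether, and along which dim, a tensor is packed.
codes = torch.randint(0, 16, (128, 5120), dtype=torch.uint8)
packed = pack_fp4(codes)                      # shape: (128, 2560)
assert torch.equal(unpack_fp4(packed), codes)
print(codes.shape, "->", packed.shape)
```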
@mgoin I will debug this today.
Force-pushed from a46f5df to bb1e7e0.
Pushed a new fix and added a bunch of comments to explain what's going on. Accuracy tests:
ModelOpt Scout FP8:
ModelOpt Scout FP4:
RedHat Scout NVFP4:
I also verified with the reduced Maverick model (used in the pipeline) and it worked. I only ran TP1 and didn't have a chance to run TP2; however, my latest change does not touch the sharding logic, so it should be okay.
Fix the weight loading issues and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model. Signed-off-by: Po-Han Huang <[email protected]>
Force-pushed from bb1e7e0 to edfd4f9.
Need further debugging...
I see the same tests also failed in #21921, so they are probably not caused by my change...
I saw errors like this in pipeline logs:
But is that caused by my change?
Looks in a good state to me now, thanks for the hard work.
Validated that existing FP8, INT4, and FP4 models are unaffected:
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9037|± |0.0081|
| | |strict-match | 5|exact_match|↑ |0.8901|± |0.0086|
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9151|± |0.0077|
| | |strict-match | 5|exact_match|↑ |0.8961|± |0.0084|
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9075|± |0.0080|
| | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
Signed-off-by: Po-Han Huang <[email protected]>
Signed-off-by: Po-Han Huang <[email protected]>
Fix the weight loading issues and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model.
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.

Purpose
Fix the weight loading issues and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model.
Test Plan
Run Scout FP4/FP8 accuracy tests on TP2.
Test Result
Scout FP4 TP2:
Scout FP8 TP2:
(Optional) Documentation Update