
Conversation

@baonudesifeizhai (Contributor) commented Sep 8, 2025

Purpose

This PR adds EAGLE3 speculative decoding support for GPT-OSS models, speeding up inference by having a lightweight draft head propose several tokens per step that the target model then verifies in parallel.

  1. Model architecture support: register the LlamaForCausalLMEagle3 draft architecture
  2. GPT-OSS model integration: implement the SupportsEagle3 interface on GptOssForCausalLM
  3. Configuration fixes: handle empty/"." quantization methods and missing draft_vocab_size
  4. EAGLE3 implementation: auxiliary hidden-state extraction and draft/target dimension handling
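
For context, the kind of invocation this targets looks roughly like the sketch below. It is a minimal offline example, not taken from this PR: the draft checkpoint path is a placeholder and the speculative_config field names should be checked against the installed vLLM version.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: the draft model path below is a placeholder, and the
# speculative_config keys follow vLLM's EAGLE examples from memory.
llm = LLM(
    model="openai/gpt-oss-120b",
    speculative_config={
        "method": "eagle3",
        "model": "path/to/gpt-oss-120b-eagle3-draft",  # hypothetical checkpoint
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```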

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

baonudesifeizhai and others added 18 commits September 7, 2025 19:53
- Add LlamaForCausalLMEagle3 class in llama_eagle3.py
- Update registry to support LlamaForCausalLMEagle3 architecture
- Re-enable LlamaForCausalLMEagle3 in test configurations
- Add qwen3_eagle3 test case back to spec_decode tests

This enables support for GPT-OSS-120B Eagle3 speculative decoding models.
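
For orientation, the registry change described above follows vLLM's convention of mapping an HF architecture string to a (module, class) pair. This is a sketch from memory; the exact dict name and neighboring entries may differ, and a later commit in this PR re-points the new entry at Eagle3LlamaForCausalLM to match main.

```python
# vllm/model_executor/models/registry.py (sketch; names approximate)
_SPECULATIVE_DECODING_MODELS = {
    "EagleLlamaForCausalLM": ("llama_eagle", "EagleLlamaForCausalLM"),
    "Eagle3LlamaForCausalLM": ("llama_eagle3", "Eagle3LlamaForCausalLM"),
    # New: architecture string used by GPT-OSS Eagle3 draft checkpoints.
    "LlamaForCausalLMEagle3": ("llama_eagle3", "LlamaForCausalLMEagle3"),
}
```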
- Handle empty quantization method string in SpeculativeConfig
- Fix code formatting in LlamaForCausalLMEagle3 class

This resolves the 'Unknown quantization method: .' error when using
GPT-OSS-120B Eagle3 speculative decoding models.
- Add 'gpt_oss' to eagle3_target_supported list
- Add field_validator for SpeculativeConfig quantization field
- Handle empty quantization method strings properly

This should resolve the remaining quantization validation errors
when using GPT-OSS-120B Eagle3 models.
- Move field_validator to correct position after field definitions
- This should properly handle empty quantization method strings
- Resolves the remaining 'Unknown quantization method: .' error
- Add None to QuantizationMethods Literal type
- Handle '.' quantization method string in field_validator
- This should resolve the 'Unknown quantization method: .' error

The issue was that QuantizationMethods Literal type didn't include None,
and the Eagle3 model config contains a '.' quantization method value.
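
A minimal sketch of the validator these commits describe, assuming a pydantic-style SpeculativeConfig; the class shown is a stand-in, but the validator name matches the one discussed later in this thread.

```python
from typing import Any, Optional

from pydantic import BaseModel, field_validator


class SpeculativeConfigSketch(BaseModel):
    """Illustrative stand-in for vLLM's SpeculativeConfig."""

    quantization: Optional[str] = None

    @field_validator("quantization", mode="before")
    @classmethod
    def _validate_quantization_method(cls, value: Any) -> Any:
        # Some Eagle3 draft configs carry "" or "." as quant_method;
        # normalize those to None so the allowed-methods check passes.
        if isinstance(value, str) and value.strip() in ("", "."):
            return None
        return value
```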
- Filter out None values from QUANTIZATION_METHODS list
- Fix line length formatting issue
- Resolve KeyError when quantization is None
- Fix spacing and line formatting issues
- Ensure consistent code style
- Handle empty string and dot (.) quantization values
- Convert them to None to prevent validation errors
- Apply same logic as SpeculativeConfig validator
- Add empty string ('') and dot ('.') as valid quantization methods
- This allows Pydantic validation to pass before field_validator processes them
- Fixes the 'Unknown quantization method' validation error
- Only set quantization if quant_method is not empty string
- Prevents empty string from being assigned to self.quantization
- Fixes the 'Unknown quantization method' validation error
- Addresses the root cause of the quantization validation issue
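
Roughly, the change in ModelConfig._verify_quantization amounts to guarding that assignment. The helper below is a self-contained sketch of the same idea, not the actual vLLM code.

```python
from typing import Optional


def normalize_quant_method(quant_cfg: Optional[dict]) -> Optional[str]:
    """Sketch of the guard described above: return a quantization method
    only when the HF quant config actually names one."""
    if not quant_cfg:
        return None
    quant_method = quant_cfg.get("quant_method", "")
    # An empty string would later trip the "Unknown quantization method"
    # check, so treat it the same as no quantization at all.
    return quant_method or None
```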
- Add _ensure_draft_vocab_size method to set draft_vocab_size from target model vocab_size
- Prevents TypeError when draft_vocab_size is None in ParallelLMHead initialization
- Ensures Eagle3 models have proper vocabulary size configuration
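
A rough sketch of the helper the commit above introduces; the attribute names on the draft and target configs are assumptions.

```python
def _ensure_draft_vocab_size(draft_config, target_config) -> None:
    """Sketch: make sure the Eagle3 draft config has a usable vocab size."""
    # Without this, ParallelLMHead would be constructed with
    # num_embeddings=None and raise a TypeError.
    if getattr(draft_config, "draft_vocab_size", None) is None:
        draft_config.draft_vocab_size = target_config.vocab_size
```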
- Break long lines to comply with 80 character limit
- Improve code readability and maintain style consistency
- Use yapf to automatically format code according to project standards
- Improve code consistency and readability
…patibility

- Add SupportsEagle3 interface to GptOssForCausalLM class
- Implement set_aux_hidden_state_layers and get_eagle3_aux_hidden_state_layers methods
- Fix embedding attribute compatibility in Eagle3 loader to handle both embed_tokens and embedding
- Support dynamic detection of embedding layer names across different model architectures
- Remove unused variable to fix linting errors
- Implement proper layer selection for auxiliary hidden state extraction
- Fix forward method to return auxiliary states when EAGLE3 is enabled
- Use middle layers for auxiliary state extraction (common EAGLE3 pattern)
- Apply yapf formatting to fix line length issues
- Ensure compatibility with EAGLE3's expectation of tuple return values
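
A condensed sketch of the GPT-OSS wiring described above. The method names come from the commit messages (SupportsEagle3, set_aux_hidden_state_layers, get_eagle3_aux_hidden_state_layers); the import path, attribute names, and the specific layer-selection heuristic are assumptions.

```python
from torch import nn

# Import path assumed; the Eagle3 support mixin lives alongside the other
# model interfaces in vLLM.
from vllm.model_executor.models.interfaces import SupportsEagle3


class GptOssForCausalLM(nn.Module, SupportsEagle3):
    # Trimmed to the Eagle3 plumbing only.

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Record which decoder layers should also expose their hidden
        # states so they can be fed to the Eagle3 draft head.
        self.model.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # "Middle layers" heuristic from the commit message: one early,
        # one middle, one late layer.
        num_layers = len(self.model.layers)
        return (2, num_layers // 2, num_layers - 3)
```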
@houseroad (Collaborator) left a comment


Can we add some perf numbers to the PR description?

@baonudesifeizhai (Contributor, Author)

Can we add some perf numbers to the PR description?

Sure. Currently testing whether removing the def _validate_quantization_method(value: Any) -> Any: validator works or not.

@frank-wei (Contributor)

Is the draft model trained in-house, or is it an OSS one?

@zixi-qi (Collaborator) commented Sep 9, 2025

_validate_quantization_method

I ran into same issue with the EAGLE3 model from NVIDIA and fixed it with this change

diff --git a/vllm/config/__init__.py b/vllm/config/__init__.py

def iter_architecture_defaults():
     yield from _SUFFIX_TO_DEFAULTS
@@ -1120,6 +1139,8 @@ class ModelConfig:
                 elif quant_algo is not None:
                     raise ValueError(
                         f"Unknown ModelOpt quant algo: {quant_algo}")
+                else:
+                    quant_cfg = None
 
         return quant_cfg

@baonudesifeizhai (Contributor, Author)

_validate_quantization_method

I ran into same issue with the EAGLE3 model from NVIDIA and fixed it with this change

[screenshot] Still facing a TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'. Full log: https://pastebin.ubuntu.com/p/q2Ft53cKY9/

- Add GPT-OSS specific configuration handling in _setup_model_specific_config
- Ensure fc layer input dimensions match target model's hidden size
- Fix RuntimeError: mat1 and mat2 shapes cannot be multiplied

This resolves the dimension mismatch between GPT-OSS-120B (16128) and
EAGLE3 model expectations (8640x2880).
- Add debug prints to understand fc layer dimensions
- Help diagnose matrix dimension mismatch issue
- Remove unused _setup_model_specific_config method
- Remove debug print statements
- Keep only essential combine_hidden_states override for dimension mismatch handling
- Simplify the class to focus on core functionality
- Ensure dynamically created fc layer is on the same device as hidden_states
- Add .to(device) to move the new Linear layer to GPU
- Resolves RuntimeError: Expected all tensors to be on the same device
- Add target_hidden_size configuration in _ensure_draft_vocab_size method
- Ensure EAGLE3 model uses correct input dimensions for fc layer
- Remove runtime dimension mismatch fixes (no longer needed)
- Fix code formatting issues

This should resolve the matrix dimension mismatch errors for GPT-OSS-120B with EAGLE3

mergify bot commented Sep 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @baonudesifeizhai.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Sep 10, 2025
- Add debug prints in Eagle3LlamaForCausalLM and LlamaForCausalLMEagle3 constructors
- This will help identify if our custom class is being instantiated
- Needed to debug why matrix dimension mismatch still occurs
- Adopt main branch's solution for LlamaForCausalLMEagle3 mapping
- LlamaForCausalLMEagle3 now maps to Eagle3LlamaForCausalLM class
- Re-enable EAGLE3 tests as fixed in main branch
- Remove TODO comments that are no longer needed
- Remove debug prints from Eagle3LlamaForCausalLM and LlamaForCausalLMEagle3
- Main branch has already fixed the EAGLE3 issues
- Clean up code for production use
mergify bot removed the needs-rebase label on Sep 10, 2025
- Add _ensure_draft_vocab_size method to Eagle3LlamaForCausalLM class
- This ensures draft_vocab_size and target_hidden_size are properly set
- Fixes TypeError when draft_vocab_size is None in ParallelLMHead initialization
- Now LlamaForCausalLMEagle3 models can work with Eagle3LlamaForCausalLM class
- Remove empty string checks in SpeculativeConfig.__post_init__
- Remove empty string checks in _verify_quantization method
- Empty strings should be handled at CLI level via optional_type function
- Python API should not pass empty strings for quantization methods
- Add else branch to set quant_cfg = None when quant_algo is None
- This prevents quant_cfg from being undefined in ModelOpt quantization handling
- Fixes issue with EAGLE3 models from NVIDIA that don't specify quant_algo
- Add debug prints in combine_hidden_states to show input/output shapes
- This will help identify why fc layer dimensions don't match
- Need to understand target_hidden_size vs actual input dimensions
- Add logic to resize fc layer based on actual weight dimensions
- When target_hidden_size is None, infer correct dimensions from loaded weights
- This fixes the matrix dimension mismatch (16128 vs 8640) issue
- Update target_hidden_size in config after resizing for consistency
- Move dynamic fc layer resizing from load_weights to combine_hidden_states
- This ensures the layer is resized at runtime when dimensions don't match
- Add more debug output to track the resizing process
- This should fix the matrix dimension mismatch issue
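
Taken together, the resize commits above describe something like the following inside the draft model's combine_hidden_states. This is a sketch only (a plain nn.Linear stands in for vLLM's parallel linear layer, and the 8640/2880 shapes come from the error quoted earlier); as the next comment notes, the dimension issue was still not fully resolved.

```python
import torch
from torch import nn


class Eagle3DraftModelSketch(nn.Module):
    """Stand-in for the Eagle3 draft model; only the resize path is shown."""

    def __init__(self, draft_hidden: int = 2880, fc_in: int = 8640) -> None:
        super().__init__()
        self.fc = nn.Linear(fc_in, draft_hidden, bias=False)

    def combine_hidden_states(self, hidden_states: torch.Tensor) -> torch.Tensor:
        in_features = hidden_states.shape[-1]
        if self.fc.in_features != in_features:
            # Rebuild the projection with the actual input width (e.g. 16128
            # for GPT-OSS-120B auxiliary states) and keep it on the same
            # device/dtype as the incoming activations.
            self.fc = nn.Linear(
                in_features, self.fc.out_features, bias=False
            ).to(device=hidden_states.device, dtype=hidden_states.dtype)
        return self.fc(hidden_states)
```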
@baonudesifeizhai (Contributor, Author)

Still not fixed; there is still some problem with chat.

Labels
gpt-oss (Related to GPT-OSS models), llama (Related to Llama models), new-model (Requests to new models), speculative-decoding, v1