Enable gptoss eagle3 support #24325 #24410
Conversation
- Add LlamaForCausalLMEagle3 class in llama_eagle3.py
- Update registry to support LlamaForCausalLMEagle3 architecture
- Re-enable LlamaForCausalLMEagle3 in test configurations
- Add qwen3_eagle3 test case back to spec_decode tests

This enables support for GPT-OSS-120B Eagle3 speculative decoding models.
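For orientation, vLLM resolves the `architectures` entry of a checkpoint's config through its model registry, which maps an architecture name to a (module, class) pair. A registration for the new architecture might look roughly like the sketch below; the dict name and the final mapping to Eagle3LlamaForCausalLM follow later commits in this PR, but treat the exact entries as assumptions.

```python
# vllm/model_executor/models/registry.py -- illustrative sketch only.
_SPECULATIVE_DECODING_MODELS = {
    # Existing EAGLE3 draft entry for Llama-style draft models.
    "Eagle3LlamaForCausalLM": ("llama_eagle3", "Eagle3LlamaForCausalLM"),
    # New alias: checkpoints whose config declares the
    # LlamaForCausalLMEagle3 architecture resolve to the same class.
    "LlamaForCausalLMEagle3": ("llama_eagle3", "Eagle3LlamaForCausalLM"),
}
```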
- Handle empty quantization method string in SpeculativeConfig
- Fix code formatting in LlamaForCausalLMEagle3 class

This resolves the 'Unknown quantization method: .' error when using GPT-OSS-120B Eagle3 speculative decoding models.

- Add 'gpt_oss' to eagle3_target_supported list
- Add field_validator for SpeculativeConfig quantization field
- Handle empty quantization method strings properly

This should resolve the remaining quantization validation errors when using GPT-OSS-120B Eagle3 models.

- Move field_validator to correct position after field definitions
- This should properly handle empty quantization method strings
- Resolves the remaining 'Unknown quantization method: .' error

- Add None to QuantizationMethods Literal type
- Handle '.' quantization method string in field_validator
- This should resolve the 'Unknown quantization method: .' error

The issue was that the QuantizationMethods Literal type didn't include None, and the Eagle3 model config contains a '.' quantization method value.

- Filter out None values from QUANTIZATION_METHODS list
- Fix line length formatting issue
- Resolve KeyError when quantization is None

- Fix spacing and line formatting issues
- Ensure consistent code style

- Handle empty string and dot (.) quantization values
- Convert them to None to prevent validation errors
- Apply same logic as SpeculativeConfig validator

- Add empty string ('') and dot ('.') as valid quantization methods
- This allows Pydantic validation to pass before field_validator processes them
- Fixes the 'Unknown quantization method' validation error

- Only set quantization if quant_method is not an empty string
- Prevents empty string from being assigned to self.quantization
- Fixes the 'Unknown quantization method' validation error
- Addresses the root cause of the quantization validation issue
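Taken together, these quantization fixes amount to normalizing the bogus values some EAGLE3 draft configs ship ('' or '.') before they reach the known-methods check. A minimal sketch, assuming SpeculativeConfig is a Pydantic model and that quantization is a plain optional string (the real field is a Literal of known methods):

```python
from typing import Optional

from pydantic import BaseModel, field_validator


class SpeculativeConfig(BaseModel):
    # Quantization method of the draft model; None means unquantized/inherit.
    quantization: Optional[str] = None

    @field_validator("quantization", mode="before")
    @classmethod
    def _validate_quantization_method(cls, value):
        # Some EAGLE3 draft checkpoints set quant_method to "" or ".",
        # which would otherwise trigger "Unknown quantization method: .";
        # treat both as "no quantization".
        if value in ("", "."):
            return None
        return value
```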
- Add _ensure_draft_vocab_size method to set draft_vocab_size from target model vocab_size
- Prevents TypeError when draft_vocab_size is None in ParallelLMHead initialization
- Ensures Eagle3 models have proper vocabulary size configuration
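The helper described here is essentially a backfill step run before the LM head is constructed; a sketch under the assumption that both configs expose vocab-size attributes with these names:

```python
def _ensure_draft_vocab_size(self, draft_config, target_config) -> None:
    """Backfill draft-model fields that some EAGLE3 checkpoints omit."""
    # ParallelLMHead(num_embeddings, ...) fails with a TypeError if it is
    # handed None, so fall back to the target model's vocabulary size.
    if getattr(draft_config, "draft_vocab_size", None) is None:
        draft_config.draft_vocab_size = target_config.vocab_size
```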
- Break long lines to comply with 80 character limit
- Improve code readability and maintain style consistency

- Use yapf to automatically format code according to project standards
- Improve code consistency and readability

…patibility
- Add SupportsEagle3 interface to GptOssForCausalLM class
- Implement set_aux_hidden_state_layers and get_eagle3_aux_hidden_state_layers methods
- Fix embedding attribute compatibility in Eagle3 loader to handle both embed_tokens and embedding
- Support dynamic detection of embedding layer names across different model architectures
- Remove unused variable to fix linting errors

- Implement proper layer selection for auxiliary hidden state extraction
- Fix forward method to return auxiliary states when EAGLE3 is enabled
- Use middle layers for auxiliary state extraction (common EAGLE3 pattern)
- Apply yapf formatting to fix line length issues
- Ensure compatibility with EAGLE3's expectation of tuple return values
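In spirit, these two commits add the plumbing that lets EAGLE3 ask GPT-OSS for auxiliary hidden states. A sketch of what that looks like on the target model; the import path, the attribute names on the GPT-OSS model, and the specific layer indices are assumptions, with an early/middle/late selection being one common EAGLE3 pattern:

```python
import torch.nn as nn

# Import path assumed from the commit description.
from vllm.model_executor.models.interfaces import SupportsEagle3


class GptOssForCausalLM(nn.Module, SupportsEagle3):
    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Record which decoder layers should also expose their hidden
        # states so the EAGLE3 draft head can consume them in forward().
        self.model.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Pick an early, a middle, and a late layer.
        num_layers = len(self.model.layers)
        return (2, num_layers // 2, num_layers - 3)
```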
Can we add some perf numbers to the PR description?
Sure. I'm currently testing whether deleting the def _validate_quantization_method(value: Any) -> Any: validator works or not.
Is the draft model trained in-house, or did you use an OSS one?
I ran into the same issue with the EAGLE3 model from NVIDIA and fixed it with this change.
- Add GPT-OSS specific configuration handling in _setup_model_specific_config
- Ensure fc layer input dimensions match target model's hidden size
- Fix RuntimeError: mat1 and mat2 shapes cannot be multiplied

This resolves the dimension mismatch between GPT-OSS-120B (16128) and EAGLE3 model expectations (8640x2880).

- Add debug prints to understand fc layer dimensions
- Help diagnose matrix dimension mismatch issue

- Remove unused _setup_model_specific_config method
- Remove debug print statements
- Keep only essential combine_hidden_states override for dimension mismatch handling
- Simplify the class to focus on core functionality

- Ensure dynamically created fc layer is on the same device as hidden_states
- Add .to(device) to move the new Linear layer to GPU
- Resolves RuntimeError: Expected all tensors to be on the same device

- Add target_hidden_size configuration in _ensure_draft_vocab_size method
- Ensure EAGLE3 model uses correct input dimensions for fc layer
- Remove runtime dimension mismatch fixes (no longer needed)
- Fix code formatting issues

This should resolve the matrix dimension mismatch errors for GPT-OSS-120B with EAGLE3.
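The fix amounts to making the draft head's down-projection aware of the target model's width. Roughly, as a sketch: build_combine_fc is a hypothetical helper, the attribute names are assumptions, and the 3x factor reflects the three auxiliary hidden states EAGLE3 concatenates before projecting.

```python
import torch


def build_combine_fc(config) -> torch.nn.Linear:
    """Down-projection for the concatenated auxiliary hidden states."""
    # EAGLE3 concatenates three auxiliary hidden states from the *target*
    # model, so the fc input width must be 3 * target_hidden_size rather
    # than 3 * the draft model's own hidden size.
    target_hidden = getattr(config, "target_hidden_size", None) or config.hidden_size
    return torch.nn.Linear(target_hidden * 3, config.hidden_size, bias=False)
```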
This pull request has merge conflicts that must be resolved before it can be merged.
- Add debug prints in Eagle3LlamaForCausalLM and LlamaForCausalLMEagle3 constructors
- This will help identify if our custom class is being instantiated
- Needed to debug why the matrix dimension mismatch still occurs

- Adopt main branch's solution for LlamaForCausalLMEagle3 mapping
- LlamaForCausalLMEagle3 now maps to the Eagle3LlamaForCausalLM class
- Re-enable EAGLE3 tests as fixed in main branch
- Remove TODO comments that are no longer needed

- Remove debug prints from Eagle3LlamaForCausalLM and LlamaForCausalLMEagle3
- Main branch has already fixed the EAGLE3 issues
- Clean up code for production use

- Add _ensure_draft_vocab_size method to Eagle3LlamaForCausalLM class
- This ensures draft_vocab_size and target_hidden_size are properly set
- Fixes TypeError when draft_vocab_size is None in ParallelLMHead initialization
- Now LlamaForCausalLMEagle3 models can work with the Eagle3LlamaForCausalLM class

- Remove empty string checks in SpeculativeConfig.__post_init__
- Remove empty string checks in _verify_quantization method
- Empty strings should be handled at CLI level via the optional_type function
- The Python API should not pass empty strings for quantization methods
- Add else branch to set quant_cfg = None when quant_algo is None
- This prevents quant_cfg from being undefined in ModelOpt quantization handling
- Fixes issue with EAGLE3 models from NVIDIA that don't specify quant_algo
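Loosely, the guard described here looks like the following; the surrounding ModelOpt config-parsing names are assumptions, and only the else branch is the point:

```python
# Sketch: hf_quant_config is the parsed quantization config of the checkpoint.
quant_algo = (hf_quant_config or {}).get("quantization", {}).get("quant_algo")
if quant_algo is not None:
    quant_cfg = {"quant_algo": quant_algo}
else:
    # NVIDIA's EAGLE3 draft checkpoints may not specify quant_algo at all;
    # set quant_cfg explicitly to None instead of leaving it unbound.
    quant_cfg = None
```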
- Add debug prints in combine_hidden_states to show input/output shapes
- This will help identify why fc layer dimensions don't match
- Need to understand target_hidden_size vs actual input dimensions

- Add logic to resize fc layer based on actual weight dimensions
- When target_hidden_size is None, infer correct dimensions from loaded weights
- This fixes the matrix dimension mismatch (16128 vs 8640) issue
- Update target_hidden_size in config after resizing for consistency

- Move dynamic fc layer resizing from load_weights to combine_hidden_states
- This ensures the layer is resized at runtime when dimensions don't match
- Add more debug output to track the resizing process
- This should fix the matrix dimension mismatch issue
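A sketch of the runtime fallback these commits describe, assuming the draft head stores its projection as self.fc (the class and method names are illustrative; the real signatures may differ):

```python
import torch


class Eagle3DraftHead(torch.nn.Module):  # illustrative container only
    def combine_hidden_states(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # If the concatenated auxiliary hidden states are wider than the fc
        # layer expects (e.g. 16128 vs. 8640), rebuild the projection with
        # the actual input width, on the same device/dtype as the inputs.
        in_features = hidden_states.shape[-1]
        if self.fc.in_features != in_features:
            self.fc = torch.nn.Linear(
                in_features, self.fc.out_features, bias=False
            ).to(device=hidden_states.device, dtype=hidden_states.dtype)
        return self.fc(hidden_states)
```

Note that re-creating the layer at runtime discards whatever weights were loaded for it, which is presumably why the thread keeps returning to configuring target_hidden_size up front instead.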
Still not fixed; there is still a problem with chat.
Purpose
This PR adds EAGLE3 speculative decoding support for GPT-OSS models, enabling inference speedups by having a draft head propose several tokens that the target model then verifies in parallel.
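As a concrete (hypothetical) example of what this enables, an offline run might look like the following; the draft-model path is a placeholder, and the speculative_config keys should be checked against the installed vLLM version:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    # Placeholder path: an EAGLE3 draft head trained against GPT-OSS-120B.
    speculative_config={
        "method": "eagle3",
        "model": "<path-to-gpt-oss-120b-eagle3-draft>",
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```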
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.