Support KIMI K2 Thinking int4 checkpoint PTQ #669
base: main
Conversation
Signed-off-by: Chenjie Luo <[email protected]>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Edwardf0t1
left a comment
Is the ckpt generated identical to @jingyu-ml previously generated nvfp4 ckpt?
    pass

try:
    from compressed_tensors.linear.compressed_linear import CompressedLinear
Should we add compressed-tensors as an optional dependency?
@kevalmorabia97 @realAsma what do you think?
If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed? What benefit do we get from adding it as an optional dependency?
compressed-tensors' main dependencies are torch and transformers, so it should be pretty lightweight to add as a dependency; fine if you want to add it. But if it's not commonly used by customers, perhaps we can skip it.
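For reference, a minimal sketch of what declaring it as an optional extra could look like; the package name, the extra name, and the layout of setup.py are assumptions for illustration and are not taken from this repository:

```python
# Hypothetical setup.py sketch: expose compressed-tensors as an opt-in extra so the
# core install stays lightweight. "example-package" and the extra name "compress"
# are placeholders; the version pin mirrors the version tested in this PR.
from setuptools import setup

setup(
    name="example-package",
    version="0.0.1",
    extras_require={
        # `pip install example-package[compress]` pulls in compressed-tensors,
        # which itself mainly depends on torch and transformers.
        "compress": ["compressed-tensors>=0.12.0"],
    },
)
```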
Can we move this to a separate file modelopt/torch/quantization/plugins/compressed_tensor.py?
If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed?
This is a good point. +1
Are we planning to have any unit tests for compressed tensor integration?
not right now
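For reference, a hypothetical sketch of what a minimal test for the decompress-and-cleanup contract discussed here could look like. It is not part of this PR: `FakePackedLinear` and `FakeCompressor` are stand-ins invented for illustration instead of a real CompressedLinear, and only the optional-dependency skip and the attribute-cleanup behavior are exercised.

```python
# Hypothetical test sketch (not in the PR): verify that a packed module gains a dense
# weight and drops its packed buffers after unpack_weight().
import pytest
import torch
import torch.nn as nn


def test_unpack_weight_contract():
    pytest.importorskip("compressed_tensors")  # skip if the optional dependency is missing
    from compressed_tensors.quantization import QuantizationStatus

    class FakeCompressor:
        def decompress_module(self, module):
            # Pretend to reconstruct a dense bf16 weight from the packed buffers.
            return torch.zeros(4, 4, dtype=torch.bfloat16)

    class FakePackedLinear(nn.Module):
        def __init__(self):
            super().__init__()
            self.quantization_status = QuantizationStatus.COMPRESSED
            self.compressor = FakeCompressor()
            self.weight_packed = torch.zeros(4, 2, dtype=torch.int32)
            self.weight_scale = torch.ones(4, 1)

        def unpack_weight(self):
            if self.quantization_status == QuantizationStatus.COMPRESSED:
                self.weight = nn.Parameter(
                    self.compressor.decompress_module(self), requires_grad=False
                )
                del self.weight_packed
                del self.weight_scale

    module = FakePackedLinear()
    module.unpack_weight()
    assert module.weight.dtype == torch.bfloat16
    assert not hasattr(module, "weight_packed")
    assert not hasattr(module, "weight_scale")
```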
Can we move this to a separate file
modelopt/torch/quantization/plugins/compressed_tensor.py?
How strongly do you feel about it? Right now I feel this still falls under the HF plugins, as it's part of the HF invocation.
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv Did we run any deployment and accuracy test for the ckpt generated with this flow to make sure it's correct? Asking because there's a customer who wants to generate the ckpt by themselves. In addition, I heard from @jingyu-ml that we need to modify modeling_deepseek.py to enable our PTQ flow.
With transformers 4.48, we don't need to modify the original model. I haven't had a chance to validate the checkpoint yet; I will probably continue after the Christmas break. @Edwardf0t1 is this an urgent request?
  # When computing the device_map, assuming bfloat16 precision by default,
  # unless specified by the hf_config.
- torch_dtype = getattr(hf_config, "torch_dtype", torch.float16)
+ torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
This change is suspicious to me.
Won't this impact other models?
Most likely not. The torch_dtype should be set for most of the model cards.
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Hello @cjluo-nv. Please specify exactly which transformers version you have tested this with. I've just used 4.48.0 and quickly hit some import issues (for …
Signed-off-by: Chenjie Luo <[email protected]>
📝 Walkthrough
Changes extend quantization support for HuggingFace models with a new QuantCompressedLinear module, adjust model loading for pack-quantized format, disable deepseek-specific MLA quantization, refine export/cleanup procedures, and establish bfloat16 as the default precision instead of float16.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
5328ae9 to 66ac7a6
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@examples/llm_ptq/example_utils.py`:
- Around line 354-361: The current check uses
hf_config.quantization_config.get("format", None) which can raise AttributeError
if quantization_config is None or not a dict; instead, defensively read
quant_config first (e.g., quant_config = getattr(hf_config,
"quantization_config", None) or {}), derive quant_format safely (use
dict.get("format") if isinstance(quant_config, dict) else getattr(quant_config,
"format", None)), then compare quant_format == "pack-quantized" before
constructing torch_dtype and calling AutoModelForCausalLM.from_pretrained with
device_map="auto", trust_remote_code and torch_dtype.
- Around line 190-194: The deepseek MLA-disable branch is unreachable because
the `if model_type == "deepseek"` block is nested inside the `if model_type in
["qwen3moe", "qwen3next"] and qformat == "nvfp4":` condition; pull the deepseek
logic out into its own top-level conditional (separate from the
qwen3moe/qwen3next+nvfp4 check) so it runs when `model_type == "deepseek"`, and
keep the existing modifications to `quant_cfg` that set
`quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}` and
`quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}` intact.
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 618-636: Initialize generated_ids_after_ptq before the
verbose/conditional block so it always exists regardless of args.verbose; e.g.,
set generated_ids_after_ptq = None prior to any if args.verbose / generation
logic (the symbol to update is generated_ids_after_ptq and ensure any later
references to it after the verbose block handle the None case or are guarded),
leaving the rest of the generation branches (full_model.generate,
run_nemotron_vl_preview, and the Llama4 warning) unchanged.
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 605-612: In unpack_weight, avoid unguarded deletions that can
raise AttributeError: only delete self.weight_packed and self.weight_scale if
they exist (e.g., check hasattr(self, "weight_packed") / hasattr(self,
"weight_scale")) or perform the deletes inside the branch where
self.quantization_status == QuantizationStatus.COMPRESSED after successful
decompression; reference the unpack_weight method and the
QuantizationStatus.COMPRESSED check to ensure deletions occur only when
appropriate.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (3)
modelopt/torch/export/unified_export_hf.py (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
unpack_weight(605-611)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register(330-371)
examples/llm_ptq/hf_ptq.py (1)
examples/llm_ptq/example_utils.py (1)
run_nemotron_vl_preview(49-93)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (8)
modelopt/torch/export/layer_utils.py (1)
346-349: LGTM! The updated `is_quantlinear` logic correctly extends detection to include `QuantCompressedLinear` while maintaining the existing `QuantLinear` detection and `lora` exclusion. The implementation aligns with the new `_QuantCompressedLinear` class added in the quantization plugins.
modelopt/torch/quantization/plugins/huggingface.py (1)
689-697: LGTM! The registration follows the established pattern used for other optional HuggingFace model types (Llama4, Dbrx, Mixtral, etc.) with proper ImportError handling for the optional `compressed-tensors` dependency.
modelopt/torch/export/quant_utils.py (1)
884-890: LGTM! Adding `"weight_shape"` to the skip_keys set correctly filters out metadata attributes from CompressedLinear modules that shouldn't be included in the exported state dict. This aligns with the new `_QuantCompressedLinear` support.
modelopt/torch/export/unified_export_hf.py (3)
395-396: LGTM with a note on performance. The explicit CUDA cache clearing helps prevent OOM during export of large models. Note that calling `empty_cache()` after every layer can add some overhead for models with many layers, but the memory safety benefit likely outweighs this for the target use case (KIMI K2, which is a large MoE model).
523-525: LGTM! The weight unpacking step correctly prepares CompressedLinear modules for export by decompressing packed weights before quantization processing. The `hasattr` guard ensures this only applies to modules with the `weight_packed` attribute.
597-600: LGTM! Removing the `hf_quantizer` attribute before calling `save_pretrained` ensures that HuggingFace's serialization doesn't interfere with the custom quantized state dict export. The `getattr` with default `None` safely handles models without this attribute.
examples/llm_ptq/scripts/huggingface_example.sh (1)
56-58: LGTM! The `nvfp4_mlp_only` format is correctly added to both the validation list and the corresponding error message, maintaining consistency with the new quantization format support in `hf_ptq.py`.
examples/llm_ptq/example_utils.py (1)
382-388: Default dtype change from `float16` to `bfloat16` looks good. This is a sensible default for modern GPUs (Ampere and newer) with better numerical stability. The `getattr` pattern preserves model-specified dtypes when available. Note: Pre-Ampere GPUs may have limited bfloat16 support, but this should be fine as calibration workloads typically target newer hardware.
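As a side note, a quick way to check bfloat16 support on the current device; this snippet is illustrative and not part of the PR:

```python
# Illustrative check only: report whether the active CUDA device supports bfloat16
# (generally true on Ampere / SM80 and newer GPUs).
import torch

if torch.cuda.is_available():
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device available")
```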
examples/llm_ptq/example_utils.py
Outdated
if model_type == "deepseek":
    # Disable MLA quantization for accuracy.
    quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
    quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
Deepseek MLA disable logic is unreachable due to incorrect nesting.
The if model_type == "deepseek" block is nested inside the if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4": condition (line 183). Since model_type cannot simultaneously be "deepseek" and in ["qwen3moe", "qwen3next"], this code will never execute.
🐛 Proposed fix - move to a separate conditional block
 if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4":
     # Disable the attention projection layers to retain accuracy
     quant_cfg["quant_cfg"]["model*.*attn*in_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*q_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*k_proj*"] = {"enable": False}
     quant_cfg["quant_cfg"]["model*.*attn*v_proj*"] = {"enable": False}
-    if model_type == "deepseek":
-        # Disable MLA quantization for accuracy.
-        quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
+if model_type == "deepseek":
+    # Disable MLA quantization for accuracy.
+    quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
+    quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}

 return quant_cfg

🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 190 - 194, The deepseek
MLA-disable branch is unreachable because the `if model_type == "deepseek"`
block is nested inside the `if model_type in ["qwen3moe", "qwen3next"] and
qformat == "nvfp4":` condition; pull the deepseek logic out into its own
top-level conditional (separate from the qwen3moe/qwen3next+nvfp4 check) so it
runs when `model_type == "deepseek"`, and keep the existing modifications to
`quant_cfg` that set `quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable":
False}` and `quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}`
intact.
examples/llm_ptq/example_utils.py
Outdated
elif hf_config.quantization_config.get("format", None) == "pack-quantized":
    torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        ckpt_path,
        device_map="auto",
        trust_remote_code=trust_remote_code,
        torch_dtype=torch_dtype,
    )
Potential AttributeError if quantization_config is not present or not a dict.
hf_config.quantization_config.get("format", None) will raise an AttributeError if quantization_config is None or doesn't exist on the config. Consider using getattr with a safe fallback.
🐛 Proposed fix
-elif hf_config.quantization_config.get("format", None) == "pack-quantized":
+elif getattr(hf_config, "quantization_config", None) and hf_config.quantization_config.get("format", None) == "pack-quantized":
     torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
     model = AutoModelForCausalLM.from_pretrained(
         ckpt_path,
         device_map="auto",
         trust_remote_code=trust_remote_code,
         torch_dtype=torch_dtype,
     )

Alternatively, use a more defensive pattern:
elif getattr(getattr(hf_config, "quantization_config", {}), "get", lambda k, d: d)("format", None) == "pack-quantized":Or extract the check:
quant_config = getattr(hf_config, "quantization_config", None) or {}
quant_format = quant_config.get("format") if isinstance(quant_config, dict) else getattr(quant_config, "format", None)
...
elif quant_format == "pack-quantized":🤖 Prompt for AI Agents
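These alternatives could also be folded into a small helper, shown below as a sketch based on the suggestion above; the helper name `_get_quant_format` is illustrative and not code from the PR.

```python
# Sketch only: consolidate the defensive quantization_config lookup into one place.
def _get_quant_format(hf_config):
    quant_config = getattr(hf_config, "quantization_config", None) or {}
    if isinstance(quant_config, dict):
        return quant_config.get("format")
    return getattr(quant_config, "format", None)


# Usage at the call site (illustrative):
# elif _get_quant_format(hf_config) == "pack-quantized":
#     ...
```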
In `@examples/llm_ptq/example_utils.py` around lines 354 - 361, The current check
uses hf_config.quantization_config.get("format", None) which can raise
AttributeError if quantization_config is None or not a dict; instead,
defensively read quant_config first (e.g., quant_config = getattr(hf_config,
"quantization_config", None) or {}), derive quant_format safely (use
dict.get("format") if isinstance(quant_config, dict) else getattr(quant_config,
"format", None)), then compare quant_format == "pack-quantized" before
constructing torch_dtype and calling AutoModelForCausalLM.from_pretrained with
device_map="auto", trust_remote_code and torch_dtype.
examples/llm_ptq/hf_ptq.py
Outdated
# Run some samples
torch.cuda.empty_cache()
generated_ids_after_ptq = None
if model_type != "llama4" and not is_nemotron_vl_model:
    # Our fake quantizer may not be fully compatible with torch.compile.
    generated_ids_after_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
elif is_nemotron_vl_model and tokenizer is not None:
    generated_ids_after_ptq = run_nemotron_vl_preview(
        full_model,
        tokenizer,
        preview_input_ids,
        args.pyt_ckpt_path,
        "after quantization",
        allow_fallback=False,
    )
else:
    warnings.warn(
        "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
    )
Potential NameError when verbose=False.
The variable generated_ids_after_ptq is only defined inside the if args.verbose: block (line 620), but it's referenced at line 661 outside of any conditional. When --no-verbose is passed, this will raise a NameError.
🐛 Proposed fix
Initialize generated_ids_after_ptq before the verbose block:
+generated_ids_after_ptq = None
 if args.verbose:
     mtq.print_quant_summary(full_model)
     # Run some samples
     torch.cuda.empty_cache()
-    generated_ids_after_ptq = None
     if model_type != "llama4" and not is_nemotron_vl_model:

🤖 Prompt for AI Agents
In `@examples/llm_ptq/hf_ptq.py` around lines 618 - 636, Initialize
generated_ids_after_ptq before the verbose/conditional block so it always exists
regardless of args.verbose; e.g., set generated_ids_after_ptq = None prior to
any if args.verbose / generation logic (the symbol to update is
generated_ids_after_ptq and ensure any later references to it after the verbose
block handle the None case or are guarded), leaving the rest of the generation
branches (full_model.generate, run_nemotron_vl_preview, and the Llama4 warning)
unchanged.
def unpack_weight(self):
    from compressed_tensors.quantization import QuantizationStatus

    if self.quantization_status == QuantizationStatus.COMPRESSED:
        self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
    del self.weight_packed
    del self.weight_scale
Guard against missing attributes in unpack_weight.
The del self.weight_packed and del self.weight_scale statements are executed unconditionally, but these attributes may not exist if quantization_status is not COMPRESSED or if they were already deleted. This could raise an AttributeError.
🐛 Proposed fix
 def unpack_weight(self):
     from compressed_tensors.quantization import QuantizationStatus

     if self.quantization_status == QuantizationStatus.COMPRESSED:
         self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
-    del self.weight_packed
-    del self.weight_scale
+    if hasattr(self, "weight_packed"):
+        del self.weight_packed
+    if hasattr(self, "weight_scale"):
+        del self.weight_scale

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 605 - 612,
In unpack_weight, avoid unguarded deletions that can raise AttributeError: only
delete self.weight_packed and self.weight_scale if they exist (e.g., check
hasattr(self, "weight_packed") / hasattr(self, "weight_scale")) or perform the
deletes inside the branch where self.quantization_status ==
QuantizationStatus.COMPRESSED after successful decompression; reference the
unpack_weight method and the QuantizationStatus.COMPRESSED check to ensure
deletions occur only when appropriate.
…l/kimik2-thinking
Signed-off-by: Chenjie Luo <[email protected]>
66ac7a6 to e8b7fc6
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@           Coverage Diff           @@
##             main     #669   +/-   ##
=======================================
  Coverage   74.23%   74.23%
=======================================
  Files         192      192
  Lines       19033    19033
=======================================
  Hits        14129    14129
  Misses       4904     4904

☔ View full report in Codecov by Sentry.
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
…l/kimik2-thinking
Signed-off-by: Chenjie Luo <[email protected]>
I have updated that in the PR description.
What does this PR do?
Type of change: new feature
Overview:
Support KIMI K2 Thinking PTQ from the original int4 checkpoint.
Tested with transformers 4.57.1, compressed-tensors 0.12.0
The model weights are dequantized on the fly to save GPU memory
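A rough sketch of the on-the-fly dequantization idea mentioned above; this is not the PR's exact code, and the helper name `unpack_compressed_modules` is illustrative:

```python
# Sketch only: walk the model and decompress each packed module right before
# export/quantization, so the dense bf16 weights never have to be materialized
# all at once.
import torch.nn as nn


def unpack_compressed_modules(model: nn.Module) -> None:
    for module in model.modules():
        # Only CompressedLinear-style modules carry packed int4 buffers.
        if hasattr(module, "weight_packed") and hasattr(module, "unpack_weight"):
            module.unpack_weight()  # restores a dense `weight`, drops packed buffers
```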
Usage
scripts/huggingface_example.sh --model --quant nvfp4_mlp_only --trust_remote_code
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements