
Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Dec 9, 2025

What does this PR do?

Type of change: new feature

Overview:

Support KIMI K2 Thinking PTQ from the original int4 checkpoint.
Tested with transformers 4.57.1, compressed-tensors 0.12.0

The model weights are dequantized on the fly to save GPU memory
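Illustrative sketch only, not the PR's actual code: the idea is to keep just the packed int4 tensors resident and materialize a dense weight only for the operation that needs it. The decompress() call below is a hypothetical stand-in for the checkpoint compressor's API (compressed-tensors exposes a decompress_module() along these lines).

    import torch
    import torch.nn.functional as F

    def linear_with_transient_dequant(module, x):
        # Hypothetical helper: module.decompress() stands in for whatever the
        # checkpoint's compressor provides.
        weight = module.decompress()            # temporary dense bf16 copy
        out = F.linear(x, weight, getattr(module, "bias", None))
        del weight                              # drop the dense copy right away
        torch.cuda.empty_cache()
        return out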

Usage

scripts/huggingface_example.sh --model --quant nvfp4_mlp_only --trust_remote_code

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes

Additional Information

Summary by CodeRabbit

  • New Features

    • Support for nvfp4_mlp_only quantization format, enabling new layer-wise quantization options
    • Quantization support for CompressedLinear layers in quantized models
  • Improvements

    • Enhanced quantization for DeepSeek models with improved attention configuration handling
    • Optimized model loading with automatic precision configuration and weight unpacking
    • Better memory management during model export with automatic cache cleanup
    • Conditional sample generation output controlled via verbose mode


@copy-pr-bot

copy-pr-bot bot commented Dec 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Contributor

@Edwardf0t1 Edwardf0t1 left a comment


Is the generated ckpt identical to the nvfp4 ckpt @jingyu-ml previously generated?

pass

try:
    from compressed_tensors.linear.compressed_linear import CompressedLinear
Contributor


Should we add compressed-tensors as an optional dependency?

Collaborator Author


@kevalmorabia97 @realAsma what do you think?

Collaborator


If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed? What benefit do we get from adding it as an optional dependency?

Collaborator


compressed-tensors' main dependencies are torch and transformers, so it should be pretty lightweight to add as a dependency; fine if you want to add it. But if it's not commonly used by customers, perhaps we can skip it.
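For reference, a minimal sketch of the guarded-import pattern being discussed; the actual register() call is omitted here because its signature in modelopt is not reproduced, and the flag name is just a placeholder.

    try:
        from compressed_tensors.linear.compressed_linear import CompressedLinear

        HAS_COMPRESSED_TENSORS = True
    except ImportError:
        HAS_COMPRESSED_TENSORS = False

    if HAS_COMPRESSED_TENSORS:
        # Register the quantized replacement only when the optional dependency is
        # present, so users without compressed-tensors are unaffected.
        pass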

Contributor


Can we move this to a separate file modelopt/torch/quantization/plugins/compressed_tensor.py?

Contributor


If a user is quantizing a model with CompressedLinear, wouldn't they already have compressed-tensors pre-installed?

This is a good point. +1
Are we planning to have any unit tests for the compressed-tensors integration?

Collaborator Author


Not right now.

Collaborator Author


Can we move this to a separate file modelopt/torch/quantization/plugins/compressed_tensor.py?

How strongly do you feel about it? Right now I feel this still falls under the HF plugins, as it's part of the HF invocation.

Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv marked this pull request as ready for review December 9, 2025 17:29
@cjluo-nv cjluo-nv requested review from a team as code owners December 9, 2025 17:29
@cjluo-nv cjluo-nv requested a review from meenchen December 9, 2025 17:29
Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv requested a review from meenchen December 10, 2025 00:12
@Edwardf0t1
Contributor

Edwardf0t1 commented Dec 11, 2025

@cjluo-nv Did we run any deployment and accuracy test for the ckpt generated with this flow to make sure it's correct? Asking because there's a customer who wants to generate the ckpt by themselves.

In addition, I heard from @jingyu-ml that we need to modify modeling_deepseek.py to enable our PTQ flow.

@cjluo-nv
Collaborator Author

@cjluo-nv Did we run any deployment and accuracy test for the ckpt generated with this flow to make sure it's correct? Asking because there's a customer who wants to generate the ckpt by themselves.

In addition, I heard from @jingyu-ml that we need to modify modeling_deepseek.py to enable our PTQ flow.

with transformers 4.48, we don't need to modify the original model.

I have not had a chance to validate the checkpoint yet. I will probably continue after the Christmas break. @Edwardf0t1, is this an urgent request?

  # When computing the device_map, assuming bfloat16 precision by default,
  # unless specified by the hf_config.
- torch_dtype = getattr(hf_config, "torch_dtype", torch.float16)
+ torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
Contributor


This change is suspicious to me.

Won't this impact other models?

Collaborator Author


Most likely not. The torch_dtype should be set for most model cards.
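For context, the fallback under discussion boils down to the following (sketch with a placeholder path, not the PR's exact surrounding code):

    import torch
    from transformers import AutoConfig

    ckpt_path = "path/to/checkpoint"  # placeholder
    hf_config = AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)
    # Model cards that omit torch_dtype now fall back to bfloat16 instead of float16.
    torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)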

Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv requested a review from a team as a code owner January 5, 2026 23:50
@cjluo-nv cjluo-nv requested a review from sugunav14 January 5, 2026 23:50
Signed-off-by: Chenjie Luo <[email protected]>
@janekl

janekl commented Jan 7, 2026

Tested with transformers 4.48

Hello @cjluo-nv. Please specify exactly which transformers version you have tested this with. I've just used 4.48.0 and quickly hit some import issues (with the transformers.pytorch_utils module). They are no longer present in 4.48.3 - I'm trying this version now.

Signed-off-by: Chenjie Luo <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Jan 14, 2026

📝 Walkthrough

Walkthrough

Changes extend quantization support for HuggingFace models with a new QuantCompressedLinear module, adjust model loading for pack-quantized format, disable deepseek-specific MLA quantization, refine export/cleanup procedures, and establish bfloat16 as the default precision instead of float16.

Changes

  • Quantization Module Extensions (modelopt/torch/quantization/plugins/huggingface.py, modelopt/torch/export/layer_utils.py, modelopt/torch/export/quant_utils.py): Added new _QuantCompressedLinear class with TensorQuantizer integration and weight decompression support; expanded is_quantlinear() to recognize QuantCompressedLinear in addition to QuantLinear; extended state dict key filtering to skip "weight_shape" entries during export processing.
  • Model Loading & Configuration (examples/llm_ptq/example_utils.py): Enabled pack-quantized format loading with auto device mapping and dtype from HF config; added deepseek-specific quantization adjustments to disable MLA quantization; changed default precision from torch.float16 to torch.bfloat16.
  • Quantization Pipeline (examples/llm_ptq/hf_ptq.py, examples/llm_ptq/scripts/huggingface_example.sh): Conditionalized post-quantization sample generation to only execute when verbose mode is enabled; expanded valid QFORMAT values to include nvfp4_mlp_only.
  • Export & Cleanup (modelopt/torch/export/unified_export_hf.py): Added explicit CUDA cache clearing after quantized weight export, weight unpacking for submodules with packed weights, and removal of the hf_quantizer attribute from the model before final save.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The pull request title directly describes the main change: adding support for post-training quantization (PTQ) of KIMI K2 Thinking int4 checkpoints, which aligns with the primary objective and the substantial code changes across quantization and export modules.
  • Description check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.



@cjluo-nv cjluo-nv force-pushed the chenjiel/kimik2-thinking branch from 5328ae9 to 66ac7a6 on January 15, 2026 00:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@examples/llm_ptq/example_utils.py`:
- Around line 354-361: The current check uses
hf_config.quantization_config.get("format", None) which can raise AttributeError
if quantization_config is None or not a dict; instead, defensively read
quant_config first (e.g., quant_config = getattr(hf_config,
"quantization_config", None) or {}), derive quant_format safely (use
dict.get("format") if isinstance(quant_config, dict) else getattr(quant_config,
"format", None)), then compare quant_format == "pack-quantized" before
constructing torch_dtype and calling AutoModelForCausalLM.from_pretrained with
device_map="auto", trust_remote_code and torch_dtype.
- Around line 190-194: The deepseek MLA-disable branch is unreachable because
the `if model_type == "deepseek"` block is nested inside the `if model_type in
["qwen3moe", "qwen3next"] and qformat == "nvfp4":` condition; pull the deepseek
logic out into its own top-level conditional (separate from the
qwen3moe/qwen3next+nvfp4 check) so it runs when `model_type == "deepseek"`, and
keep the existing modifications to `quant_cfg` that set
`quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}` and
`quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}` intact.

In `@examples/llm_ptq/hf_ptq.py`:
- Around line 618-636: Initialize generated_ids_after_ptq before the
verbose/conditional block so it always exists regardless of args.verbose; e.g.,
set generated_ids_after_ptq = None prior to any if args.verbose / generation
logic (the symbol to update is generated_ids_after_ptq and ensure any later
references to it after the verbose block handle the None case or are guarded),
leaving the rest of the generation branches (full_model.generate,
run_nemotron_vl_preview, and the Llama4 warning) unchanged.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 605-612: In unpack_weight, avoid unguarded deletions that can
raise AttributeError: only delete self.weight_packed and self.weight_scale if
they exist (e.g., check hasattr(self, "weight_packed") / hasattr(self,
"weight_scale")) or perform the deletes inside the branch where
self.quantization_status == QuantizationStatus.COMPRESSED after successful
decompression; reference the unpack_weight method and the
QuantizationStatus.COMPRESSED check to ensure deletions occur only when
appropriate.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 406c18c and 5328ae9.

📒 Files selected for processing (7)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (3)
modelopt/torch/export/unified_export_hf.py (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
  • unpack_weight (605-611)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
  • register (330-371)
examples/llm_ptq/hf_ptq.py (1)
examples/llm_ptq/example_utils.py (1)
  • run_nemotron_vl_preview (49-93)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (8)
modelopt/torch/export/layer_utils.py (1)

346-349: LGTM!

The updated is_quantlinear logic correctly extends detection to include QuantCompressedLinear while maintaining the existing QuantLinear detection and lora exclusion. The implementation aligns with the new _QuantCompressedLinear class added in the quantization plugins.
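A name-based sketch of that check (the real implementation in layer_utils.py may use different mechanics, e.g. isinstance tests):

    def is_quantlinear_sketch(module) -> bool:
        # Count both QuantLinear and the new QuantCompressedLinear wrappers as
        # quantized linear layers, while still excluding LoRA adapter modules.
        cls_name = type(module).__name__
        return ("QuantLinear" in cls_name or "QuantCompressedLinear" in cls_name) and (
            "lora" not in cls_name.lower()
        )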

modelopt/torch/quantization/plugins/huggingface.py (1)

689-697: LGTM!

The registration follows the established pattern used for other optional HuggingFace model types (Llama4, Dbrx, Mixtral, etc.) with proper ImportError handling for the optional compressed-tensors dependency.

modelopt/torch/export/quant_utils.py (1)

884-890: LGTM!

Adding "weight_shape" to the skip_keys set correctly filters out metadata attributes from CompressedLinear modules that shouldn't be included in the exported state dict. This aligns with the new _QuantCompressedLinear support.

modelopt/torch/export/unified_export_hf.py (3)

395-396: LGTM with a note on performance.

The explicit CUDA cache clearing helps prevent OOM during export of large models. Note that calling empty_cache() after every layer can add some overhead for models with many layers, but the memory safety benefit likely outweighs this for the target use case (KIMI K2 which is a large MoE model).


523-525: LGTM!

The weight unpacking step correctly prepares CompressedLinear modules for export by decompressing packed weights before quantization processing. The hasattr guard ensures this only applies to modules with the weight_packed attribute.
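Conceptually the step amounts to a loop like the following (sketch; model comes from the surrounding export flow):

    for _, submodule in model.named_modules():
        # Only CompressedLinear-style modules carry packed int4 weights; the
        # hasattr guard leaves every other module untouched.
        if hasattr(submodule, "weight_packed") and hasattr(submodule, "unpack_weight"):
            submodule.unpack_weight()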


597-600: LGTM!

Removing the hf_quantizer attribute before calling save_pretrained ensures that HuggingFace's serialization doesn't interfere with the custom quantized state dict export. The getattr with default None safely handles models without this attribute.
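Roughly (sketch, with model taken from the surrounding export flow and an example output directory):

    export_dir = "exported_model"  # example output directory
    # Detach HF's quantizer hook so save_pretrained() writes the ModelOpt-exported
    # state dict instead of re-applying compressed-tensors serialization.
    if getattr(model, "hf_quantizer", None) is not None:
        del model.hf_quantizer
    model.save_pretrained(export_dir)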

examples/llm_ptq/scripts/huggingface_example.sh (1)

56-58: LGTM!

The nvfp4_mlp_only format is correctly added to both the validation list and the corresponding error message, maintaining consistency with the new quantization format support in hf_ptq.py.

examples/llm_ptq/example_utils.py (1)

382-388: Default dtype change from float16 to bfloat16 looks good.

This is a sensible default for modern GPUs (Ampere and newer) with better numerical stability. The getattr pattern preserves model-specified dtypes when available.

Note: Pre-Ampere GPUs may have limited bfloat16 support, but this should be fine as calibration workloads typically target newer hardware.
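If pre-Ampere GPUs ever need to be covered, a guard along these lines (not part of the PR) could pick float16 instead:

    import torch

    # Fall back to float16 on GPUs without native bfloat16 support (pre-Ampere).
    torch_dtype = torch.bfloat16
    if torch.cuda.is_available() and not torch.cuda.is_bf16_supported():
        torch_dtype = torch.float16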


Comment on lines 190 to 194
if model_type == "deepseek":
# Disable MLA quantization for accuracy.
quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}

Contributor


⚠️ Potential issue | 🔴 Critical

Deepseek MLA disable logic is unreachable due to incorrect nesting.

The if model_type == "deepseek" block is nested inside the if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4": condition (line 183). Since model_type cannot simultaneously be "deepseek" and in ["qwen3moe", "qwen3next"], this code will never execute.

🐛 Proposed fix - move to a separate conditional block
     if model_type in ["qwen3moe", "qwen3next"] and qformat == "nvfp4":
         # Disable the attention projection layers to retain accuracy
         quant_cfg["quant_cfg"]["model*.*attn*in_proj*"] = {"enable": False}
         quant_cfg["quant_cfg"]["model*.*attn*q_proj*"] = {"enable": False}
         quant_cfg["quant_cfg"]["model*.*attn*k_proj*"] = {"enable": False}
         quant_cfg["quant_cfg"]["model*.*attn*v_proj*"] = {"enable": False}

-        if model_type == "deepseek":
-            # Disable MLA quantization for accuracy.
-            quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
+    if model_type == "deepseek":
+        # Disable MLA quantization for accuracy.
+        quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
+        quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}

     return quant_cfg
🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 190 - 194, The deepseek
MLA-disable branch is unreachable because the `if model_type == "deepseek"`
block is nested inside the `if model_type in ["qwen3moe", "qwen3next"] and
qformat == "nvfp4":` condition; pull the deepseek logic out into its own
top-level conditional (separate from the qwen3moe/qwen3next+nvfp4 check) so it
runs when `model_type == "deepseek"`, and keep the existing modifications to
`quant_cfg` that set `quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable":
False}` and `quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}`
intact.

Comment on lines 354 to 361
        elif hf_config.quantization_config.get("format", None) == "pack-quantized":
            torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
            model = AutoModelForCausalLM.from_pretrained(
                ckpt_path,
                device_map="auto",
                trust_remote_code=trust_remote_code,
                torch_dtype=torch_dtype,
            )
Contributor


⚠️ Potential issue | 🟠 Major

Potential AttributeError if quantization_config is not present or not a dict.

hf_config.quantization_config.get("format", None) will raise an AttributeError if quantization_config is None or doesn't exist on the config. Consider using getattr with a safe fallback.

🐛 Proposed fix
-        elif hf_config.quantization_config.get("format", None) == "pack-quantized":
+        elif getattr(hf_config, "quantization_config", None) and hf_config.quantization_config.get("format", None) == "pack-quantized":
             torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
             model = AutoModelForCausalLM.from_pretrained(
                 ckpt_path,
                 device_map="auto",
                 trust_remote_code=trust_remote_code,
                 torch_dtype=torch_dtype,
             )

Alternatively, use a more defensive pattern:

        elif getattr(getattr(hf_config, "quantization_config", {}), "get", lambda k, d: d)("format", None) == "pack-quantized":

Or extract the check:

        quant_config = getattr(hf_config, "quantization_config", None) or {}
        quant_format = quant_config.get("format") if isinstance(quant_config, dict) else getattr(quant_config, "format", None)
        ...
        elif quant_format == "pack-quantized":
🤖 Prompt for AI Agents
In `@examples/llm_ptq/example_utils.py` around lines 354 - 361, The current check
uses hf_config.quantization_config.get("format", None) which can raise
AttributeError if quantization_config is None or not a dict; instead,
defensively read quant_config first (e.g., quant_config = getattr(hf_config,
"quantization_config", None) or {}), derive quant_format safely (use
dict.get("format") if isinstance(quant_config, dict) else getattr(quant_config,
"format", None)), then compare quant_format == "pack-quantized" before
constructing torch_dtype and calling AutoModelForCausalLM.from_pretrained with
device_map="auto", trust_remote_code and torch_dtype.

Comment on lines 618 to 636
        # Run some samples
        torch.cuda.empty_cache()
        generated_ids_after_ptq = None
        if model_type != "llama4" and not is_nemotron_vl_model:
            # Our fake quantizer may not be fully compatible with torch.compile.
            generated_ids_after_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
        elif is_nemotron_vl_model and tokenizer is not None:
            generated_ids_after_ptq = run_nemotron_vl_preview(
                full_model,
                tokenizer,
                preview_input_ids,
                args.pyt_ckpt_path,
                "after quantization",
                allow_fallback=False,
            )
        else:
            warnings.warn(
                "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
            )
Contributor


⚠️ Potential issue | 🔴 Critical

Potential NameError when verbose=False.

The variable generated_ids_after_ptq is only defined inside the if args.verbose: block (line 620), but it's referenced at line 661 outside of any conditional. When --no-verbose is passed, this will raise a NameError.

🐛 Proposed fix

Initialize generated_ids_after_ptq before the verbose block:

+    generated_ids_after_ptq = None
     if args.verbose:
         mtq.print_quant_summary(full_model)
 
         # Run some samples
         torch.cuda.empty_cache()
-        generated_ids_after_ptq = None
         if model_type != "llama4" and not is_nemotron_vl_model:
🤖 Prompt for AI Agents
In `@examples/llm_ptq/hf_ptq.py` around lines 618 - 636, Initialize
generated_ids_after_ptq before the verbose/conditional block so it always exists
regardless of args.verbose; e.g., set generated_ids_after_ptq = None prior to
any if args.verbose / generation logic (the symbol to update is
generated_ids_after_ptq and ensure any later references to it after the verbose
block handle the None case or are guarded), leaving the rest of the generation
branches (full_model.generate, run_nemotron_vl_preview, and the Llama4 warning)
unchanged.

Comment on lines 605 to 612
    def unpack_weight(self):
        from compressed_tensors.quantization import QuantizationStatus

        if self.quantization_status == QuantizationStatus.COMPRESSED:
            self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
        del self.weight_packed
        del self.weight_scale

Contributor


⚠️ Potential issue | 🟡 Minor

Guard against missing attributes in unpack_weight.

The del self.weight_packed and del self.weight_scale statements are executed unconditionally, but these attributes may not exist if quantization_status is not COMPRESSED or if they were already deleted. This could raise an AttributeError.

🐛 Proposed fix
     def unpack_weight(self):
         from compressed_tensors.quantization import QuantizationStatus
 
         if self.quantization_status == QuantizationStatus.COMPRESSED:
             self.weight = nn.Parameter(self.compressor.decompress_module(self), requires_grad=False)
-        del self.weight_packed
-        del self.weight_scale
+        if hasattr(self, "weight_packed"):
+            del self.weight_packed
+        if hasattr(self, "weight_scale"):
+            del self.weight_scale
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 605 - 612,
In unpack_weight, avoid unguarded deletions that can raise AttributeError: only
delete self.weight_packed and self.weight_scale if they exist (e.g., check
hasattr(self, "weight_packed") / hasattr(self, "weight_scale")) or perform the
deletes inside the branch where self.quantization_status ==
QuantizationStatus.COMPRESSED after successful decompression; reference the
unpack_weight method and the QuantizationStatus.COMPRESSED check to ensure
deletions occur only when appropriate.

@cjluo-nv cjluo-nv force-pushed the chenjiel/kimik2-thinking branch from 66ac7a6 to e8b7fc6 on January 15, 2026 00:02
@codecov

codecov bot commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.23%. Comparing base (406c18c) to head (aa7f8be).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #669   +/-   ##
=======================================
  Coverage   74.23%   74.23%           
=======================================
  Files         192      192           
  Lines       19033    19033           
=======================================
  Hits        14129    14129           
  Misses       4904     4904           


Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv
Collaborator Author

Tested with transformers 4.48

Hello @cjluo-nv. Please specify exactly which transformers version you have tested this with. I've just used 4.48.0 and quickly hit some import issues (with the transformers.pytorch_utils module). They are no longer present in 4.48.3 - I'm trying this version now.

I have updated that in the PR description.

@cjluo-nv cjluo-nv enabled auto-merge (squash) January 15, 2026 19:10