Add .generate() function with KV cache support for Nemotron v3 #1332

Draft
pzelasko wants to merge 2 commits into main from feat/nemotron-generate-support

Conversation

@pzelasko
Contributor

What does this PR do ?

Add .generate() function with KV cache support for Nemotron v3.

Changelog

  • Add .generate() function with KV cache support for Nemotron v3
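
For reference, a minimal usage sketch of the new API. The import path and checkpoint id below are placeholders, not pinned down by this PR; only the `.generate()` call pattern follows from the changelog.

```python
# Hedged usage sketch: "path/to/nemotron-checkpoint" and the import location
# of NemotronHForCausalLM are hypothetical placeholders.
import torch
from transformers import AutoTokenizer

from modeling_nemotron_h import NemotronHForCausalLM  # hypothetical import path

tokenizer = AutoTokenizer.from_pretrained("path/to/nemotron-checkpoint")
model = NemotronHForCausalLM.from_pretrained("path/to/nemotron-checkpoint")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    # Standard GenerationMixin entry point exposed by this PR; greedy decoding
    # here, but sampling and beam search work through the usual kwargs.
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```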

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

pzelasko and others added 2 commits February 18, 2026 12:17
Adds GenerationMixin to NemotronHForCausalLM so the model exposes the
standard HuggingFace .generate() API (greedy, beam search, sampling, etc.).

Key changes:
- Inherit from transformers.generation.GenerationMixin
- Set _is_stateful=True to prevent DynamicCache creation (hybrid
  Mamba2/Attention architecture cannot use a standard KV cache)
- Add device property required by GenerationMixin
- Initialize self.generation_config in __init__
- Update forward() to return CausalLMOutputWithPast (logits, optional
  loss, past_key_values=None) and accept all standard CausalLM params
  (inputs_embeds, labels, use_cache, cache_position, position_ids,
  logits_to_keep)
- Override prepare_inputs_for_generation() to pass the full accumulated
  sequence each step (no cache slicing; correct but O(n) per token)
- Update tests to use output.logits instead of raw tensor return value
- Add 7 new tests covering GenerationMixin integration and .generate()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
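
To make the commit body above concrete, here is a minimal sketch of the described wiring. Everything below is illustrative: the backbone stand-in, config fields, and method bodies are assumptions, not the PR's diff; only the GenerationMixin hooks named in the list above are taken from it.

```python
# Illustrative sketch of the GenerationMixin integration described above.
# The backbone is a stand-in; the real hybrid Mamba2/attention stack lives
# elsewhere in the repo.
import torch
from transformers import GenerationConfig
from transformers.generation import GenerationMixin
from transformers.modeling_outputs import CausalLMOutputWithPast


class NemotronHForCausalLMSketch(torch.nn.Module, GenerationMixin):
    # Mark the model as stateful so GenerationMixin does not build a
    # DynamicCache (the hybrid Mamba2/attention stack cannot use one).
    _is_stateful = True

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embed_tokens = torch.nn.Embedding(config.vocab_size, config.hidden_size)
        self.backbone = torch.nn.Identity()  # stand-in for the hybrid stack
        self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        # GenerationMixin reads generation defaults from this attribute.
        self.generation_config = GenerationConfig()

    @property
    def device(self):
        # Required by GenerationMixin to place newly sampled token ids.
        return next(self.parameters()).device

    def forward(self, input_ids=None, inputs_embeds=None, labels=None,
                use_cache=None, cache_position=None, position_ids=None,
                logits_to_keep=0, **kwargs):
        # Accept the full standard CausalLM signature; several arguments are
        # taken only for API compatibility in this sketch.
        hidden = inputs_embeds if inputs_embeds is not None else self.embed_tokens(input_ids)
        hidden = self.backbone(hidden)
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            # Standard causal-LM shift: position t predicts token t+1.
            loss = torch.nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
            )
        return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=None)

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        # No cache slicing: re-run the whole accumulated sequence each step.
        # Correct, but O(n) work per generated token, as the commit notes.
        return {"input_ids": input_ids}
```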
…orCausalLM

Add NemotronHybridCache managing attention KV cache and Mamba2 conv/SSM
state, enabling efficient autoregressive generation. Mamba2 mixer now
supports three code paths: fused training, unfused prefill with cache
init, and single-step decode with state update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
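
As a rough illustration of the cache structure this commit describes, the sketch below separates the two kinds of per-layer state: growing attention K/V on one side, fixed-size Mamba2 conv/SSM state on the other. All names and shapes are assumptions; the actual NemotronHybridCache will differ.

```python
# Illustrative sketch of a hybrid attention + Mamba2 cache. Field names,
# shapes, and layer indexing are assumptions, not the PR's NemotronHybridCache.
from dataclasses import dataclass, field

import torch


@dataclass
class HybridCacheSketch:
    # Attention layers: K/V tensors grown along the sequence dimension.
    key_cache: dict = field(default_factory=dict)
    value_cache: dict = field(default_factory=dict)
    # Mamba2 layers: rolling conv window and recurrent SSM state, fixed size.
    conv_states: dict = field(default_factory=dict)
    ssm_states: dict = field(default_factory=dict)

    def update_attention(self, layer_idx, key, value):
        # Prefill initializes the entry; each decode step concatenates one new
        # position onto the cached sequence (dim=-2 is the seq dimension for
        # [batch, heads, seq, head_dim] tensors).
        if layer_idx in self.key_cache:
            key = torch.cat([self.key_cache[layer_idx], key], dim=-2)
            value = torch.cat([self.value_cache[layer_idx], value], dim=-2)
        self.key_cache[layer_idx] = key
        self.value_cache[layer_idx] = value
        return key, value

    def update_mamba(self, layer_idx, conv_state, ssm_state):
        # Mamba2 state does not grow with sequence length: the unfused prefill
        # path initializes it, and each single-step decode overwrites it in
        # place, matching the three code paths listed above.
        self.conv_states[layer_idx] = conv_state
        self.ssm_states[layer_idx] = ssm_state
```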
@copy-pr-bot

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
