Add .generate() function with KV cache support for Nemotron v3 #1332

Draft
pzelasko wants to merge 2 commits into main from feat/nemotron-generate-support

Conversation

@pzelasko
Contributor

What does this PR do ?

Add .generate() function with KV cache support for Nemotron v3.

Changelog

  • Add .generate() function with KV cache support for Nemotron v3
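
For reference, a minimal usage sketch of the new API. The import path and checkpoint id below are placeholders, not pinned down by this PR; only the `.generate()` call pattern follows from the changelog.

```python
# Hedged usage sketch: "path/to/nemotron-checkpoint" and the import location
# of NemotronHForCausalLM are hypothetical placeholders.
import torch
from transformers import AutoTokenizer

from modeling_nemotron_h import NemotronHForCausalLM  # hypothetical import path

tokenizer = AutoTokenizer.from_pretrained("path/to/nemotron-checkpoint")
model = NemotronHForCausalLM.from_pretrained("path/to/nemotron-checkpoint")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    # Standard GenerationMixin entry point exposed by this PR; greedy decoding
    # here, but sampling and beam search work through the usual kwargs.
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```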

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

pzelasko and others added 2 commits February 18, 2026 12:17
Adds GenerationMixin to NemotronHForCausalLM so the model exposes the
standard HuggingFace .generate() API (greedy, beam search, sampling, etc.).

Key changes:
- Inherit from transformers.generation.GenerationMixin
- Set _is_stateful=True to prevent DynamicCache creation (hybrid
  Mamba2/Attention architecture cannot use a standard KV cache)
- Add device property required by GenerationMixin
- Initialize self.generation_config in __init__
- Update forward() to return CausalLMOutputWithPast (logits, optional
  loss, past_key_values=None) and accept all standard CausalLM params
  (inputs_embeds, labels, use_cache, cache_position, position_ids,
  logits_to_keep)
- Override prepare_inputs_for_generation() to pass the full accumulated
  sequence each step (no cache slicing; correct but O(n) per token)
- Update tests to use output.logits instead of raw tensor return value
- Add 7 new tests covering GenerationMixin integration and .generate()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
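
To make the commit body above concrete, here is a minimal sketch of the described wiring. Everything below is illustrative: the backbone stand-in, config fields, and method bodies are assumptions, not the PR's diff; only the GenerationMixin hooks named in the list above are taken from it.

```python
# Illustrative sketch of the GenerationMixin integration described above.
# The backbone is a stand-in; the real hybrid Mamba2/attention stack lives
# elsewhere in the repo.
import torch
from transformers import GenerationConfig
from transformers.generation import GenerationMixin
from transformers.modeling_outputs import CausalLMOutputWithPast


class NemotronHForCausalLMSketch(torch.nn.Module, GenerationMixin):
    # Mark the model as stateful so GenerationMixin does not build a
    # DynamicCache (the hybrid Mamba2/attention stack cannot use one).
    _is_stateful = True

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embed_tokens = torch.nn.Embedding(config.vocab_size, config.hidden_size)
        self.backbone = torch.nn.Identity()  # stand-in for the hybrid stack
        self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        # GenerationMixin reads generation defaults from this attribute.
        self.generation_config = GenerationConfig()

    @property
    def device(self):
        # Required by GenerationMixin to place newly sampled token ids.
        return next(self.parameters()).device

    def forward(self, input_ids=None, inputs_embeds=None, labels=None,
                use_cache=None, cache_position=None, position_ids=None,
                logits_to_keep=0, **kwargs):
        # Accept the full standard CausalLM signature; several arguments are
        # taken only for API compatibility in this sketch.
        hidden = inputs_embeds if inputs_embeds is not None else self.embed_tokens(input_ids)
        hidden = self.backbone(hidden)
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            # Standard causal-LM shift: position t predicts token t+1.
            loss = torch.nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
            )
        return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=None)

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        # No cache slicing: re-run the whole accumulated sequence each step.
        # Correct, but O(n) work per generated token, as the commit notes.
        return {"input_ids": input_ids}
```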
…orCausalLM

Add NemotronHybridCache managing attention KV cache and Mamba2 conv/SSM
state, enabling efficient autoregressive generation. Mamba2 mixer now
supports three code paths: fused training, unfused prefill with cache
init, and single-step decode with state update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
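
As a rough illustration of the cache structure this commit describes, the sketch below separates the two kinds of per-layer state: growing attention K/V on one side, fixed-size Mamba2 conv/SSM state on the other. All names and shapes are assumptions; the actual NemotronHybridCache will differ.

```python
# Illustrative sketch of a hybrid attention + Mamba2 cache. Field names,
# shapes, and layer indexing are assumptions, not the PR's NemotronHybridCache.
from dataclasses import dataclass, field

import torch


@dataclass
class HybridCacheSketch:
    # Attention layers: K/V tensors grown along the sequence dimension.
    key_cache: dict = field(default_factory=dict)
    value_cache: dict = field(default_factory=dict)
    # Mamba2 layers: rolling conv window and recurrent SSM state, fixed size.
    conv_states: dict = field(default_factory=dict)
    ssm_states: dict = field(default_factory=dict)

    def update_attention(self, layer_idx, key, value):
        # Prefill initializes the entry; each decode step concatenates one new
        # position onto the cached sequence (dim=-2 is the seq dimension for
        # [batch, heads, seq, head_dim] tensors).
        if layer_idx in self.key_cache:
            key = torch.cat([self.key_cache[layer_idx], key], dim=-2)
            value = torch.cat([self.value_cache[layer_idx], value], dim=-2)
        self.key_cache[layer_idx] = key
        self.value_cache[layer_idx] = value
        return key, value

    def update_mamba(self, layer_idx, conv_state, ssm_state):
        # Mamba2 state does not grow with sequence length: the unfused prefill
        # path initializes it, and each single-step decode overwrites it in
        # place, matching the three code paths listed above.
        self.conv_states[layer_idx] = conv_state
        self.ssm_states[layer_idx] = ssm_state
```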
@copy-pr-bot

copy-pr-bot bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
