
feat: Nano v3 RL Recipe #1989

Merged

terrykong merged 7 commits into main from yifu/nano-v3-config on Feb 21, 2026
Conversation

@yfw
Contributor

@yfw yfw commented Feb 18, 2026

What does this PR do?

Adds a GRPO training recipe for Nano v3 (examples/nemo_gym/grpo_nanov3.yaml), expands the GRPO training-log payload with per-sample signals, tensorizes rollout data, and updates the generation backend's tokenizer-initialization logic for the vLLM HTTP server.

Issues

None listed.

Usage

  • A possible invocation is sketched below; the exact GRPO runner entry point is not named in this PR.
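```bash
# Hypothetical usage sketch: `examples/run_grpo.py` is a placeholder name,
# not a script confirmed by this PR; substitute the repo's actual GRPO
# runner. Only the config path comes from this PR.
uv run python examples/run_grpo.py --config examples/nemo_gym/grpo_nanov3.yaml
```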

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you have read and followed the Contributor guidelines.
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added comprehensive GRPO training configuration example for Nano v3 with advanced rollout settings, evaluation scheduling, and baseline management options.
    • Enhanced training logging with expanded data payloads capturing advantages, generation log probabilities, and per-token loss metrics for deeper training visibility.
    • Improved rollout tracking with agent reference extraction and truncation detection across multiple rollout paths for better per-sample analysis.
    • Optimized generation backend configuration for improved vLLM HTTP server compatibility.

yfw and others added 6 commits February 17, 2026 23:22
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw marked this pull request as ready for review February 20, 2026 18:35
@yfw yfw requested review from a team as code owners February 20, 2026 18:35
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Feb 20, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

📝 Walkthrough

This PR adds a new YAML configuration for GRPO Nano v3 training, enriches GRPO algorithm logging with additional training signals (agent_ref, advantages, token losses), tensorizes rollout data for downstream consistency, and updates the generation initialization logic to consider HTTP server exposure settings.

Changes

Configuration — examples/nemo_gym/grpo_nanov3.yaml
New comprehensive GRPO training configuration defining rollout settings, loss-function parameters, policy/model configuration, Megatron-tuned training settings, dynamic batching, the vLLM generation backend, data sources, the NeMo Gym environment, logging, and cluster specifications.

Algorithm Logging — nemo_rl/algorithms/grpo.py
Expanded the log_data payload to include agent_ref, token_ids, token_loss_mask, sample_loss_mask, advantages, and logprob fields. Extended the flat_messages lifecycle to persist through logging operations for richer training-signal capture.

Rollout Processing — nemo_rl/experience/rollouts.py
Introduced a _tensorize_by_key() helper that converts message-log keys to torch tensors. Enhanced both async rollout paths to extract and propagate the agent_ref field, compute a truncated field from hit_max_tokens metrics, and tensorize token_ids across input, generation, and message logs.

Generation Configuration — nemo_rl/models/generation/__init__.py
Updated the skip_tokenizer_init logic to additionally trigger tokenizer initialization when vllm_cfg.expose_http_server is present, alongside the existing is_eval and stop_strings conditions (see the sketch below).
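A minimal sketch of that condition: the names is_eval, stop_strings, and vllm_cfg come from the summary above, but the function shape is illustrative, not the actual code in nemo_rl/models/generation/__init__.py.

```python
def should_skip_tokenizer_init(is_eval: bool, stop_strings, vllm_cfg: dict) -> bool:
    # The tokenizer must be initialized for eval runs, for stop-string
    # handling, and (new in this PR) whenever the vLLM HTTP server is
    # exposed, since serving HTTP requests requires a tokenizer.
    needs_tokenizer = (
        is_eval
        or bool(stop_strings)
        or bool(vllm_cfg.get("expose_http_server"))
    )
    return not needs_tokenizer
```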

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning. The PR introduces a major GRPO training feature but lacks documentation, test results, and performance metrics, and contains unfixed critical configuration bugs. Resolution: fix the critical bugs (async_engine: true, max_val_samples: 1024), provide a complete PR description with test results and performance metrics, and confirm end-to-end testing.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title "feat: Nano v3 RL Recipe" directly and specifically describes the main change: a new Nano v3 recipe for RL training (a new YAML config plus supporting rollout and generation changes).
  • Docstring Coverage — ✅ Passed. Docstring coverage is 80.00%, which meets the required 80.00% threshold.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
nemo_rl/algorithms/grpo.py (1)

1964-1978: Logging inconsistency between sync and async GRPO paths.

The sync grpo_train path now logs agent_ref, token_ids, token_loss_mask, sample_loss_mask, and advantages, but the async async_grpo_train path (lines 2983–2990) does not include these fields. If parity is desired for debugging, consider aligning them.
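A minimal sketch of that alignment, using the key and variable names from the prompt below (repeated_batch and train_data are assumed to be in scope in async_grpo_train; this is not the actual patch):

```python
# Sketch only: mirror the sync-path fields in async_grpo_train's log_data.
if "agent_ref" in repeated_batch:
    log_data["agent_ref"] = repeated_batch["agent_ref"]
for key in ("token_ids", "token_loss_mask", "sample_loss_mask", "advantages"):
    log_data[key] = train_data[key].tolist()  # .tolist() matches the sync path
```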

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In nemo_rl/algorithms/grpo.py around lines 1964–1978: the async_grpo_train
logging is missing several fields present in the sync grpo_train path; update
async_grpo_train to add the same keys into its log_data: include "agent_ref"
from repeated_batch (if present), and copy "token_ids", "token_loss_mask",
"sample_loss_mask", and "advantages" from train_data (using .tolist() like the
sync path), ensuring the same key names and any dynamic-sampling overrides
(e.g., filtered_rewards/rewards) are handled identically so both paths produce
equivalent logs.
examples/nemo_gym/grpo_nanov3.yaml (1)

82-94: Misleading indentation on commented-out config lines.

Lines 83 and 94 are commented-out alternatives indented as if they are children of their preceding keys (bias_activation_fusion and freeze_moe_router respectively). This is confusing for readers. Consider aligning them at the same level as their siblings or adding a brief note about why they're commented out.

Suggested alignment:

```diff
     bias_activation_fusion: False
-      # converter_type: "Qwen2ForCausalLM"
+    # converter_type: "Qwen2ForCausalLM"  # Not needed for this model
     tensor_model_parallel_size: 2
 ...
     freeze_moe_router: true
-      # moe_router_dtype: "fp64"
+    # moe_router_dtype: "fp64"  # fp32 used instead for stability
     moe_router_dtype: "fp32"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In examples/nemo_gym/grpo_nanov3.yaml around lines 82–94: the commented-out
config keys (e.g., the commented converter_type near bias_activation_fusion and
the commented moe_router_dtype near freeze_moe_router) are indented as if they
are children of the preceding keys which is misleading; move those commented
lines (converter_type and moe_router_dtype) to the same indentation level as the
other top-level keys (match indentation of tensor_model_parallel_size,
pipeline_dtype, etc.) or add a short inline comment explaining why they're
disabled so readers know they are alternative top-level settings rather than
nested properties.
nemo_rl/experience/rollouts.py (1)

982-988: _tensorize_by_key only checks the first message for key presence.

If some messages in the list lack key while the first one has it, this will raise a KeyError. Consider using m.get(key) or checking each message. In the current call sites this appears safe (all messages in a filtered list should have the key), but the helper is generic enough to surprise future callers.

Also, torch.as_tensor avoids an unnecessary copy when the value is already a tensor, which is slightly more defensive.

Proposed defensive variant:

```diff
 def _tensorize_by_key(message_logs: list, key: str):
     if not message_logs or key not in message_logs[0]:
         return
 
     for m in message_logs:
-        m[key] = torch.tensor(m[key])
+        if key in m:
+            m[key] = torch.as_tensor(m[key])
```
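For illustration, a hypothetical call showing the defensive behavior (this example is not from the PR):

```python
import torch

msgs = [{"token_ids": [1, 2, 3]}, {"role": "system"}]  # second entry lacks the key
_tensorize_by_key(msgs, "token_ids")  # converts only where the key is present
assert isinstance(msgs[0]["token_ids"], torch.Tensor)
assert "token_ids" not in msgs[1]  # skipped rather than raising KeyError
```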
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In nemo_rl/experience/rollouts.py around lines 982–988: the helper
_tensorize_by_key currently only checks key existence on message_logs[0] and
then assumes all messages have that key; change it to iterate messages and for
each message check for the key (e.g., if key in m or m.get(key) is not None)
before converting, skipping messages that lack it to avoid KeyError, and use
torch.as_tensor instead of torch.tensor to avoid unnecessary copies when the
value is already a tensor; update the function body to perform per-message
presence check and conversion using torch.as_tensor(m[key]).
🤖 Prompt for AI Agents (actionable inline comment)
Verify the finding against the current code and only fix it if needed.

In examples/nemo_gym/grpo_nanov3.yaml:
- Line 191: The config is invalid: _should_use_nemo_gym in grpo.py asserts
_should_use_async_rollouts (which requires vllm_cfg.async_engine == True), so
either set vllm_cfg.async_engine to true (change async_engine: false → true) or
disable NeMo Gym (set should_use_nemo_gym: false). Additionally, avoid a null
max_val_samples: set grpo.max_val_samples to a positive integer, or update
validation to handle null by providing a default (e.g., fall back to
val_batch_size or skip the division), so that
master_config["grpo"]["max_val_samples"] //
master_config["grpo"]["val_batch_size"] cannot operate on None.
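A minimal sketch of the two fixes; the nesting of vllm_cfg shown here is assumed rather than taken from the file, and 1024 is the value suggested by the failed pre-merge check above:

```yaml
grpo:
  max_val_samples: 1024   # was null; validation divides this by val_batch_size
policy:                    # nesting assumed for illustration
  generation:
    vllm_cfg:
      async_engine: true   # was false; required while should_use_nemo_gym is enabled
```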


Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 20, 2026
Contributor

@terrykong terrykong left a comment


lgtm

@terrykong terrykong enabled auto-merge (squash) February 20, 2026 21:10
@terrykong terrykong merged commit f11f99c into main Feb 21, 2026
69 of 75 checks passed
@terrykong terrykong deleted the yifu/nano-v3-config branch February 21, 2026 08:40