
feat: Nano v3 RL Recipe #1989

Merged

terrykong merged 7 commits into main from yifu/nano-v3-config on Feb 21, 2026
Conversation

@yfw
Contributor

@yfw yfw commented Feb 18, 2026

What does this PR do?

Adds a GRPO training recipe for Nano v3 (examples/nemo_gym/grpo_nanov3.yaml), expands the GRPO training-log payload with per-sample signals, tensorizes rollout data, and updates the generation backend's tokenizer-initialization logic for the vLLM HTTP server.

Issues

None listed.

Usage

  • A possible invocation is sketched below; the exact GRPO runner entry point is not named in this PR.
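```bash
# Hypothetical usage sketch: `examples/run_grpo.py` is a placeholder name,
# not a script confirmed by this PR; substitute the repo's actual GRPO
# runner. Only the config path comes from this PR.
uv run python examples/run_grpo.py --config examples/nemo_gym/grpo_nanov3.yaml
```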

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you have read and followed the Contributor guidelines.
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added comprehensive GRPO training configuration example for Nano v3 with advanced rollout settings, evaluation scheduling, and baseline management options.
    • Enhanced training logging with expanded data payloads capturing advantages, generation log probabilities, and per-token loss metrics for deeper training visibility.
    • Improved rollout tracking with agent reference extraction and truncation detection across multiple rollout paths for better per-sample analysis.
    • Optimized generation backend configuration for improved vLLM HTTP server compatibility.

yfw and others added 6 commits February 17, 2026 23:22
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Co-authored-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw marked this pull request as ready for review February 20, 2026 18:35
@yfw yfw requested review from a team as code owners February 20, 2026 18:35
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Feb 20, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

📝 Walkthrough

This PR adds a new YAML configuration for GRPO Nano v3 training, enriches GRPO algorithm logging with additional training signals (agent_ref, advantages, token losses), tensorizes rollout data for downstream consistency, and updates the generation initialization logic to consider HTTP server exposure settings.

Changes

Configuration — examples/nemo_gym/grpo_nanov3.yaml
New comprehensive GRPO training configuration defining rollout settings, loss-function parameters, policy/model configuration, Megatron-tuned training settings, dynamic batching, the vLLM generation backend, data sources, the NeMo Gym environment, logging, and cluster specifications.

Algorithm Logging — nemo_rl/algorithms/grpo.py
Expanded the log_data payload to include agent_ref, token_ids, token_loss_mask, sample_loss_mask, advantages, and logprob fields. Extended the flat_messages lifecycle to persist through logging operations for richer training-signal capture.

Rollout Processing — nemo_rl/experience/rollouts.py
Introduced a _tensorize_by_key() helper that converts message-log keys to torch tensors. Enhanced both async rollout paths to extract and propagate the agent_ref field, compute a truncated field from hit_max_tokens metrics, and tensorize token_ids across input, generation, and message logs.

Generation Configuration — nemo_rl/models/generation/__init__.py
Updated the skip_tokenizer_init logic to additionally trigger tokenizer initialization when vllm_cfg.expose_http_server is present, alongside the existing is_eval and stop_strings conditions (see the sketch below).
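A minimal sketch of that condition: the names is_eval, stop_strings, and vllm_cfg come from the summary above, but the function shape is illustrative, not the actual code in nemo_rl/models/generation/__init__.py.

```python
def should_skip_tokenizer_init(is_eval: bool, stop_strings, vllm_cfg: dict) -> bool:
    # The tokenizer must be initialized for eval runs, for stop-string
    # handling, and (new in this PR) whenever the vLLM HTTP server is
    # exposed, since serving HTTP requests requires a tokenizer.
    needs_tokenizer = (
        is_eval
        or bool(stop_strings)
        or bool(vllm_cfg.get("expose_http_server"))
    )
    return not needs_tokenizer
```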

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning. The PR introduces a major GRPO training feature but lacks documentation, test results, and performance metrics, and contains unfixed critical configuration bugs. Resolution: fix the critical bugs (async_engine: true, max_val_samples: 1024), provide a complete PR description with test results and performance metrics, and confirm end-to-end testing.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title "feat: Nano v3 RL Recipe" directly and specifically describes the main change: a new Nano v3 recipe for RL training (a new YAML config plus supporting rollout and generation changes).
  • Docstring Coverage — ✅ Passed. Docstring coverage is 80.00%, which meets the required 80.00% threshold.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
nemo_rl/algorithms/grpo.py (1)

1964-1978: Logging inconsistency between sync and async GRPO paths.

The sync grpo_train path now logs agent_ref, token_ids, token_loss_mask, sample_loss_mask, and advantages, but the async async_grpo_train path (lines 2983–2990) does not include these fields. If parity is desired for debugging, consider aligning them.
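A minimal sketch of that alignment, using the key and variable names from the prompt below (repeated_batch and train_data are assumed to be in scope in async_grpo_train; this is not the actual patch):

```python
# Sketch only: mirror the sync-path fields in async_grpo_train's log_data.
if "agent_ref" in repeated_batch:
    log_data["agent_ref"] = repeated_batch["agent_ref"]
for key in ("token_ids", "token_loss_mask", "sample_loss_mask", "advantages"):
    log_data[key] = train_data[key].tolist()  # .tolist() matches the sync path
```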

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In nemo_rl/algorithms/grpo.py around lines 1964–1978: the async_grpo_train
logging is missing several fields present in the sync grpo_train path; update
async_grpo_train to add the same keys into its log_data: include "agent_ref"
from repeated_batch (if present), and copy "token_ids", "token_loss_mask",
"sample_loss_mask", and "advantages" from train_data (using .tolist() like the
sync path), ensuring the same key names and any dynamic-sampling overrides
(e.g., filtered_rewards/rewards) are handled identically so both paths produce
equivalent logs.
examples/nemo_gym/grpo_nanov3.yaml (1)

82-94: Misleading indentation on commented-out config lines.

Lines 83 and 94 are commented-out alternatives indented as if they are children of their preceding keys (bias_activation_fusion and freeze_moe_router respectively). This is confusing for readers. Consider aligning them at the same level as their siblings or adding a brief note about why they're commented out.

Suggested alignment:

```diff
     bias_activation_fusion: False
-      # converter_type: "Qwen2ForCausalLM"
+    # converter_type: "Qwen2ForCausalLM"  # Not needed for this model
     tensor_model_parallel_size: 2
 ...
     freeze_moe_router: true
-      # moe_router_dtype: "fp64"
+    # moe_router_dtype: "fp64"  # fp32 used instead for stability
     moe_router_dtype: "fp32"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In examples/nemo_gym/grpo_nanov3.yaml around lines 82–94: the commented-out
config keys (e.g., the commented converter_type near bias_activation_fusion and
the commented moe_router_dtype near freeze_moe_router) are indented as if they
are children of the preceding keys which is misleading; move those commented
lines (converter_type and moe_router_dtype) to the same indentation level as the
other top-level keys (match indentation of tensor_model_parallel_size,
pipeline_dtype, etc.) or add a short inline comment explaining why they're
disabled so readers know they are alternative top-level settings rather than
nested properties.
nemo_rl/experience/rollouts.py (1)

982-988: _tensorize_by_key only checks the first message for key presence.

If some messages in the list lack key while the first one has it, this will raise a KeyError. Consider using m.get(key) or checking each message. In the current call sites this appears safe (all messages in a filtered list should have the key), but the helper is generic enough to surprise future callers.

Also, torch.as_tensor avoids an unnecessary copy when the value is already a tensor, which is slightly more defensive.

Proposed defensive variant:

```diff
 def _tensorize_by_key(message_logs: list, key: str):
     if not message_logs or key not in message_logs[0]:
         return
 
     for m in message_logs:
-        m[key] = torch.tensor(m[key])
+        if key in m:
+            m[key] = torch.as_tensor(m[key])
```
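For illustration, a hypothetical call showing the defensive behavior (this example is not from the PR):

```python
import torch

msgs = [{"token_ids": [1, 2, 3]}, {"role": "system"}]  # second entry lacks the key
_tensorize_by_key(msgs, "token_ids")  # converts only where the key is present
assert isinstance(msgs[0]["token_ids"], torch.Tensor)
assert "token_ids" not in msgs[1]  # skipped rather than raising KeyError
```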
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In nemo_rl/experience/rollouts.py around lines 982–988: the helper
_tensorize_by_key currently only checks key existence on message_logs[0] and
then assumes all messages have that key; change it to iterate messages and for
each message check for the key (e.g., if key in m or m.get(key) is not None)
before converting, skipping messages that lack it to avoid KeyError, and use
torch.as_tensor instead of torch.tensor to avoid unnecessary copies when the
value is already a tensor; update the function body to perform per-message
presence check and conversion using torch.as_tensor(m[key]).
🤖 Prompt for AI Agents (actionable inline comment)
Verify the finding against the current code and only fix it if needed.

In examples/nemo_gym/grpo_nanov3.yaml:
- Line 191: The config is invalid: _should_use_nemo_gym in grpo.py asserts
_should_use_async_rollouts (which requires vllm_cfg.async_engine == True), so
either set vllm_cfg.async_engine to true (change async_engine: false → true) or
disable NeMo Gym (set should_use_nemo_gym: false). Additionally, avoid a null
max_val_samples: set grpo.max_val_samples to a positive integer, or update
validation to handle null by providing a default (e.g., fall back to
val_batch_size or skip the division), so that
master_config["grpo"]["max_val_samples"] //
master_config["grpo"]["val_batch_size"] cannot operate on None.
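A minimal sketch of the two fixes; the nesting of vllm_cfg shown here is assumed rather than taken from the file, and 1024 is the value suggested by the failed pre-merge check above:

```yaml
grpo:
  max_val_samples: 1024   # was null; validation divides this by val_batch_size
policy:                    # nesting assumed for illustration
  generation:
    vllm_cfg:
      async_engine: true   # was false; required while should_use_nemo_gym is enabled
```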


Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 20, 2026
Contributor

@terrykong terrykong left a comment


lgtm

@terrykong terrykong enabled auto-merge (squash) February 20, 2026 21:10
@terrykong terrykong merged commit f11f99c into main Feb 21, 2026
69 of 75 checks passed
@terrykong terrykong deleted the yifu/nano-v3-config branch February 21, 2026 08:40