Conversation
📝 Walkthrough

This PR adds a new YAML configuration for GRPO Nano v3 training, enriches GRPO algorithm logging with additional training signals (agent_ref, advantages, token losses), tensorizes rollout data for downstream consistency, and updates the generation initialization logic to consider HTTP server exposure settings.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (3)
nemo_rl/algorithms/grpo.py (1)
1964-1978: Logging inconsistency between sync and async GRPO paths.

The sync `grpo_train` path now logs `agent_ref`, `token_ids`, `token_loss_mask`, `sample_loss_mask`, and `advantages`, but the async `async_grpo_train` path (lines 2983–2990) does not include these fields. If parity is desired for debugging, consider aligning them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/algorithms/grpo.py` around lines 1964 - 1978: the async_grpo_train logging is missing several fields present in the sync grpo_train path; update async_grpo_train to add the same keys into its log_data: include "agent_ref" from repeated_batch (if present), and copy "token_ids", "token_loss_mask", "sample_loss_mask", and "advantages" from train_data (using .tolist() like the sync path), ensuring the same key names and any dynamic-sampling overrides (e.g., filtered_rewards/rewards) are handled identically so both paths produce equivalent logs.
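A minimal sketch of the alignment described above, assuming `async_grpo_train` builds a `log_data` dict from `repeated_batch` and `train_data` the same way the sync path does (names taken from the prompt; verify against the actual code before applying):

```python
# Hedged sketch, not the actual implementation: mirror the sync
# grpo_train logging keys inside async_grpo_train's logging step.
if "agent_ref" in repeated_batch:
    log_data["agent_ref"] = repeated_batch["agent_ref"]

for key in ("token_ids", "token_loss_mask", "sample_loss_mask", "advantages"):
    # The sync path serializes tensors with .tolist() before logging.
    log_data[key] = train_data[key].tolist()
```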
examples/nemo_gym/grpo_nanov3.yaml (1)

82-94: Misleading indentation on commented-out config lines.

Lines 83 and 94 are commented-out alternatives indented as if they are children of their preceding keys (`bias_activation_fusion` and `freeze_moe_router` respectively). This is confusing for readers. Consider aligning them at the same level as their siblings or adding a brief note about why they're commented out.

Suggested alignment:

```diff
   bias_activation_fusion: False
-    # converter_type: "Qwen2ForCausalLM"
+  # converter_type: "Qwen2ForCausalLM"  # Not needed for this model
   tensor_model_parallel_size: 2
   ...
   freeze_moe_router: true
-    # moe_router_dtype: "fp64"
+  # moe_router_dtype: "fp64"  # fp32 used instead for stability
   moe_router_dtype: "fp32"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/nemo_gym/grpo_nanov3.yaml` around lines 82 - 94: the commented-out config keys (e.g., the commented converter_type near bias_activation_fusion and the commented moe_router_dtype near freeze_moe_router) are indented as if they are children of the preceding keys, which is misleading; move those commented lines (converter_type and moe_router_dtype) to the same indentation level as the other top-level keys (match indentation of tensor_model_parallel_size, pipeline_dtype, etc.) or add a short inline comment explaining why they're disabled so readers know they are alternative top-level settings rather than nested properties.
nemo_rl/experience/rollouts.py (1)

982-988: `_tensorize_by_key` only checks the first message for key presence.

If some messages in the list lack `key` while the first one has it, this will raise a `KeyError`. Consider using `m.get(key)` or checking each message. In the current call sites this appears safe (all messages in a filtered list should have the key), but the helper is generic enough to surprise future callers.

Also, `torch.as_tensor` avoids an unnecessary copy when the value is already a tensor, which would be slightly more defensive.

Proposed defensive variant:

```diff
 def _tensorize_by_key(message_logs: list, key: str):
     if not message_logs or key not in message_logs[0]:
         return
     for m in message_logs:
-        m[key] = torch.tensor(m[key])
+        if key in m:
+            m[key] = torch.as_tensor(m[key])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/experience/rollouts.py` around lines 982 - 988: the helper _tensorize_by_key currently only checks key existence on message_logs[0] and then assumes all messages have that key; change it to iterate messages and for each message check for the key (e.g., if key in m or m.get(key) is not None) before converting, skipping messages that lack it to avoid KeyError, and use torch.as_tensor instead of torch.tensor to avoid unnecessary copies when the value is already a tensor; update the function body to perform per-message presence check and conversion using torch.as_tensor(m[key]).
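A quick self-contained check of the no-copy claim (this is standard PyTorch behavior, not code from this PR):

```python
import torch

t = torch.arange(4)

# as_tensor returns the input object itself when dtype and device
# already match, so no data is copied.
assert torch.as_tensor(t) is t

# torch.tensor always constructs a new tensor (and emits a UserWarning
# when handed an existing tensor), so this round-trip yields a distinct
# object with equal contents.
copy = torch.tensor(t.tolist())
assert copy is not t and torch.equal(copy, t)
```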
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/nemo_gym/grpo_nanov3.yaml`:
- Line 191: The config is invalid: _should_use_nemo_gym in grpo.py asserts
_should_use_async_rollouts (which requires vllm_cfg.async_engine == True), so
either set vllm_cfg.async_engine to true (change async_engine: false → true) or
disable should_use_nemo_gym (set should_use_nemo_gym: false); additionally,
avoid max_val_samples being null — set grpo.max_val_samples to a positive
integer or update validation to handle null by providing a default (e.g., fall
back to val_batch_size or skip the division) so
master_config["grpo"]["max_val_samples"] //
master_config["grpo"]["val_batch_size"] cannot operate on None.
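A hedged sketch of the fixes this comment suggests, in config form. The exact nesting of `vllm_cfg` and the location of `should_use_nemo_gym` are assumptions based on the key names in the comment, and the numeric values are purely illustrative; verify against the actual file:

```yaml
policy:
  generation:
    vllm_cfg:
      async_engine: true        # was false; required while nemo_gym is enabled

grpo:
  should_use_nemo_gym: true     # alternatively set this to false and keep
                                # async_engine: false
  max_val_samples: 256          # illustrative; must be a positive integer, not
                                # null, because the code computes
                                # max_val_samples // val_batch_size
  val_batch_size: 32            # illustrative
```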
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information