Conversation
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Copyright year should be updated to 2026
The file is modified in this PR but the copyright header still reads 2025. As per coding guidelines, the NVIDIA copyright header should carry the current year (2026) for all modified source files.

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` at line 1, Update the copyright year in the file header of megatron_policy_worker.py from 2025 to 2026; locate the top-of-file copyright comment and replace the year so the NVIDIA copyright header reflects 2026, ensuring the header format remains identical otherwise.
🧹 Nitpick comments (2)
tests/unit/models/megatron/test_train.py (1)
681-709: `test_loss_post_processor_no_packing` doesn't exercise the new mcore counteraction path

The test uses `cp_normalize=False`, `num_microbatches=1` (default), and `cp_size=1` (mocked). The `_counteract_mcore_loss_averaging` scaling factor is 1 / 1 = 1 — a no-op — so the test cannot catch regressions in the actual counteraction logic. Consider adding a variant with `num_microbatches > 1` and/or `cp_size > 1`.

✅ Suggested additional test

```python
@patch("nemo_rl.models.megatron.train.get_tensor_model_parallel_rank", return_value=0)
@patch("nemo_rl.models.megatron.train.get_tensor_model_parallel_group")
@patch("nemo_rl.models.megatron.train.get_context_parallel_group")
@patch("nemo_rl.models.megatron.train.get_context_parallel_world_size", return_value=2)
def test_loss_post_processor_no_cp_normalize_mcore_scaling(
    self, mock_cp_size, mock_cp_grp, mock_tp_grp, mock_tp_rank
):
    """Test _counteract_mcore_loss_averaging with cp_normalize=False, num_microbatches>1."""
    from nemo_rl.models.megatron.train import LossPostProcessor

    mock_loss_fn = MagicMock(return_value=(torch.tensor(1.0), {}))
    cfg = {"sequence_packing": {"enabled": False}}
    mock_tp_grp.return_value = MagicMock()
    mock_cp_grp.return_value = MagicMock()

    # cp_size=2, num_microbatches=4 → scaling = 4/2 = 2.0
    processor = LossPostProcessor(
        loss_fn=mock_loss_fn, cfg=cfg, num_microbatches=4, cp_normalize=False
    )
    wrapped_fn = processor(data_dict=MagicMock())
    loss, _ = wrapped_fn(torch.randn(1))

    # Megatron will then apply * cp_size / num_microbatches = 2/4 = 0.5
    # net result = 2.0 * 0.5 = 1.0 (original) — verify pre-scaling value here
    assert torch.isclose(loss, torch.tensor(2.0))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/models/megatron/test_train.py` around lines 681 - 709, The current test test_loss_post_processor_no_packing doesn't exercise LossPostProcessor._counteract_mcore_loss_averaging because cp_size and num_microbatches default to 1; add a new unit test variant that patches get_context_parallel_world_size to return >1 (e.g., 2), sets num_microbatches >1 (e.g., 4) and cp_normalize=False, constructs LossPostProcessor(loss_fn=MagicMock(return_value=(torch.tensor(1.0), {})), cfg={"sequence_packing": {"enabled": False}}, num_microbatches=4, cp_normalize=False), calls the wrapped function and asserts the returned loss equals the expected pre-scaling value (i.e., the _counteract_mcore_loss_averaging scaling factor num_microbatches/cp_size is applied); reference the LossPostProcessor class, _counteract_mcore_loss_averaging behavior, and the patched get_context_parallel_world_size and num_microbatches/cp_size parameters to locate and validate the logic.

nemo_rl/models/megatron/train.py (1)
325-345: Fix the dual `cp_size` assignment to improve code clarity, and verify Megatron's loss-scaling behavior in `forward_only` mode.

1. `cp_size` captured via stale closure reference — valid concern.
The closure `_div_by_cp_size` (line 328) captures `cp_size` by reference, and the variable is reassigned at line 337. Both calls to `get_context_parallel_world_size()` return the same value, making this functionally safe, but it's a maintenance risk. The suggested refactor to compute `cp_size` once before the `if self.cp_normalize:` block is sound:

♻️ Suggested refactor

```diff
+        cp_size = get_context_parallel_world_size()
         if self.cp_normalize:
-            cp_size = get_context_parallel_world_size()
             prev_loss_fn = loss_fn_wrapped

             def _div_by_cp_size(*args, **kwargs):
                 loss, metrics = prev_loss_fn(*args, **kwargs)
                 return loss / cp_size, metrics

             loss_fn_wrapped = _div_by_cp_size

         # Counteract Megatron's default loss averaging in schedules.py,
         # which applies (* cp_size / num_microbatches) to the loss.
-        cp_size = get_context_parallel_world_size()
         num_microbatches = self.num_microbatches
```

2. `_counteract_mcore_loss_averaging` is applied unconditionally.
This is accurate — the loss scaling `loss * num_microbatches / cp_size` has no conditional logic for `forward_only` mode. However, `LossPostProcessor.__call__()` does not receive the `forward_only` parameter, so conditional application would require architectural changes. The branch `yifu/remove_do_not_average_loss` suggests Megatron applies loss averaging uniformly, which would justify the unconditional counteracting. Confirm that Megatron-LM's `forward_backward_func` applies the loss scaling factor uniformly regardless of `forward_only` mode; if Megatron conditionally skips averaging during inference, the current approach will produce incorrectly scaled eval losses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/megatron/train.py` around lines 325 - 345, Compute cp_size once before the if block and bind it into the closure to avoid the stale closure reference: call get_context_parallel_world_size() a single time (store in cp_size) before the if self.cp_normalize: block and ensure _div_by_cp_size captures that cp_size (e.g., by referencing the local cp_size or binding it as a default arg). Also avoid applying the mcore counter-scaling unconditionally: use the already-computed num_microbatches and cp_size and only wrap loss_fn_wrapped with _counteract_mcore_loss_averaging when in training mode (e.g., guard with if not getattr(self, "forward_only", False) or if self.training), or if forward_only isn’t available add a boolean parameter/attribute (e.g., forward_only) to LossPostProcessor.__call__/the containing object and use that to skip the counter-scaling during forward-only/eval runs. Ensure you update references to loss_fn_wrapped, _div_by_cp_size, and _counteract_mcore_loss_averaging accordingly.
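To make the two points above concrete, here is a minimal standalone sketch (not the repository's actual `LossPostProcessor` code) of the pattern the refactor and the counteraction describe: `cp_size` is read once, the closures bind their inputs as default arguments so later rebinding of outer names cannot leak in, and the loss is pre-scaled by `num_microbatches / cp_size`. The helper name `build_loss_wrapper` and the `(loss, metrics)` return shape are assumptions taken from the review text above.

```python
# Illustrative sketch only; assumes the wrapped loss fn returns (loss, metrics).
def build_loss_wrapper(loss_fn, cp_normalize, num_microbatches, cp_size):
    """Optionally normalize by CP size, then pre-scale by num_microbatches / cp_size
    to counteract a Megatron-style (* cp_size / num_microbatches) loss averaging."""
    wrapped = loss_fn

    if cp_normalize:
        # Bind current values as default args: the closure keeps the value read
        # here even if the outer names are rebound later (the "stale closure" fix).
        def _div_by_cp_size(*args, _fn=wrapped, _cp=cp_size, **kwargs):
            loss, metrics = _fn(*args, **kwargs)
            return loss / _cp, metrics

        wrapped = _div_by_cp_size

    # Pre-scale by the inverse of the averaging factor so the net effect on the
    # reported loss after Megatron's averaging is neutral.
    def _counteract_mcore_loss_averaging(
        *args, _fn=wrapped, _scale=num_microbatches / cp_size, **kwargs
    ):
        loss, metrics = _fn(*args, **kwargs)
        return loss * _scale, metrics

    return _counteract_mcore_loss_averaging
```

With `num_microbatches=4` and `cp_size=2`, the pre-scale factor is 2.0, matching the expectation in the suggested test earlier; a subsequent `* cp_size / num_microbatches` averaging step brings the net factor back to 1.0.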
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitmodules:
- Line 4: The .gitmodules entry pins the Megatron-LM submodule to a personal
feature branch ("branch = yifu/remove_do_not_average_loss") which is unstable;
update the submodule configuration so it points to the canonical branch (change
the branch field to "main") and ensure any intended changes are merged into
Megatron-LM main first; specifically edit the .gitmodules entry for the
Megatron-LM submodule (the branch = yifu/remove_do_not_average_loss line) and
replace it with branch = main, then update the submodule commit (git submodule
sync && git submodule update --init --remote) so the repo references the
upstream main tip.
In `@3rdparty/Megatron-LM-workspace/Megatron-LM`:
- Line 1: The .gitmodules entry for the Megatron-LM submodule is pinned to a
personal fork and branch (repository URL and branch name
yifu/remove_do_not_average_loss and SHA b12071b9...) which contradicts the PR
goal of using mcore main; update the submodule configuration in .gitmodules to
point to the official upstream repository and set the branch to "main" (or
remove the branch/sha pin), and remove or set shallow = false so full history is
available, then update the submodule commit (git submodule sync && git submodule
update --init --remote Megatron-LM) to reference an upstream main commit;
alternatively, if depending on the personal branch is intentional, update the PR
description to explicitly state the dependency on that fork/branch.
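Both inline prompts above describe the same re-pointing of the Megatron-LM submodule. As a rough illustration, the `.gitmodules` edit would look something like the diff below; the submodule name and path are inferred from the paths quoted in this review, and the `url` line is omitted because it is not shown here, so treat this as a sketch rather than the repository's exact entry.

```diff
 [submodule "3rdparty/Megatron-LM-workspace/Megatron-LM"]
     path = 3rdparty/Megatron-LM-workspace/Megatron-LM
-    branch = yifu/remove_do_not_average_loss
+    branch = main
```

After the edit, running `git submodule sync && git submodule update --init --remote`, as quoted in the prompts, re-points the checked-out submodule at the new branch tip.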
---
Outside diff comments:
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Line 1: Update the copyright year in the file header of
megatron_policy_worker.py from 2025 to 2026; locate the top-of-file copyright
comment and replace the year so the NVIDIA copyright header reflects 2026,
ensuring the header format remains identical otherwise.
---
Nitpick comments:
In `@nemo_rl/models/megatron/train.py`:
- Around line 325-345: Compute cp_size once before the if block and bind it into
the closure to avoid the stale closure reference: call
get_context_parallel_world_size() a single time (store in cp_size) before the if
self.cp_normalize: block and ensure _div_by_cp_size captures that cp_size (e.g.,
by referencing the local cp_size or binding it as a default arg). Also avoid
applying the mcore counter-scaling unconditionally: use the already-computed
num_microbatches and cp_size and only wrap loss_fn_wrapped with
_counteract_mcore_loss_averaging when in training mode (e.g., guard with if not
getattr(self, "forward_only", False) or if self.training), or if forward_only
isn’t available add a boolean parameter/attribute (e.g., forward_only) to
LossPostProcessor.__call__/the containing object and use that to skip the
counter-scaling during forward-only/eval runs. Ensure you update references to
loss_fn_wrapped, _div_by_cp_size, and _counteract_mcore_loss_averaging
accordingly.
In `@tests/unit/models/megatron/test_train.py`:
- Around line 681-709: The current test test_loss_post_processor_no_packing
doesn't exercise LossPostProcessor._counteract_mcore_loss_averaging because
cp_size and num_microbatches default to 1; add a new unit test variant that
patches get_context_parallel_world_size to return >1 (e.g., 2), sets
num_microbatches >1 (e.g., 4) and cp_normalize=False, constructs
LossPostProcessor(loss_fn=MagicMock(return_value=(torch.tensor(1.0), {})),
cfg={"sequence_packing": {"enabled": False}}, num_microbatches=4,
cp_normalize=False), calls the wrapped function and asserts the returned loss
equals the expected pre-scaling value (i.e., the
_counteract_mcore_loss_averaging scaling factor num_microbatches/cp_size is
applied); reference the LossPostProcessor class,
_counteract_mcore_loss_averaging behavior, and the patched
get_context_parallel_world_size and num_microbatches/cp_size parameters to
locate and validate the logic.
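One way to realize the forward-only gating suggested in the train.py prompt above, sketched under the assumption that a `forward_only` flag can be threaded down to where the wrapper is built (the flag name is hypothetical; the review notes `LossPostProcessor.__call__()` does not currently receive it):

```python
# Hypothetical guard: only counteract Megatron's averaging during training;
# eval/inference losses are returned unscaled. "forward_only" is an assumed flag.
def maybe_counteract_mcore_averaging(loss_fn, forward_only, num_microbatches, cp_size):
    if forward_only:
        return loss_fn

    def _counteract(*args, _fn=loss_fn, _scale=num_microbatches / cp_size, **kwargs):
        loss, metrics = _fn(*args, **kwargs)
        return loss * _scale, metrics

    return _counteract
```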
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
Moves logic for do_not_average_loss into NeMo RL so we can use mcore `main` directly.

Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
GRPO runs with different CP
SFT runs with different CP
Summary by CodeRabbit
Release Notes
Bug Fixes
Chores
Tests