
feat: Remove do_not_average_loss #1988

Merged
terrykong merged 2 commits into main from yifu/remove_do_not_average_loss
Feb 20, 2026

Conversation

@yfw (Contributor) commented Feb 18, 2026

What does this PR do ?

Moves logic for do_not_average_loss into nemo RL so we can use mcore main directly.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
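
A minimal sketch of the new call pattern, inferred from the change summary further down; LossPostProcessor, loss_fn, cfg, num_microbatches, and cp_normalize are names that appear in the diff and unit tests, while everything else here is a placeholder:

from nemo_rl.models.megatron.train import LossPostProcessor

def my_loss_fn(*args, **kwargs):
    # placeholder: the real loss_fn returns a (loss, metrics) pair
    ...

processor = LossPostProcessor(
    loss_fn=my_loss_fn,
    cfg={"sequence_packing": {"enabled": False}},  # cfg shape taken from the unit tests
    num_microbatches=4,  # new parameter; replaces do_not_average_loss
    cp_normalize=False,
)
# megatron_forward_backward() is then called without do_not_average_loss=True,
# which it no longer accepts.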

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

GRPO runs with different CP

[screenshot attached]

SFT runs with different CP

[screenshot attached]

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Corrected loss averaging behavior in distributed training. Loss scaling now properly accounts for microbatch counts and context-parallel configurations to ensure consistent gradient computation across training runs.
  • Chores

    • Updated Megatron-LM submodule reference to improve compatibility.
  • Tests

    • Updated unit tests to validate new microbatch-aware loss scaling behavior in distributed training scenarios.

yfw added 2 commits February 17, 2026 18:11
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw requested review from a team as code owners February 18, 2026 23:14
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 96d4a11 (PR #1988 from yifu/remove_do_not_average_loss)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

The PR removes the do_not_average_loss parameter from the Megatron training pipeline and replaces it with a num_microbatches parameter in LossPostProcessor. Loss scaling logic is adjusted to multiply by num_microbatches / cp_size to counteract Megatron's default loss averaging behavior. The Megatron-LM submodule branch switches to a custom branch supporting these changes.
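
A minimal sketch of that counteraction (illustrative, not the repository's exact code), assuming, as the review comment further down states, that Megatron's schedule multiplies the returned loss by cp_size / num_microbatches:

def counteract_mcore_loss_averaging(loss_fn, num_microbatches, cp_size):
    # Pre-scale the loss so that Megatron's subsequent (* cp_size / num_microbatches)
    # averaging cancels out and the caller sees the unaveraged value.
    def wrapped(*args, **kwargs):
        loss, metrics = loss_fn(*args, **kwargs)
        return loss * num_microbatches / cp_size, metrics
    return wrapped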

Changes

  • Submodule Configuration (.gitmodules, 3rdparty/Megatron-LM-workspace/Megatron-LM): Megatron-LM submodule branch changed from main to yifu/remove_do_not_average_loss and pointer updated to a new commit.
  • Training Pipeline Core (nemo_rl/models/megatron/train.py): Removed the do_not_average_loss parameter from megatron_forward_backward(). Added a num_microbatches parameter to the LossPostProcessor constructor. Implemented a loss-wrapping layer that scales the loss by num_microbatches / cp_size to counteract Megatron's loss averaging.
  • Worker Integration (nemo_rl/models/policy/workers/megatron_policy_worker.py): Updated the LossPostProcessor instantiation to pass num_microbatches and removed the do_not_average_loss=True argument from the megatron_forward_backward() call.
  • Test Updates (tests/unit/models/megatron/test_train.py): Added num_microbatches=4 to the LossPostProcessor instantiation in the CP normalization tests; adjusted the expected loss scaling from 0.5 to 1.0 (see the note below).
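
For intuition on the 0.5 → 1.0 adjustment (assuming the CP normalization test mocks cp_size = 2, the same value the suggested test below patches in): the cp_normalize division gives 1.0 / 2 = 0.5, the new counteraction multiplies by num_microbatches / cp_size = 4 / 2 = 2, and the test therefore observes 0.5 × 2 = 1.0, whereas the removed do_not_average_loss path stopped at 0.5.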

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

CI:L1, super-v3

Suggested reviewers

  • yuki-97
  • terrykong
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes: ⚠️ Warning. The PR removes the do_not_average_loss parameter and adds a num_microbatches parameter, changing loss scaling logic with numeric impact (test loss 0.5 → 1.0), but the PR description lacks test results, convergence analysis, or documentation justifying these breaking changes. Resolution: update the PR description with comprehensive test results, convergence analysis, and documentation explaining why the numeric changes preserve training semantics, and address the submodule dependency on a personal feature branch.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly describes the main change: removing the do_not_average_loss parameter from the codebase and moving that logic into NeMo RL.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Copyright year should be updated to 2026

The file is modified in this PR but the copyright header still reads 2025. As per coding guidelines, the NVIDIA copyright header should carry the current year (2026) for all modified source files.

-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` at line 1, Update
the copyright year in the file header of megatron_policy_worker.py from 2025 to
2026; locate the top-of-file copyright comment and replace the year so the
NVIDIA copyright header reflects 2026, ensuring the header format remains
identical otherwise.
🧹 Nitpick comments (2)
tests/unit/models/megatron/test_train.py (1)

681-709: test_loss_post_processor_no_packing doesn't exercise the new mcore counteraction path

The test uses cp_normalize=False, num_microbatches=1 (default), and cp_size=1 (mocked). The _counteract_mcore_loss_averaging scaling factor is 1 / 1 = 1 — a no-op — so the test cannot catch regressions in the actual counteraction logic. Consider adding a variant with num_microbatches > 1 and/or cp_size > 1.

✅ Suggested additional test
@patch("nemo_rl.models.megatron.train.get_tensor_model_parallel_rank", return_value=0)
@patch("nemo_rl.models.megatron.train.get_tensor_model_parallel_group")
@patch("nemo_rl.models.megatron.train.get_context_parallel_group")
@patch("nemo_rl.models.megatron.train.get_context_parallel_world_size", return_value=2)
def test_loss_post_processor_no_cp_normalize_mcore_scaling(
    self, mock_cp_size, mock_cp_grp, mock_tp_grp, mock_tp_rank
):
    """Test _counteract_mcore_loss_averaging with cp_normalize=False, num_microbatches>1."""
    from nemo_rl.models.megatron.train import LossPostProcessor

    mock_loss_fn = MagicMock(return_value=(torch.tensor(1.0), {}))
    cfg = {"sequence_packing": {"enabled": False}}
    mock_tp_grp.return_value = MagicMock()
    mock_cp_grp.return_value = MagicMock()

    # cp_size=2, num_microbatches=4 → scaling = 4/2 = 2.0
    processor = LossPostProcessor(
        loss_fn=mock_loss_fn, cfg=cfg, num_microbatches=4, cp_normalize=False
    )
    wrapped_fn = processor(data_dict=MagicMock())
    loss, _ = wrapped_fn(torch.randn(1))
    # Megatron will then apply * cp_size / num_microbatches = 2/4 = 0.5
    # net result = 2.0 * 0.5 = 1.0 (original) — verify pre-scaling value here
    assert torch.isclose(loss, torch.tensor(2.0))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/models/megatron/test_train.py` around lines 681 - 709, The current
test test_loss_post_processor_no_packing doesn't exercise
LossPostProcessor._counteract_mcore_loss_averaging because cp_size and
num_microbatches default to 1; add a new unit test variant that patches
get_context_parallel_world_size to return >1 (e.g., 2), sets num_microbatches >1
(e.g., 4) and cp_normalize=False, constructs
LossPostProcessor(loss_fn=MagicMock(return_value=(torch.tensor(1.0), {})),
cfg={"sequence_packing": {"enabled": False}}, num_microbatches=4,
cp_normalize=False), calls the wrapped function and asserts the returned loss
equals the expected pre-scaling value (i.e., the
_counteract_mcore_loss_averaging scaling factor num_microbatches/cp_size is
applied); reference the LossPostProcessor class,
_counteract_mcore_loss_averaging behavior, and the patched
get_context_parallel_world_size and num_microbatches/cp_size parameters to
locate and validate the logic.
nemo_rl/models/megatron/train.py (1)

325-345: Fix the dual cp_size assignment to improve code clarity, and verify Megatron's loss-scaling behavior in forward_only mode.

1. cp_size captured via stale closure reference — valid concern.
The closure _div_by_cp_size (line 328) captures cp_size by reference, and the variable is reassigned at line 337. Both calls to get_context_parallel_world_size() return the same value, making this functionally safe, but it's a maintenance risk. The suggested refactor to compute cp_size once before the if self.cp_normalize: block is sound:

♻️ Suggested refactor
+        cp_size = get_context_parallel_world_size()
         if self.cp_normalize:
-            cp_size = get_context_parallel_world_size()
             prev_loss_fn = loss_fn_wrapped

             def _div_by_cp_size(*args, **kwargs):
                 loss, metrics = prev_loss_fn(*args, **kwargs)
                 return loss / cp_size, metrics

             loss_fn_wrapped = _div_by_cp_size

         # Counteract Megatron's default loss averaging in schedules.py,
         # which applies (* cp_size / num_microbatches) to the loss.
-        cp_size = get_context_parallel_world_size()
         num_microbatches = self.num_microbatches

2. _counteract_mcore_loss_averaging is applied unconditionally.
This is accurate — the loss scaling loss * num_microbatches / cp_size has no conditional logic for forward_only mode. However, LossPostProcessor.__call__() does not receive the forward_only parameter, so conditional application would require architectural changes. The branch yifu/remove_do_not_average_loss suggests Megatron applies loss averaging uniformly, which would justify the unconditional counteracting. Confirm that Megatron-LM's forward_backward_func applies the loss scaling factor uniformly regardless of forward_only mode; if Megatron conditionally skips averaging during inference, the current approach will produce incorrectly scaled eval losses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/megatron/train.py` around lines 325 - 345, Compute cp_size
once before the if block and bind it into the closure to avoid the stale closure
reference: call get_context_parallel_world_size() a single time (store in
cp_size) before the if self.cp_normalize: block and ensure _div_by_cp_size
captures that cp_size (e.g., by referencing the local cp_size or binding it as a
default arg). Also avoid applying the mcore counter-scaling unconditionally: use
the already-computed num_microbatches and cp_size and only wrap loss_fn_wrapped
with _counteract_mcore_loss_averaging when in training mode (e.g., guard with if
not getattr(self, "forward_only", False) or if self.training), or if
forward_only isn’t available add a boolean parameter/attribute (e.g.,
forward_only) to LossPostProcessor.__call__/the containing object and use that
to skip the counter-scaling during forward-only/eval runs. Ensure you update
references to loss_fn_wrapped, _div_by_cp_size, and
_counteract_mcore_loss_averaging accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitmodules:
- Line 4: The .gitmodules entry pins the Megatron-LM submodule to a personal
feature branch ("branch = yifu/remove_do_not_average_loss") which is unstable;
update the submodule configuration so it points to the canonical branch (change
the branch field to "main") and ensure any intended changes are merged into
Megatron-LM main first; specifically edit the .gitmodules entry for the
Megatron-LM submodule (the branch = yifu/remove_do_not_average_loss line) and
replace it with branch = main, then update the submodule commit (git submodule
sync && git submodule update --init --remote) so the repo references the
upstream main tip.

In `@3rdparty/Megatron-LM-workspace/Megatron-LM`:
- Line 1: The .gitmodules entry for the Megatron-LM submodule is pinned to a
personal fork and branch (repository URL and branch name
yifu/remove_do_not_average_loss and SHA b12071b9...) which contradicts the PR
goal of using mcore main; update the submodule configuration in .gitmodules to
point to the official upstream repository and set the branch to "main" (or
remove the branch/sha pin), and remove or set shallow = false so full history is
available, then update the submodule commit (git submodule sync && git submodule
update --init --remote Megatron-LM) to reference an upstream main commit;
alternatively, if depending on the personal branch is intentional, update the PR
description to explicitly state the dependency on that fork/branch.


@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Feb 19, 2026
@yfw yfw requested review from terrykong and yaoyu-33 February 19, 2026 03:46
@terrykong terrykong enabled auto-merge (squash) February 19, 2026 03:57
@terrykong terrykong merged commit 84bede0 into main Feb 20, 2026
112 of 123 checks passed
@terrykong terrykong deleted the yifu/remove_do_not_average_loss branch February 20, 2026 00:15
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests
