
feat: Remove do_not_average_loss #1988

Merged
terrykong merged 2 commits into main from yifu/remove_do_not_average_loss
Feb 20, 2026

Conversation

@yfw (Contributor) commented Feb 18, 2026

What does this PR do ?

Moves logic for do_not_average_loss into nemo RL so we can use mcore main directly.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
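
A minimal sketch of the new call pattern, inferred from the change summary further down; LossPostProcessor, loss_fn, cfg, num_microbatches, and cp_normalize are names that appear in the diff and unit tests, while everything else here is a placeholder:

from nemo_rl.models.megatron.train import LossPostProcessor

def my_loss_fn(*args, **kwargs):
    # placeholder: the real loss_fn returns a (loss, metrics) pair
    ...

processor = LossPostProcessor(
    loss_fn=my_loss_fn,
    cfg={"sequence_packing": {"enabled": False}},  # cfg shape taken from the unit tests
    num_microbatches=4,  # new parameter; replaces do_not_average_loss
    cp_normalize=False,
)
# megatron_forward_backward() is then called without do_not_average_loss=True,
# which it no longer accepts.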

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

GRPO runs with different CP

[screenshot attached]

SFT runs with different CP

[screenshot attached]

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Corrected loss averaging behavior in distributed training. Loss scaling now properly accounts for microbatch counts and context-parallel configurations to ensure consistent gradient computation across training runs.
  • Chores

    • Updated Megatron-LM submodule reference to improve compatibility.
  • Tests

    • Updated unit tests to validate new microbatch-aware loss scaling behavior in distributed training scenarios.

yfw added 2 commits February 17, 2026 18:11
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw requested review from a team as code owners February 18, 2026 23:14
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 96d4a11 (PR #1988 from yifu/remove_do_not_average_loss)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

The PR removes the do_not_average_loss parameter from the Megatron training pipeline and replaces it with a num_microbatches parameter in LossPostProcessor. Loss scaling logic is adjusted to multiply by num_microbatches / cp_size to counteract Megatron's default loss averaging behavior. The Megatron-LM submodule branch switches to a custom branch supporting these changes.
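
A minimal sketch of that counteraction (illustrative, not the repository's exact code), assuming, as the review comment further down states, that Megatron's schedule multiplies the returned loss by cp_size / num_microbatches:

def counteract_mcore_loss_averaging(loss_fn, num_microbatches, cp_size):
    # Pre-scale the loss so that Megatron's subsequent (* cp_size / num_microbatches)
    # averaging cancels out and the caller sees the unaveraged value.
    def wrapped(*args, **kwargs):
        loss, metrics = loss_fn(*args, **kwargs)
        return loss * num_microbatches / cp_size, metrics
    return wrapped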

Changes

  • Submodule Configuration (.gitmodules, 3rdparty/Megatron-LM-workspace/Megatron-LM): Megatron-LM submodule branch changed from main to yifu/remove_do_not_average_loss and pointer updated to a new commit.
  • Training Pipeline Core (nemo_rl/models/megatron/train.py): Removed the do_not_average_loss parameter from megatron_forward_backward(). Added a num_microbatches parameter to the LossPostProcessor constructor. Implemented a loss-wrapping layer that scales the loss by num_microbatches / cp_size to counteract Megatron's loss averaging.
  • Worker Integration (nemo_rl/models/policy/workers/megatron_policy_worker.py): Updated the LossPostProcessor instantiation to pass num_microbatches and removed the do_not_average_loss=True argument from the megatron_forward_backward() call.
  • Test Updates (tests/unit/models/megatron/test_train.py): Added num_microbatches=4 to the LossPostProcessor instantiation in the CP normalization tests; adjusted the expected loss scaling from 0.5 to 1.0 (see the note below).
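
For intuition on the 0.5 → 1.0 adjustment (assuming the CP normalization test mocks cp_size = 2, the same value the suggested test below patches in): the cp_normalize division gives 1.0 / 2 = 0.5, the new counteraction multiplies by num_microbatches / cp_size = 4 / 2 = 2, and the test therefore observes 0.5 × 2 = 1.0, whereas the removed do_not_average_loss path stopped at 0.5.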

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

CI:L1, super-v3

Suggested reviewers

  • yuki-97
  • terrykong
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes: ⚠️ Warning. The PR removes the do_not_average_loss parameter and adds a num_microbatches parameter, changing loss scaling logic with numeric impact (test loss 0.5 → 1.0), but the PR description lacks test results, convergence analysis, or documentation justifying these breaking changes. Resolution: update the PR description with comprehensive test results, convergence analysis, and documentation explaining why the numeric changes preserve training semantics, and address the submodule dependency on a personal feature branch.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly describes the main change: removing the do_not_average_loss parameter from the codebase and moving that logic into NeMo RL.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/policy/workers/megatron_policy_worker.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Copyright year should be updated to 2026

The file is modified in this PR but the copyright header still reads 2025. As per coding guidelines, the NVIDIA copyright header should carry the current year (2026) for all modified source files.

-# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` at line 1, Update
the copyright year in the file header of megatron_policy_worker.py from 2025 to
2026; locate the top-of-file copyright comment and replace the year so the
NVIDIA copyright header reflects 2026, ensuring the header format remains
identical otherwise.
🧹 Nitpick comments (2)
tests/unit/models/megatron/test_train.py (1)

681-709: test_loss_post_processor_no_packing doesn't exercise the new mcore counteraction path

The test uses cp_normalize=False, num_microbatches=1 (default), and cp_size=1 (mocked). The _counteract_mcore_loss_averaging scaling factor is 1 / 1 = 1 — a no-op — so the test cannot catch regressions in the actual counteraction logic. Consider adding a variant with num_microbatches > 1 and/or cp_size > 1.

✅ Suggested additional test
@patch("nemo_rl.models.megatron.train.get_tensor_model_parallel_rank", return_value=0)
@patch("nemo_rl.models.megatron.train.get_tensor_model_parallel_group")
@patch("nemo_rl.models.megatron.train.get_context_parallel_group")
@patch("nemo_rl.models.megatron.train.get_context_parallel_world_size", return_value=2)
def test_loss_post_processor_no_cp_normalize_mcore_scaling(
    self, mock_cp_size, mock_cp_grp, mock_tp_grp, mock_tp_rank
):
    """Test _counteract_mcore_loss_averaging with cp_normalize=False, num_microbatches>1."""
    from nemo_rl.models.megatron.train import LossPostProcessor

    mock_loss_fn = MagicMock(return_value=(torch.tensor(1.0), {}))
    cfg = {"sequence_packing": {"enabled": False}}
    mock_tp_grp.return_value = MagicMock()
    mock_cp_grp.return_value = MagicMock()

    # cp_size=2, num_microbatches=4 → scaling = 4/2 = 2.0
    processor = LossPostProcessor(
        loss_fn=mock_loss_fn, cfg=cfg, num_microbatches=4, cp_normalize=False
    )
    wrapped_fn = processor(data_dict=MagicMock())
    loss, _ = wrapped_fn(torch.randn(1))
    # Megatron will then apply * cp_size / num_microbatches = 2/4 = 0.5
    # net result = 2.0 * 0.5 = 1.0 (original) — verify pre-scaling value here
    assert torch.isclose(loss, torch.tensor(2.0))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/models/megatron/test_train.py` around lines 681 - 709, The current
test test_loss_post_processor_no_packing doesn't exercise
LossPostProcessor._counteract_mcore_loss_averaging because cp_size and
num_microbatches default to 1; add a new unit test variant that patches
get_context_parallel_world_size to return >1 (e.g., 2), sets num_microbatches >1
(e.g., 4) and cp_normalize=False, constructs
LossPostProcessor(loss_fn=MagicMock(return_value=(torch.tensor(1.0), {})),
cfg={"sequence_packing": {"enabled": False}}, num_microbatches=4,
cp_normalize=False), calls the wrapped function and asserts the returned loss
equals the expected pre-scaling value (i.e., the
_counteract_mcore_loss_averaging scaling factor num_microbatches/cp_size is
applied); reference the LossPostProcessor class,
_counteract_mcore_loss_averaging behavior, and the patched
get_context_parallel_world_size and num_microbatches/cp_size parameters to
locate and validate the logic.
nemo_rl/models/megatron/train.py (1)

325-345: Fix the dual cp_size assignment to improve code clarity, and verify Megatron's loss-scaling behavior in forward_only mode.

1. cp_size captured via stale closure reference — valid concern.
The closure _div_by_cp_size (line 328) captures cp_size by reference, and the variable is reassigned at line 337. Both calls to get_context_parallel_world_size() return the same value, making this functionally safe, but it's a maintenance risk. The suggested refactor to compute cp_size once before the if self.cp_normalize: block is sound:

♻️ Suggested refactor
+        cp_size = get_context_parallel_world_size()
         if self.cp_normalize:
-            cp_size = get_context_parallel_world_size()
             prev_loss_fn = loss_fn_wrapped

             def _div_by_cp_size(*args, **kwargs):
                 loss, metrics = prev_loss_fn(*args, **kwargs)
                 return loss / cp_size, metrics

             loss_fn_wrapped = _div_by_cp_size

         # Counteract Megatron's default loss averaging in schedules.py,
         # which applies (* cp_size / num_microbatches) to the loss.
-        cp_size = get_context_parallel_world_size()
         num_microbatches = self.num_microbatches

2. _counteract_mcore_loss_averaging is applied unconditionally.
This is accurate — the loss scaling loss * num_microbatches / cp_size has no conditional logic for forward_only mode. However, LossPostProcessor.__call__() does not receive the forward_only parameter, so conditional application would require architectural changes. The branch yifu/remove_do_not_average_loss suggests Megatron applies loss averaging uniformly, which would justify the unconditional counteracting. Confirm that Megatron-LM's forward_backward_func applies the loss scaling factor uniformly regardless of forward_only mode; if Megatron conditionally skips averaging during inference, the current approach will produce incorrectly scaled eval losses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/megatron/train.py` around lines 325 - 345, Compute cp_size
once before the if block and bind it into the closure to avoid the stale closure
reference: call get_context_parallel_world_size() a single time (store in
cp_size) before the if self.cp_normalize: block and ensure _div_by_cp_size
captures that cp_size (e.g., by referencing the local cp_size or binding it as a
default arg). Also avoid applying the mcore counter-scaling unconditionally: use
the already-computed num_microbatches and cp_size and only wrap loss_fn_wrapped
with _counteract_mcore_loss_averaging when in training mode (e.g., guard with if
not getattr(self, "forward_only", False) or if self.training), or if
forward_only isn’t available add a boolean parameter/attribute (e.g.,
forward_only) to LossPostProcessor.__call__/the containing object and use that
to skip the counter-scaling during forward-only/eval runs. Ensure you update
references to loss_fn_wrapped, _div_by_cp_size, and
_counteract_mcore_loss_averaging accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitmodules:
- Line 4: The .gitmodules entry pins the Megatron-LM submodule to a personal
feature branch ("branch = yifu/remove_do_not_average_loss") which is unstable;
update the submodule configuration so it points to the canonical branch (change
the branch field to "main") and ensure any intended changes are merged into
Megatron-LM main first; specifically edit the .gitmodules entry for the
Megatron-LM submodule (the branch = yifu/remove_do_not_average_loss line) and
replace it with branch = main, then update the submodule commit (git submodule
sync && git submodule update --init --remote) so the repo references the
upstream main tip.

In `@3rdparty/Megatron-LM-workspace/Megatron-LM`:
- Line 1: The .gitmodules entry for the Megatron-LM submodule is pinned to a
personal fork and branch (repository URL and branch name
yifu/remove_do_not_average_loss and SHA b12071b9...) which contradicts the PR
goal of using mcore main; update the submodule configuration in .gitmodules to
point to the official upstream repository and set the branch to "main" (or
remove the branch/sha pin), and remove or set shallow = false so full history is
available, then update the submodule commit (git submodule sync && git submodule
update --init --remote Megatron-LM) to reference an upstream main commit;
alternatively, if depending on the personal branch is intentional, update the PR
description to explicitly state the dependency on that fork/branch.


@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Feb 19, 2026
@yfw yfw requested review from terrykong and yaoyu-33 February 19, 2026 03:46
@terrykong terrykong enabled auto-merge (squash) February 19, 2026 03:57
@terrykong terrykong merged commit 84bede0 into main Feb 20, 2026
112 of 123 checks passed
@terrykong terrykong deleted the yifu/remove_do_not_average_loss branch February 20, 2026 00:15
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Labels

CI:L1 Run doctests, unit tests, and functional tests
