feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix #1305

Open

jasont314 wants to merge 2 commits into NVIDIA-NeMo:main from jasont314:pr1-nemotron-pp-ep-squad

Conversation

@jasont314 jasont314 commented Feb 17, 2026

What does this PR do?

Enable stable NemotronH PP+EP SFT with fixed-length SQuAD by integrating the EP runtime path, hardening PP/FSDP synchronization and schedule behavior, and fixing label-token supervision masking so training no longer degenerates into NaN losses or zero-label-token steps.

Changelog

  • Add NemotronH EP runtime integration and safety guards in distributed parallelization paths.
  • Add EP shard-mesh null-guard behavior and mesh-aware EP axis propagation in training setup (sketched below).
  • Add PP schedule robustness updates:
    • explicit invalid-style error handling,
    • guarded NEMOAUTOMODEL_PP_SKIP_OUTPUT_MERGE patching with compatibility checks,
    • safer underfill handling interactions in tests.
  • Normalize AutoPipeline device handling (torch.device | int | str -> torch.device) and update call sites (sketched below).
  • Fix the FSDP2 divisibility error text so it matches the actual tp * cp * pp check (sketched below).
  • Make warning filtering opt-in via the NEMOAUTOMODEL_FILTER_WARNINGS env var to preserve debugging visibility (sketched below).
  • Fix the fixed-length SQuAD supervision path (sketched below):
    • truncation-aware prompt/answer masking,
    • fixed-length truncation behavior that preserves answer supervision tokens,
    • avoid all-masked label cases (num_label_tokens == 0).
  • Add optimized SQuAD PP+EP config:
    • examples/llm_finetune/nemotron/nemotron_nano_v3_pp_ep_squad.yaml
  • Add patch documentation:
    • NEMOTRON_PP_EP_SQUAD_PATCH_NOTES.md
  • Add/update unit tests for:
    • PP schedule/style/skip-merge behavior,
    • AutoPipeline device normalization,
    • EP shard-axis + missing moe_mesh safety.
  • Include SFT training artifacts for baseline vs optimized throughput comparison:
    • checkpoints/baseline_training.jsonl
    • checkpoints/optimized_training.jsonl
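
The EP shard-mesh null-guard could look roughly like the minimal sketch below. It is an illustration, not the PR's actual code: the helper name, the moe_mesh argument, and the "ep_shard" axis name are assumptions.

```python
from typing import Optional

from torch.distributed.device_mesh import DeviceMesh


def get_ep_shard_mesh(moe_mesh: Optional[DeviceMesh], ep_axis: str = "ep_shard") -> Optional[DeviceMesh]:
    """Return the EP shard sub-mesh, or None when MoE/EP is not configured.

    Guarding on a missing moe_mesh (or on a mesh without the EP axis) lets the
    parallelization path fall back to the dense path instead of raising.
    """
    if moe_mesh is None:
        return None
    if ep_axis not in (moe_mesh.mesh_dim_names or ()):
        return None
    return moe_mesh[ep_axis]
```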
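For the AutoPipeline device normalization, a minimal sketch of the torch.device | int | str -> torch.device coercion is shown below; normalize_device is an illustrative name, and treating a bare int as a CUDA ordinal is an assumption rather than confirmed PR behavior.

```python
from typing import Union

import torch


def normalize_device(device: Union[torch.device, int, str]) -> torch.device:
    """Coerce the accepted device specs into a concrete torch.device."""
    if isinstance(device, torch.device):
        return device
    if isinstance(device, int):
        # Assumed convention: a bare ordinal means a CUDA device, e.g. 3 -> cuda:3.
        return torch.device("cuda", device)
    return torch.device(device)


# Usage: normalize_device(0), normalize_device("cuda:1"), and
# normalize_device(torch.device("cpu")) all return torch.device instances.
```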
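The corrected FSDP2 divisibility check is, in spirit, the condition below; the exact variable names and error-message wording in the PR may differ.

```python
def check_dp_divisibility(world_size: int, tp: int, cp: int, pp: int) -> None:
    """Validate that the world size splits evenly across tp * cp * pp ranks."""
    denom = tp * cp * pp
    if world_size % denom != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"tp * cp * pp ({tp} * {cp} * {pp} = {denom})"
        )
```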
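The opt-in warning filtering could be wired up as in the sketch below; only the NEMOAUTOMODEL_FILTER_WARNINGS variable name comes from the changelog, while the accepted values and filtered categories are assumptions.

```python
import os
import warnings


def maybe_filter_warnings() -> None:
    """Suppress noisy warning categories only when explicitly requested via env.

    Keeping filtering opt-in preserves full warning output for debugging.
    """
    if os.environ.get("NEMOAUTOMODEL_FILTER_WARNINGS", "0").lower() in ("1", "true", "yes"):
        warnings.filterwarnings("ignore", category=UserWarning)
        warnings.filterwarnings("ignore", category=FutureWarning)
```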
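The fixed-length SQuAD supervision fix can be illustrated with the hedged sketch below, assuming a -100 ignore index and hypothetical variable names; the property it demonstrates is that truncation removes prompt tokens before answer tokens, so num_label_tokens stays > 0.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def build_fixed_length_example(prompt_ids, answer_ids, seq_len, pad_id):
    """Build fixed-length input_ids/labels without masking every answer token.

    If prompt + answer overflows seq_len, prompt tokens are dropped first (here
    from the left) so that at least part of the answer remains supervised.
    """
    overflow = len(prompt_ids) + len(answer_ids) - seq_len
    if overflow > 0:
        prompt_ids = prompt_ids[overflow:] if overflow < len(prompt_ids) else []
        answer_ids = answer_ids[: seq_len - len(prompt_ids)]

    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)  # mask the prompt

    pad = seq_len - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad

    num_label_tokens = sum(t != IGNORE_INDEX for t in labels)
    assert num_label_tokens > 0, "no answer tokens survived truncation"
    return input_ids, labels
```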

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to: N/A
  • Validation run:
    • python -m pytest -q tests/unit_tests/distributed/pipelining/test_functional.py tests/unit_tests/distributed/pipelining/test_autopipeline.py tests/unit_tests/moe/test_parallelizer.py
    • Result: 79 passed, 5 skipped, 6 warnings

Acknowledgements

Thanks to collaborators:

  • Nazar Ospanov
  • Zoir Imomaliev
  • Sanjay Adhikesaven

@copy-pr-bot copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa changed the title NemotronH: PP/EP SFT integration + fixed-length SQuAD supervision fix Pr1 nemotron pp ep squad → feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix Pr1 nemotron pp ep squad on Feb 17, 2026
@jasont314 jasont314 force-pushed the pr1-nemotron-pp-ep-squad branch from 305a441 to a74cfdf on February 17, 2026 23:36
@jasont314 jasont314 changed the title feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix Pr1 nemotron pp ep squad → feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix on Feb 17, 2026
@jasont314 jasont314 force-pushed the pr1-nemotron-pp-ep-squad branch 2 times, most recently from 32acb58 to 1173925 on February 18, 2026 03:20
Integrates NemotronH PP/EP execution and safety guards, fixes fixed-length SQuAD label/tokenization masking so num_label_tokens stays > 0, and adds PP schedule/device/logging robustness updates. Includes optimized PP+EP SQuAD config, patch notes, training artifacts (baseline/optimized JSONL), and unit-test updates (79 passed, 5 skipped).

Co-authored-by: Nazar Ospanov <aimogenius@berkeley.edu>
Co-authored-by: Zoir Imomaliev <91550816+zimo0110@users.noreply.github.com>
Co-authored-by: Sanjay Adhikesaven <sanjay.adhikesaven1@gmail.com>
Signed-off-by: Jason Trinh <jasontrinh@berkeley.edu>
@jasont314 jasont314 force-pushed the pr1-nemotron-pp-ep-squad branch from 1173925 to 2a6fee4 on February 18, 2026 03:25
@jasont314 jasont314 requested a review from akoumpa February 18, 2026 07:42
