feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix #1305

Open

jasont314 wants to merge 2 commits into NVIDIA-NeMo:main from jasont314:pr1-nemotron-pp-ep-squad

Conversation

@jasont314 jasont314 commented Feb 17, 2026

What does this PR do?

Enable stable NemotronH PP+EP SFT with fixed-length SQuAD by integrating the EP runtime path, hardening PP/FSDP synchronization and schedule behavior, and fixing label-token supervision masking so training no longer degenerates into NaN losses or zero-label-token steps.

Changelog

  • Add NemotronH EP runtime integration and safety guards in distributed parallelization paths.
  • Add EP shard-mesh null-guard behavior and mesh-aware EP axis propagation in training setup (sketched below).
  • Add PP schedule robustness updates:
    • explicit invalid-style error handling,
    • guarded NEMOAUTOMODEL_PP_SKIP_OUTPUT_MERGE patching with compatibility checks,
    • safer underfill handling interactions in tests.
  • Normalize AutoPipeline device handling (torch.device | int | str -> torch.device) and update call sites (sketched below).
  • Fix the FSDP2 divisibility error text so it matches the actual tp * cp * pp check (sketched below).
  • Make warning filtering opt-in via the NEMOAUTOMODEL_FILTER_WARNINGS env var to preserve debugging visibility (sketched below).
  • Fix the fixed-length SQuAD supervision path (sketched below):
    • truncation-aware prompt/answer masking,
    • fixed-length truncation behavior that preserves answer supervision tokens,
    • avoid all-masked label cases (num_label_tokens == 0).
  • Add optimized SQuAD PP+EP config:
    • examples/llm_finetune/nemotron/nemotron_nano_v3_pp_ep_squad.yaml
  • Add patch documentation:
    • NEMOTRON_PP_EP_SQUAD_PATCH_NOTES.md
  • Add/update unit tests for:
    • PP schedule/style/skip-merge behavior,
    • AutoPipeline device normalization,
    • EP shard-axis + missing moe_mesh safety.
  • Include SFT training artifacts for baseline vs optimized throughput comparison:
    • checkpoints/baseline_training.jsonl
    • checkpoints/optimized_training.jsonl
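
The EP shard-mesh null-guard could look roughly like the minimal sketch below. It is an illustration, not the PR's actual code: the helper name, the moe_mesh argument, and the "ep_shard" axis name are assumptions.

```python
from typing import Optional

from torch.distributed.device_mesh import DeviceMesh


def get_ep_shard_mesh(moe_mesh: Optional[DeviceMesh], ep_axis: str = "ep_shard") -> Optional[DeviceMesh]:
    """Return the EP shard sub-mesh, or None when MoE/EP is not configured.

    Guarding on a missing moe_mesh (or on a mesh without the EP axis) lets the
    parallelization path fall back to the dense path instead of raising.
    """
    if moe_mesh is None:
        return None
    if ep_axis not in (moe_mesh.mesh_dim_names or ()):
        return None
    return moe_mesh[ep_axis]
```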
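For the AutoPipeline device normalization, a minimal sketch of the torch.device | int | str -> torch.device coercion is shown below; normalize_device is an illustrative name, and treating a bare int as a CUDA ordinal is an assumption rather than confirmed PR behavior.

```python
from typing import Union

import torch


def normalize_device(device: Union[torch.device, int, str]) -> torch.device:
    """Coerce the accepted device specs into a concrete torch.device."""
    if isinstance(device, torch.device):
        return device
    if isinstance(device, int):
        # Assumed convention: a bare ordinal means a CUDA device, e.g. 3 -> cuda:3.
        return torch.device("cuda", device)
    return torch.device(device)


# Usage: normalize_device(0), normalize_device("cuda:1"), and
# normalize_device(torch.device("cpu")) all return torch.device instances.
```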
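The corrected FSDP2 divisibility check is, in spirit, the condition below; the exact variable names and error-message wording in the PR may differ.

```python
def check_dp_divisibility(world_size: int, tp: int, cp: int, pp: int) -> None:
    """Validate that the world size splits evenly across tp * cp * pp ranks."""
    denom = tp * cp * pp
    if world_size % denom != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"tp * cp * pp ({tp} * {cp} * {pp} = {denom})"
        )
```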
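The opt-in warning filtering could be wired up as in the sketch below; only the NEMOAUTOMODEL_FILTER_WARNINGS variable name comes from the changelog, while the accepted values and filtered categories are assumptions.

```python
import os
import warnings


def maybe_filter_warnings() -> None:
    """Suppress noisy warning categories only when explicitly requested via env.

    Keeping filtering opt-in preserves full warning output for debugging.
    """
    if os.environ.get("NEMOAUTOMODEL_FILTER_WARNINGS", "0").lower() in ("1", "true", "yes"):
        warnings.filterwarnings("ignore", category=UserWarning)
        warnings.filterwarnings("ignore", category=FutureWarning)
```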
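The fixed-length SQuAD supervision fix can be illustrated with the hedged sketch below, assuming a -100 ignore index and hypothetical variable names; the property it demonstrates is that truncation removes prompt tokens before answer tokens, so num_label_tokens stays > 0.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def build_fixed_length_example(prompt_ids, answer_ids, seq_len, pad_id):
    """Build fixed-length input_ids/labels without masking every answer token.

    If prompt + answer overflows seq_len, prompt tokens are dropped first (here
    from the left) so that at least part of the answer remains supervised.
    """
    overflow = len(prompt_ids) + len(answer_ids) - seq_len
    if overflow > 0:
        prompt_ids = prompt_ids[overflow:] if overflow < len(prompt_ids) else []
        answer_ids = answer_ids[: seq_len - len(prompt_ids)]

    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)  # mask the prompt

    pad = seq_len - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad

    num_label_tokens = sum(t != IGNORE_INDEX for t in labels)
    assert num_label_tokens > 0, "no answer tokens survived truncation"
    return input_ids, labels
```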

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to: N/A
  • Validation run:
    • python -m pytest -q tests/unit_tests/distributed/pipelining/test_functional.py tests/unit_tests/distributed/pipelining/test_autopipeline.py tests/unit_tests/moe/test_parallelizer.py
    • Result: 79 passed, 5 skipped, 6 warnings

Acknowledgements

Thanks to collaborators:

  • Nazar Ospanov
  • Zoir Imomaliev
  • Sanjay Adhikesaven

@copy-pr-bot copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa changed the title NemotronH: PP/EP SFT integration + fixed-length SQuAD supervision fix Pr1 nemotron pp ep squad → feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix Pr1 nemotron pp ep squad on Feb 17, 2026
@jasont314 jasont314 force-pushed the pr1-nemotron-pp-ep-squad branch from 305a441 to a74cfdf on February 17, 2026 23:36
@jasont314 jasont314 changed the title feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix Pr1 nemotron pp ep squad → feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix on Feb 17, 2026
@jasont314 jasont314 force-pushed the pr1-nemotron-pp-ep-squad branch 2 times, most recently from 32acb58 to 1173925 on February 18, 2026 03:20
Integrates NemotronH PP/EP execution and safety guards, fixes fixed-length SQuAD label/tokenization masking so num_label_tokens stays > 0, and adds PP schedule/device/logging robustness updates. Includes optimized PP+EP SQuAD config, patch notes, training artifacts (baseline/optimized JSONL), and unit-test updates (79 passed, 5 skipped).

Co-authored-by: Nazar Ospanov <aimogenius@berkeley.edu>
Co-authored-by: Zoir Imomaliev <91550816+zimo0110@users.noreply.github.com>
Co-authored-by: Sanjay Adhikesaven <sanjay.adhikesaven1@gmail.com>
Signed-off-by: Jason Trinh <jasontrinh@berkeley.edu>
@jasont314 jasont314 force-pushed the pr1-nemotron-pp-ep-squad branch from 1173925 to 2a6fee4 on February 18, 2026 03:25
@jasont314 jasont314 requested a review from akoumpa February 18, 2026 07:42
