feat: NemotronH PP/EP SFT integration + fixed-length SQuAD supervision fix #1305
Open
jasont314 wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
akoumpa reviewed Feb 17, 2026
Integrates NemotronH PP/EP execution and safety guards, fixes fixed-length SQuAD label/tokenization masking so num_label_tokens stays > 0, and adds PP schedule/device/logging robustness updates. Includes optimized PP+EP SQuAD config, patch notes, training artifacts (baseline/optimized JSONL), and unit-test updates (79 passed, 5 skipped).

Co-authored-by: Nazar Ospanov <aimogenius@berkeley.edu>
Co-authored-by: Zoir Imomaliev <91550816+zimo0110@users.noreply.github.com>
Co-authored-by: Sanjay Adhikesaven <sanjay.adhikesaven1@gmail.com>
Signed-off-by: Jason Trinh <jasontrinh@berkeley.edu>
What does this PR do?
Enable stable NemotronH PP+EP SFT with fixed-length SQuAD by integrating the EP runtime path, hardening PP/FSDP synchronization/schedule behavior, and fixing label-token supervision masking so training no longer degenerates to NaN/zero-label-token steps.
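To make the supervision fix concrete, here is a minimal sketch of fixed-length label masking that keeps num_label_tokens above zero; the function name, arguments, and masking details are illustrative assumptions, not this PR's actual implementation.

# Illustrative sketch only (not the PR's code): fixed-length SQuAD-style label
# masking that ignores prompt and padding positions with -100 while keeping
# answer tokens supervised, so num_label_tokens stays > 0.
from typing import Dict, List

IGNORE_INDEX = -100  # PyTorch cross_entropy ignore_index default

def build_fixed_length_labels(
    input_ids: List[int], prompt_len: int, pad_token_id: int, seq_len: int
) -> Dict[str, object]:
    """Pad/truncate to seq_len and supervise only the answer tokens."""
    ids = input_ids[:seq_len] + [pad_token_id] * max(0, seq_len - len(input_ids))
    labels = list(ids)
    for i in range(min(prompt_len, seq_len)):                # mask the prompt
        labels[i] = IGNORE_INDEX
    for i in range(min(len(input_ids), seq_len), seq_len):   # mask padding
        labels[i] = IGNORE_INDEX
    num_label_tokens = sum(label != IGNORE_INDEX for label in labels)
    if num_label_tokens == 0:
        # Zero supervised tokens makes the mean loss 0/0 -> NaN, the failure
        # mode described above; surface it instead of silently training on it.
        raise ValueError("No supervised label tokens after masking.")
    return {"input_ids": ids, "labels": labels, "num_label_tokens": num_label_tokens}

In practice, label shifting for next-token prediction is typically handled by the model or the collator and is omitted from this sketch.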
Changelog
- Add NEMOAUTOMODEL_PP_SKIP_OUTPUT_MERGE patching with compatibility checks.
- Normalize device handling (torch.device | int | str -> torch.device) and update call sites.
- Harden PP schedule validation around the tp * cp * pp logic.
- Add a warning-filter toggle (NEMOAUTOMODEL_FILTER_WARNINGS) for better debugging visibility.
- Fix fixed-length SQuAD label/tokenization masking so batches no longer degenerate to zero supervision (num_label_tokens == 0).
- Add the optimized PP+EP SQuAD config examples/llm_finetune/nemotron/nemotron_nano_v3_pp_ep_squad.yaml.
- Add patch notes in NEMOTRON_PP_EP_SQUAD_PATCH_NOTES.md.
- Add moe_mesh safety guards.
- Include training artifacts checkpoints/baseline_training.jsonl and checkpoints/optimized_training.jsonl.

(An illustrative sketch of the device-handling and warning-filter helpers appears under Additional Information below.)

Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
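As a companion to the device-handling and warning-filter changelog items, the following is a minimal, illustrative sketch; the helper names, the bare-int-means-CUDA-index behavior, and the "=1 enables filtering" semantics are assumptions, not this PR's actual code.

# Illustrative sketch only: normalize torch.device | int | str inputs to a
# torch.device, and gate warning filtering behind an environment variable.
import os
import warnings
from typing import Union

import torch

def normalize_device(device: Union[torch.device, int, str]) -> torch.device:
    """Coerce a torch.device, device index, or device string to torch.device."""
    if isinstance(device, torch.device):
        return device
    if isinstance(device, int):
        # Assumption: a bare int is treated as a CUDA device index.
        return torch.device("cuda", device)
    return torch.device(device)  # e.g. "cuda:0", "cpu"

def maybe_filter_warnings() -> None:
    """Suppress noisy warnings only when the env toggle is explicitly enabled."""
    # Assumption: "1" enables filtering; the PR only names the variable.
    if os.environ.get("NEMOAUTOMODEL_FILTER_WARNINGS", "0") == "1":
        warnings.filterwarnings("ignore", category=UserWarning)

The unit-test command and results for this PR follow.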
python -m pytest -q tests/unit_tests/distributed/pipelining/test_functional.py tests/unit_tests/distributed/pipelining/test_autopipeline.py tests/unit_tests/moe/test_parallelizer.py
79 passed, 5 skipped, 6 warnings

Acknowledgements
Thanks to collaborators: Nazar Ospanov, Zoir Imomaliev, and Sanjay Adhikesaven.