
Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refining the Megatron integration within the ms-swift framework by updating documentation and example configurations. The primary goal is to streamline model handling, particularly by transitioning towards Safetensors for loading and saving model weights, which offers improved security and efficiency. The changes also include necessary adjustments to argument parsing and checkpoint saving mechanisms to fully support this new approach, enhancing the overall robustness of large model training workflows.

Highlights

  • Documentation Update: Simplified environment setup instructions in both Chinese and English best practices documents by removing specific mentions of HF and MCore model weight conversion, streamlining the guidance for users.
  • Safetensors Adoption: Updated example training scripts in the documentation to explicitly use --model for specifying models and added --load_safetensors true --save_safetensors true, promoting the use of the Safetensors format for model loading and saving.
  • Megatron Argument Refinement: Adjusted the default behavior of load_safetensors and save_safetensors in ExtraMegatronArguments, making save_safetensors default to True and adding logic to set load_safetensors automatically when it is not explicitly provided (see the illustrative sketch after this list).
  • Checkpoint Saving Enhancement: Modified the save_checkpoint method in the base trainer to explicitly accept the model and introduced conditional logic to handle the saving process, including temporary adjustment of args.save and options to skip saving optimizer or RNG states.
  • Robust State Dictionary Generation: Added a check in generate_state_dict to ensure robustness by gracefully handling cases where the state dictionary might not contain a 'model' key, preventing potential errors during state dictionary processing.
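
To make the new defaults concrete, here is a minimal sketch of how the load_safetensors/save_safetensors resolution could look. It assumes a dataclass-style ExtraMegatronArguments; the field set and the exact rule are illustrative, not the merged implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtraMegatronArguments:
    # Illustrative subset of fields; the real class defines many more.
    load: Optional[str] = None               # optional MCore checkpoint path (--load)
    model: Optional[str] = None              # HF/safetensors model id or path (--model)
    save_safetensors: bool = True            # saving now defaults to safetensors
    load_safetensors: Optional[bool] = None  # None means "decide automatically"

    def __post_init__(self):
        if self.load_safetensors is None:
            # Hypothetical rule: load safetensors weights when no MCore
            # checkpoint is supplied via --load.
            self.load_safetensors = self.load is None
```

Under this sketch, an example script that passes only --model (plus --load_safetensors true --save_safetensors true, as the updated documentation scripts do) resolves to safetensors loading and saving, while a script that still passes --load keeps the traditional MCore checkpoint path.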


@gemini-code-assist bot left a comment


Code Review

This pull request updates Megatron-related shell scripts and underlying Python code. The changes primarily focus on improving the model loading and saving mechanisms, particularly by introducing and defaulting to safetensors.

Key changes include:

  • Updating documentation and example scripts to use --model instead of --load and enabling safetensors by default.
  • Modifying MegatronArguments to support the new safetensors logic with smart defaults.
  • Refactoring the save_checkpoint logic in MegatronTrainer to handle safetensors for model weights while still using Megatron's checkpointing for optimizer and RNG states.

My main feedback is on the save_checkpoint implementation, where there appears to be a bug that could lead to redundant saving of model weights. I've left a detailed comment on this. Additionally, a defensive check has been added in utils.py, which is a good improvement.

Comment on lines 954 to 957
if args.no_save_optim:
    model = []
with adapter_state_dict_context(is_peft_format=save_peft_format):
    self._origin_save_checkpoint(iteration, model, *_args, **kwargs)

Severity: high

This logic appears to have a bug. When save_safetensors is true, model weights are already saved via self.bridge.save_weights. This block is intended to save only the optimizer and/or RNG states. However, if args.no_save_optim is False, the original model object is passed to _origin_save_checkpoint, which causes the model weights to be saved again in the Megatron checkpoint format. This is redundant and wastes storage.

To prevent this, the model variable should be set to an empty list ([]) unconditionally before calling _origin_save_checkpoint to signal that it should skip saving the model's state dictionary. The optimizer state can still be saved correctly as the optimizer object holds its own reference to the model if needed (e.g., for distributed optimizers).
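
A short sketch of the behavior this comment proposes, reusing only the names visible in the quoted diff (everything else about the surrounding method is assumed):

```python
# Sketch of the reviewer's suggestion, not the merged code. At this point the
# model weights are assumed to have already been written as safetensors via
# self.bridge.save_weights, so the Megatron-format checkpoint should only
# carry optimizer and RNG state.
model = []  # unconditionally drop the model state dict from the Megatron checkpoint
with adapter_state_dict_context(is_peft_format=save_peft_format):
    self._origin_save_checkpoint(iteration, model, *_args, **kwargs)
```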

@Jintao-Huang
Collaborator Author

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request introduces significant updates to the Megatron integration, primarily focused on standardizing model loading via safetensors and improving checkpointing for resuming training. The changes are extensive, touching many example shell scripts to use --model instead of --load, and updating documentation to reflect the new practices. The core logic for saving and loading checkpoints has been refactored to support these new workflows, including better handling of LoRA adapters and optimizer states. While the changes are generally positive and well-aligned with the goal, I've identified a critical issue in the checkpoint loading logic that could prevent training resumption, along with a few minor inconsistencies and potential typos in the shell scripts and documentation. Addressing these points will ensure the new functionality is robust and reliable.

with open(iteration_path, 'r') as f:
    iteration = f.read()

common_path = os.path.join(ckpt_dir, f'iter_{5:07d}', 'common.pt')

Severity: critical

The iteration number is hardcoded as 5 here. This will cause the function to always look for the checkpoint of iteration 5, instead of the latest one specified in latest_checkpointed_iteration.txt. This will prevent training from resuming correctly. The iteration variable should be used instead.

Suggested change
- common_path = os.path.join(ckpt_dir, f'iter_{5:07d}', 'common.pt')
+ common_path = os.path.join(ckpt_dir, f'iter_{int(iteration):07d}', 'common.pt')
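
Putting the pieces together, the resume-path resolution described above would look roughly like the following. Only latest_checkpointed_iteration.txt, ckpt_dir, iteration, and common.pt come from the discussion; the helper name is hypothetical.

```python
import os


def _latest_common_pt(ckpt_dir: str) -> str:
    """Hypothetical helper: locate common.pt for the most recent iteration."""
    # Megatron records the latest iteration number in this text file and stores
    # each checkpoint under an iter_XXXXXXX directory (zero-padded to 7 digits).
    iteration_path = os.path.join(ckpt_dir, 'latest_checkpointed_iteration.txt')
    with open(iteration_path, 'r') as f:
        iteration = f.read().strip()
    return os.path.join(ckpt_dir, f'iter_{int(iteration):07d}', 'common.pt')
```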

@Jintao-Huang
Collaborator Author

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request significantly improves the user experience for Megatron-SWIFT by updating the workflow to support direct loading and saving of safetensors weights. This removes the need for a separate, explicit weight conversion step. The changes are comprehensive, touching documentation, example scripts, and core Python logic to support this new, streamlined process. The documentation in both English and Chinese has been updated to reflect these changes, and numerous example shell scripts have been modified to use the new arguments. The core logic for argument parsing, model loading, and checkpointing has been refactored to accommodate the new workflow. Overall, this is a great enhancement. I've found one minor typo in an example script that I've commented on.

@Jintao-Huang
Collaborator Author

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request significantly updates the Megatron integration by streamlining the workflow for loading and saving models. It transitions from a manual weight conversion process to directly using safetensors via Mcore-Bridge, which is a great improvement for user experience. The documentation and example scripts have been extensively updated to reflect this new recommended approach, while still maintaining the traditional method for backward compatibility. The core logic for handling checkpoints and arguments has been refactored accordingly. My review includes a critical bug fix for resuming training from safetensors checkpoints and a minor suggestion to improve code clarity.

@Jintao-Huang
Collaborator Author

/gemini review

@Jintao-Huang merged commit f143ad3 into modelscope:main on Dec 16, 2025
2 of 3 checks passed

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and welcome update to the Megatron-SWIFT workflow by better integrating Mcore-Bridge. The changes streamline the training process by enabling direct loading and saving of safetensors weights, which simplifies the user experience by removing manual weight conversion steps. The documentation and example scripts have been thoroughly updated to reflect the new recommended workflow, distinguishing it from the traditional method. The core logic changes, especially around checkpointing and model loading, appear correct and robustly support features like resuming training from safetensors checkpoints. I have one minor suggestion for improving argument parsing consistency. Overall, this is an excellent enhancement that makes Megatron-SWIFT more user-friendly.

Comment on lines +723 to +724
if isinstance(self.ref_adapters, str):
    self.ref_adapters = [self.ref_adapters]

Severity: medium

For consistency and robustness, it would be good to also handle the case where self.adapters is a string, similar to how self.ref_adapters is handled. This ensures that if a single adapter path is passed as a string (e.g., from a config file), it is correctly converted to a list.

Suggested change
- if isinstance(self.ref_adapters, str):
-     self.ref_adapters = [self.ref_adapters]
+ if isinstance(self.adapters, str):
+     self.adapters = [self.adapters]
+ if isinstance(self.ref_adapters, str):
+     self.ref_adapters = [self.ref_adapters]
