[megatron] Update megatron shells #6967
Conversation
Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on refining the Megatron integration.
Code Review
This pull request updates Megatron-related shell scripts and underlying Python code. The changes primarily focus on improving the model loading and saving mechanisms, particularly by introducing and defaulting to safetensors.
Key changes include:
- Updating documentation and example scripts to use `--model` instead of `--load` and enabling `safetensors` by default.
- Modifying `MegatronArguments` to support the new `safetensors` logic with smart defaults.
- Refactoring the `save_checkpoint` logic in `MegatronTrainer` to handle `safetensors` for model weights while still using Megatron's checkpointing for optimizer and RNG states.

My main feedback is on the `save_checkpoint` implementation, where there appears to be a bug that could lead to redundant saving of model weights. I've left a detailed comment on this. Additionally, a defensive check has been added in `utils.py`, which is a good improvement.
swift/megatron/trainers/base.py (Outdated)
```python
if args.no_save_optim:
    model = []
with adapter_state_dict_context(is_peft_format=save_peft_format):
    self._origin_save_checkpoint(iteration, model, *_args, **kwargs)
```
This logic appears to have a bug. When save_safetensors is true, model weights are already saved via self.bridge.save_weights. This block is intended to save only the optimizer and/or RNG states. However, if args.no_save_optim is False, the original model object is passed to _origin_save_checkpoint, which causes the model weights to be saved again in the Megatron checkpoint format. This is redundant and wastes storage.
To prevent this, the model variable should be set to an empty list ([]) unconditionally before calling _origin_save_checkpoint to signal that it should skip saving the model's state dictionary. The optimizer state can still be saved correctly as the optimizer object holds its own reference to the model if needed (e.g., for distributed optimizers).
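A minimal sketch of how the reviewer's suggestion might look inside the `save_checkpoint` override; the `save_safetensors` guard, `bridge.save_weights`, and `_origin_save_checkpoint` names are taken from the snippet and comment above, while the exact surrounding context is an assumption:

```python
# Sketch only: assumes this runs inside MegatronTrainer.save_checkpoint, after the
# model weights have already been written as safetensors via self.bridge.save_weights.
if args.save_safetensors:
    # Always pass an empty model list so Megatron skips re-saving model weights;
    # only optimizer and RNG states end up in the Megatron-format checkpoint.
    model = []
    with adapter_state_dict_context(is_peft_format=save_peft_format):
        self._origin_save_checkpoint(iteration, model, *_args, **kwargs)
```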
/gemini review
Code Review
This pull request introduces significant updates to the Megatron integration, primarily focused on standardizing model loading via safetensors and improving checkpointing for resuming training. The changes are extensive, touching many example shell scripts to use --model instead of --load, and updating documentation to reflect the new practices. The core logic for saving and loading checkpoints has been refactored to support these new workflows, including better handling of LoRA adapters and optimizer states. While the changes are generally positive and well-aligned with the goal, I've identified a critical issue in the checkpoint loading logic that could prevent training resumption, along with a few minor inconsistencies and potential typos in the shell scripts and documentation. Addressing these points will ensure the new functionality is robust and reliable.
swift/megatron/trainers/base.py (Outdated)
```python
with open(iteration_path, 'r') as f:
    iteration = f.read()

common_path = os.path.join(ckpt_dir, f'iter_{5:07d}', 'common.pt')
```
The iteration number is hardcoded as 5 here. This will cause the function to always look for the checkpoint of iteration 5, instead of the latest one specified in latest_checkpointed_iteration.txt. This will prevent training from resuming correctly. The iteration variable should be used instead.
Suggested change:

```diff
- common_path = os.path.join(ckpt_dir, f'iter_{5:07d}', 'common.pt')
+ common_path = os.path.join(ckpt_dir, f'iter_{int(iteration):07d}', 'common.pt')
```
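For illustration, a self-contained sketch of the intended lookup after this fix; the helper name is hypothetical, while `latest_checkpointed_iteration.txt` and the `iter_{:07d}` directory pattern come from the hunk and comment above:

```python
import os

def latest_common_path(ckpt_dir: str) -> str:
    # Hypothetical helper: resolve common.pt for the latest checkpointed iteration
    # recorded on disk, rather than a hardcoded iteration number.
    iteration_path = os.path.join(ckpt_dir, 'latest_checkpointed_iteration.txt')
    with open(iteration_path, 'r') as f:
        iteration = f.read().strip()
    # Iteration directories are zero-padded to 7 digits, e.g. iter_0000042.
    return os.path.join(ckpt_dir, f'iter_{int(iteration):07d}', 'common.pt')
```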
…e_megatron_shells
/gemini review
Code Review
This pull request significantly improves the user experience for Megatron-SWIFT by updating the workflow to support direct loading and saving of safetensors weights. This removes the need for a separate, explicit weight conversion step. The changes are comprehensive, touching documentation, example scripts, and core Python logic to support this new, streamlined process. The documentation in both English and Chinese has been updated to reflect these changes, and numerous example shell scripts have been modified to use the new arguments. The core logic for argument parsing, model loading, and checkpointing has been refactored to accommodate the new workflow. Overall, this is a great enhancement. I've found one minor typo in an example script that I've commented on.
/gemini review
Code Review
This pull request significantly updates the Megatron integration by streamlining the workflow for loading and saving models. It transitions from a manual weight conversion process to directly using safetensors via Mcore-Bridge, which is a great improvement for user experience. The documentation and example scripts have been extensively updated to reflect this new recommended approach, while still maintaining the traditional method for backward compatibility. The core logic for handling checkpoints and arguments has been refactored accordingly. My review includes a critical bug fix for resuming training from safetensors checkpoints and a minor suggestion to improve code clarity.
/gemini review
Code Review
This pull request introduces a significant and welcome update to the Megatron-SWIFT workflow by better integrating Mcore-Bridge. The changes streamline the training process by enabling direct loading and saving of safetensors weights, which simplifies the user experience by removing manual weight conversion steps. The documentation and example scripts have been thoroughly updated to reflect the new recommended workflow, distinguishing it from the traditional method. The core logic changes, especially around checkpointing and model loading, appear correct and robustly support features like resuming training from safetensors checkpoints. I have one minor suggestion for improving argument parsing consistency. Overall, this is an excellent enhancement that makes Megatron-SWIFT more user-friendly.
```python
if isinstance(self.ref_adapters, str):
    self.ref_adapters = [self.ref_adapters]
```
For consistency and robustness, it would be good to also handle the case where self.adapters is a string, similar to how self.ref_adapters is handled. This ensures that if a single adapter path is passed as a string (e.g., from a config file), it is correctly converted to a list.
Suggested change:

```diff
+ if isinstance(self.adapters, str):
+     self.adapters = [self.adapters]
  if isinstance(self.ref_adapters, str):
      self.ref_adapters = [self.ref_adapters]
```
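As a usage note, the same normalization can be factored into a tiny helper; this is an illustrative sketch, not code from the PR:

```python
from typing import List, Optional, Union

def normalize_adapters(value: Union[str, List[str], None]) -> Optional[List[str]]:
    # Illustrative helper: coerce a single adapter path given as a string into a
    # one-element list, leaving lists (or None) untouched.
    return [value] if isinstance(value, str) else value

# e.g. in __post_init__:
#   self.adapters = normalize_adapters(self.adapters)
#   self.ref_adapters = normalize_adapters(self.ref_adapters)
```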
No description provided.