Is your feature request related to a problem? Please describe.
ChatLearn currently uses a conversion tool to convert checkpoints when it detects that the parallel strategies differ: https://github.com/alibaba/ChatLearn/blob/main/chatlearn/utils/megatron_utils.py#L164
Megatron-core dist_checkpointing already handles this conversion online: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html
Describe the solution you'd like
Use Megatron-core dist_checkpointing to save and load checkpoints.
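To illustrate why a sharded checkpoint format removes the need for a separate conversion step, here is a toy sketch in plain Python. It is not the Megatron-core API: `save_sharded`, `load_sharded`, and `shard_bounds` are hypothetical helpers, and the even row-wise split is an assumption made for brevity. The idea it shows is that each shard is stored with its global offset, so a reader with a different parallel size can assemble the slice it needs at load time.

```python
from typing import Dict, List, Tuple

Rows = List[List[float]]
# Toy "checkpoint": tensor key -> list of (global_row_offset, rows) shards.
Checkpoint = Dict[str, List[Tuple[int, Rows]]]


def shard_bounds(num_rows: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) row range owned by `rank` (assumes an even split)."""
    assert num_rows % world_size == 0, "toy example assumes an even split"
    per_rank = num_rows // world_size
    return rank * per_rank, (rank + 1) * per_rank


def save_sharded(key: str, full: Rows, world_size: int) -> Checkpoint:
    """Store each rank's rows together with their global row offset."""
    shards = []
    for rank in range(world_size):
        start, end = shard_bounds(len(full), world_size, rank)
        shards.append((start, full[start:end]))
    return {key: shards}


def load_sharded(ckpt: Checkpoint, key: str, num_rows: int,
                 world_size: int, rank: int) -> Rows:
    """Assemble the rows `rank` needs under a possibly different world size."""
    start, end = shard_bounds(num_rows, world_size, rank)
    out: Rows = []
    for offset, rows in sorted(ckpt[key], key=lambda shard: shard[0]):
        # Intersect the saved shard's range with the range this rank needs.
        lo = max(start, offset)
        hi = min(end, offset + len(rows))
        if lo < hi:
            out.extend(rows[lo - offset:hi - offset])
    return out


# Save with 4-way sharding, then load with 2-way sharding.
full = [[float(i)] * 2 for i in range(8)]
ckpt = save_sharded("weight", full, world_size=4)
loaded = [load_sharded(ckpt, "weight", 8, world_size=2, rank=r) for r in range(2)]
```

In this sketch the layout saved with four ranks is read back with two ranks without any intermediate conversion, which is the behavior this request hopes to get from dist_checkpointing.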