[Feature] Use Megatron-core dist_checkpointing to load checkpoint with different parallel strategies #169

Open
SeaOfOcean opened this issue Dec 5, 2024 · 0 comments

Is your feature request related to a problem? Please describe.
Currently, ChatLearn uses an offline tool to convert checkpoints when different parallel strategies are detected: https://github.com/alibaba/ChatLearn/blob/main/chatlearn/utils/megatron_utils.py#L164

Online conversion is already handled by Megatron-core's dist_checkpointing module: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html

Describe the solution you'd like
Use Megatron-core dist_checkpointing to save and load checkpoints, so that a checkpoint saved under one parallel strategy (e.g., TP/PP layout) can be loaded under another without a separate offline conversion step. A minimal sketch of the flow is shown below.
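For reference, a minimal sketch of what the save/load path could look like, assuming a Megatron-core model that exposes `sharded_state_dict()` (as Megatron-core modules do); the checkpoint directory and function names are illustrative, following the dist_checkpointing docs linked above:

```python
# Sketch only: CKPT_DIR is a hypothetical path, and `model` is assumed to be
# a Megatron-core module (e.g. GPTModel) providing sharded_state_dict().
from megatron.core import dist_checkpointing

CKPT_DIR = "/path/to/dist_ckpt"  # hypothetical checkpoint location


def save_dist_checkpoint(model):
    # Each rank describes its local shards; dist_checkpointing writes a
    # parallelism-agnostic checkpoint on disk.
    sharded_state_dict = model.sharded_state_dict()
    dist_checkpointing.save(sharded_state_dict, CKPT_DIR)


def load_dist_checkpoint(model):
    # The same checkpoint can be loaded under a different parallel layout:
    # the loader reshards saved tensors to match the current ranks'
    # sharded_state_dict, replacing the offline conversion tool.
    sharded_state_dict = model.sharded_state_dict()
    loaded_state_dict = dist_checkpointing.load(sharded_state_dict, CKPT_DIR)
    model.load_state_dict(loaded_state_dict)
```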
