
[FlashCheckpoint] support EMA #9815

Open · wants to merge 3 commits into base: incubate/paddlenlp-fleety

Conversation

@Meiyim (Contributor) commented Jan 23, 2025

In FlashCheckpoint, EMA is applied directly to the model params and optimizer params, giving an asynchronous EMA. Specifically:

  • The whole process runs asynchronously on the CPU side; it does not block training and adds no extra overhead.
  • Non-pipeline networks are supported.
  • The EMA buffers are held by the FlashCheckpointWorker, so the previous EMA state can be loaded on resume.
  • EMA is applied only to the optimizer's master_weights and to model params with dtype == float32 (a sketch of the update follows the directory listing below).
  • The new EMA state is saved to ema.* files in the checkpoint directory; an example layout:
output/mini-4p5v-resume/checkpoint-10/
├── config.json
├── ema.tp00_pp00_shard00.pdopt
├── ema.tp00_pp00_shard01.pdopt
├── ema.tp00_pp01_shard00.pdopt
├── ema.tp00_pp01_shard01.pdopt
├── ema.tp01_pp00_shard00.pdopt
├── ema.tp01_pp00_shard01.pdopt
├── ema.tp01_pp01_shard00.pdopt
├── ema.tp01_pp01_shard01.pdopt
├── model_meta.json
├── model_state.tp00_pp00_shard00.pdparams
├── model_state.tp00_pp00_shard01.pdparams
├── model_state.tp00_pp01_shard00.pdparams
├── model_state.tp00_pp01_shard01.pdparams
├── model_state.tp01_pp00_shard00.pdparams
├── model_state.tp01_pp00_shard01.pdparams
├── model_state.tp01_pp01_shard00.pdparams
├── model_state.tp01_pp01_shard01.pdparams
├── optimizer.tp00_pp00_shard00.pdopt
├── optimizer.tp00_pp00_shard01.pdopt
├── optimizer.tp00_pp01_shard00.pdopt
├── optimizer.tp00_pp01_shard01.pdopt
├── optimizer.tp01_pp00_shard00.pdopt
├── optimizer.tp01_pp00_shard01.pdopt
├── optimizer.tp01_pp01_shard00.pdopt
├── optimizer.tp01_pp01_shard01.pdopt
├── saved_signal_0
├── saved_signal_1
├── saved_signal_2
├── saved_signal_3
├── saved_signal_4
├── saved_signal_5
├── saved_signal_6
├── saved_signal_7
├── scheduler.pdparams
├── trainer_state.json
└── training_args.bin
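
For intuition, here is a minimal NumPy sketch of the CPU-side update described above; it is illustrative only (`EmaWorkerSketch` and `ema_buffers` are made-up names, not the PR's actual code). It seeds a float32 buffer per param on first sight, then applies `ema = decay * ema + (1 - decay) * param`, skipping non-float32 tensors just as the PR restricts EMA to master_weights and float32 model params:

```python
import numpy as np

class EmaWorkerSketch:
    """Illustrative stand-in for the EMA buffers held by FlashCheckpointWorker."""

    def __init__(self, decay=0.999):
        self.decay = decay
        self.ema_buffers = {}  # param name -> float32 np.ndarray kept on the CPU

    def update(self, named_params):
        """Fold one parameter snapshot into the EMA buffers (float32 only)."""
        for name, value in named_params.items():
            if value.dtype != np.float32:
                continue  # bf16/fp16 params are skipped; master_weights are fp32
            buf = self.ema_buffers.get(name)
            if buf is None:
                self.ema_buffers[name] = value.copy()  # first snapshot seeds the buffer
            else:
                # In place: ema = decay * ema + (1 - decay) * param
                buf *= self.decay
                buf += (1.0 - self.decay) * value

worker = EmaWorkerSketch(decay=0.999)
worker.update({"linear.weight": np.ones((4, 4), dtype=np.float32)})
worker.update({"linear.weight": np.zeros((4, 4), dtype=np.float32)})
print(worker.ema_buffers["linear.weight"][0, 0])  # 0.999
```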

Known limitations:

  • After resharding to a different sharding degree, the previously saved EMA state cannot be loaded (see the naming sketch below).
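
The directory listing above shows why: each rank writes an EMA file with the tp/pp/shard indices baked into the name, so a run resumed with a different sharding degree has no matching ema.* files to map onto. A hypothetical helper (`ema_filename` is illustrative, not from the PR) makes the pattern concrete:

```python
def ema_filename(tp_rank: int, pp_rank: int, shard_rank: int) -> str:
    # The shard rank is part of the filename, so changing the sharding
    # degree changes the set of expected filenames on resume.
    return f"ema.tp{tp_rank:02d}_pp{pp_rank:02d}_shard{shard_rank:02d}.pdopt"

assert ema_filename(1, 0, 1) == "ema.tp01_pp00_shard01.pdopt"
```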


paddle-bot bot commented Jan 23, 2025

Thanks for your contribution!
