
[FlashCheckpoint] support EMA #9815

Open · wants to merge 3 commits into base: incubate/paddlenlp-fleety

Conversation

@Meiyim (Contributor) commented Jan 23, 2025

In FlashCheckpoint, EMA is applied directly to the model params and optimizer params, giving an asynchronous EMA. Specifically:

  • The whole process runs asynchronously on the CPU side; it does not block training and adds no extra overhead.
  • Non-pipeline networks are supported.
  • The EMA buffers are held by the FlashCheckpointWorker, so the previous EMA state can be loaded on resume.
  • EMA is applied only to the optimizer's master_weights and to model params with dtype == float32 (a sketch of the update follows the directory listing below).
  • The new EMA state is saved to ema.* files in the checkpoint directory; an example layout:
output/mini-4p5v-resume/checkpoint-10/
├── config.json
├── ema.tp00_pp00_shard00.pdopt
├── ema.tp00_pp00_shard01.pdopt
├── ema.tp00_pp01_shard00.pdopt
├── ema.tp00_pp01_shard01.pdopt
├── ema.tp01_pp00_shard00.pdopt
├── ema.tp01_pp00_shard01.pdopt
├── ema.tp01_pp01_shard00.pdopt
├── ema.tp01_pp01_shard01.pdopt
├── model_meta.json
├── model_state.tp00_pp00_shard00.pdparams
├── model_state.tp00_pp00_shard01.pdparams
├── model_state.tp00_pp01_shard00.pdparams
├── model_state.tp00_pp01_shard01.pdparams
├── model_state.tp01_pp00_shard00.pdparams
├── model_state.tp01_pp00_shard01.pdparams
├── model_state.tp01_pp01_shard00.pdparams
├── model_state.tp01_pp01_shard01.pdparams
├── optimizer.tp00_pp00_shard00.pdopt
├── optimizer.tp00_pp00_shard01.pdopt
├── optimizer.tp00_pp01_shard00.pdopt
├── optimizer.tp00_pp01_shard01.pdopt
├── optimizer.tp01_pp00_shard00.pdopt
├── optimizer.tp01_pp00_shard01.pdopt
├── optimizer.tp01_pp01_shard00.pdopt
├── optimizer.tp01_pp01_shard01.pdopt
├── saved_signal_0
├── saved_signal_1
├── saved_signal_2
├── saved_signal_3
├── saved_signal_4
├── saved_signal_5
├── saved_signal_6
├── saved_signal_7
├── scheduler.pdparams
├── trainer_state.json
└── training_args.bin
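
For intuition, here is a minimal NumPy sketch of the CPU-side update described above; it is illustrative only (`EmaWorkerSketch` and `ema_buffers` are made-up names, not the PR's actual code). It seeds a float32 buffer per param on first sight, then applies `ema = decay * ema + (1 - decay) * param`, skipping non-float32 tensors just as the PR restricts EMA to master_weights and float32 model params:

```python
import numpy as np

class EmaWorkerSketch:
    """Illustrative stand-in for the EMA buffers held by FlashCheckpointWorker."""

    def __init__(self, decay=0.999):
        self.decay = decay
        self.ema_buffers = {}  # param name -> float32 np.ndarray kept on the CPU

    def update(self, named_params):
        """Fold one parameter snapshot into the EMA buffers (float32 only)."""
        for name, value in named_params.items():
            if value.dtype != np.float32:
                continue  # bf16/fp16 params are skipped; master_weights are fp32
            buf = self.ema_buffers.get(name)
            if buf is None:
                self.ema_buffers[name] = value.copy()  # first snapshot seeds the buffer
            else:
                # In place: ema = decay * ema + (1 - decay) * param
                buf *= self.decay
                buf += (1.0 - self.decay) * value

worker = EmaWorkerSketch(decay=0.999)
worker.update({"linear.weight": np.ones((4, 4), dtype=np.float32)})
worker.update({"linear.weight": np.zeros((4, 4), dtype=np.float32)})
print(worker.ema_buffers["linear.weight"][0, 0])  # 0.999
```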

Known limitations:

  • After resharding to a different sharding degree, the previously saved EMA state cannot be loaded (see the naming sketch below).
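
The directory listing above shows why: each rank writes an EMA file with the tp/pp/shard indices baked into the name, so a run resumed with a different sharding degree has no matching ema.* files to map onto. A hypothetical helper (`ema_filename` is illustrative, not from the PR) makes the pattern concrete:

```python
def ema_filename(tp_rank: int, pp_rank: int, shard_rank: int) -> str:
    # The shard rank is part of the filename, so changing the sharding
    # degree changes the set of expected filenames on resume.
    return f"ema.tp{tp_rank:02d}_pp{pp_rank:02d}_shard{shard_rank:02d}.pdopt"

assert ema_filename(1, 0, 1) == "ema.tp01_pp00_shard01.pdopt"
```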


paddle-bot bot commented Jan 23, 2025

Thanks for your contribution!
