Conversation

@HAOCHENYE
Collaborator

`xtuner.v1.model.BaseModel.save_hf` uses `ProcessPoolExecutor` to submit multiple save tasks and speed up saving. However, `ProcessPoolExecutor.submit` is non-blocking, so the caller keeps queuing tasks without waiting for earlier ones to finish. The CPU tensors held by the queued tasks accumulate and can cause a CPU OOM.
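
The actual change made in this PR is not shown in this conversation. As a rough illustration of one common way to avoid this kind of buildup, the sketch below caps the number of in-flight tasks and blocks the producer until a slot frees up. The helper name `submit_bounded` and the `max_in_flight` parameter are hypothetical and are not part of xtuner.

```python
from concurrent.futures import FIRST_COMPLETED, Executor, Future, wait
from typing import Any, Callable, Iterable, List, Set


def submit_bounded(
    executor: Executor,
    fn: Callable[[Any], Any],
    items: Iterable[Any],
    max_in_flight: int,
) -> List[Any]:
    """Run fn(item) on the executor with at most max_in_flight unfinished tasks.

    Hypothetical helper for illustration only. `items` should be a lazy
    iterable (e.g. a generator) so each payload is only materialized when
    there is room for it. Results are returned in completion order.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")

    pending: Set[Future] = set()
    results: List[Any] = []

    for item in items:
        pending.add(executor.submit(fn, item))
        # Once the cap is reached, block until at least one task finishes
        # before pulling the next item from the iterable.
        if len(pending) >= max_in_flight:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)

    # Drain whatever is still running.
    done, _ = wait(pending)
    results.extend(f.result() for f in done)
    return results
```

The helper works with any `concurrent.futures.Executor`, including `ProcessPoolExecutor`. With a process pool, `fn` and each item must be picklable, and the pool should be created under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.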

@HAOCHENYE force-pushed the yehc/fix-save-ckpt-cpuoom branch 4 times, most recently from 3918f7c to 57dd84b on November 10, 2025 08:50
`xtuner.v1.model.BaseModel.save_hf` uses `ProcessPoolExecutor` to submit multiple save tasks and speed up saving. However, `ProcessPoolExecutor.submit` is non-blocking, so queued tasks accumulate many CPU tensors and can cause a CPU OOM.
@HAOCHENYE force-pushed the yehc/fix-save-ckpt-cpuoom branch from 57dd84b to 76e7bdc on November 11, 2025 07:40
@HAOCHENYE merged commit 9c8b388 into main on Nov 11, 2025
5 of 6 checks passed
@HAOCHENYE deleted the yehc/fix-save-ckpt-cpuoom branch on November 11, 2025 07:45
3 participants