Description
Bug description
When I use a tokenizer whose vocabulary size is not divisible by the parallel (or world) size, the training loss becomes inconsistent after resuming from a checkpoint.
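For concreteness, a vocabulary of 50257 tokens (the GPT-2 tokenizer) is not divisible by a world size of 8. One common workaround in this situation is to pad the vocabulary up to the next multiple before sharding; the helper below is only a hypothetical sketch of that check, not part of any particular library.

```python
# Illustration only: check whether a tokenizer's vocab size divides evenly
# across the parallel (world) size, and pad it up to the next multiple.
# pad_vocab_to_multiple is a hypothetical helper, not a library API.

def pad_vocab_to_multiple(vocab_size: int, world_size: int) -> int:
    """Return the smallest size >= vocab_size that is divisible by world_size."""
    remainder = vocab_size % world_size
    return vocab_size if remainder == 0 else vocab_size + (world_size - remainder)

vocab_size = 50257  # e.g. the GPT-2 tokenizer
world_size = 8

print(vocab_size % world_size)                        # 1 -> the condition described above
print(pad_vocab_to_multiple(vocab_size, world_size))  # 50264 -> divisible by 8
```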
Versions
Can be reproduced with torch 2.6.
Reproduce:
Use any tokenizer whose vocabulary size is not divisible by the parallel (or world) size.
Train from step 0 to step 20, then resume from a checkpoint partway through and compare the loss against the uninterrupted run (a minimal sketch of this comparison is given below):
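As a sketch of the comparison itself (not of the parallel setup that triggers the bug), one can run 20 steps uninterrupted, then rerun with a checkpoint at step 10, restore, and finish the remaining steps; in a correct setup the per-step losses match exactly. The model, data, and hyperparameters below are placeholders chosen only for illustration.

```python
# Minimal single-process sketch of the loss-consistency check: 20 uninterrupted
# steps vs. a run that checkpoints at step 10 and resumes. No tensor parallelism.
import torch
import torch.nn as nn

VOCAB_SIZE = 50257  # not divisible by a typical world size of 8
torch.manual_seed(0)
data = torch.randint(0, VOCAB_SIZE, (20, 8, 16))  # 20 steps of (batch=8, seq=16)

def make_model():
    torch.manual_seed(0)  # identical initialization for every run
    return nn.Sequential(
        nn.Embedding(VOCAB_SIZE, 32),
        nn.Flatten(1),
        nn.Linear(32 * 16, VOCAB_SIZE),
    )

def run(steps, model, opt, start=0, losses=None):
    losses = [] if losses is None else losses
    loss_fn = nn.CrossEntropyLoss()
    for step in range(start, start + steps):
        tokens = data[step]
        logits = model(tokens)
        loss = loss_fn(logits, tokens[:, 0])  # toy objective: predict the first token
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Reference run: 20 uninterrupted steps.
model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ref = run(20, model, opt)

# Interrupted run: 10 steps, checkpoint, restore, then 10 more steps.
model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
losses = run(10, model, opt)
ckpt = {"model": model.state_dict(), "opt": opt.state_dict()}

model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
resumed = run(10, model, opt, start=10, losses=losses)

# In a correct setup the two loss curves match step for step.
print(all(abs(a - b) < 1e-6 for a, b in zip(ref, resumed)))
```

With the vocabulary sharded across ranks as described in the bug description, the resumed losses reportedly stop matching the uninterrupted run.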

