Description
Bug description
When I use a tokenizer whose vocabulary size is not divisible by the parallel (or world) size, the training loss becomes inconsistent after resuming from a checkpoint.
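For concreteness, a vocabulary of 50257 tokens (the GPT-2 tokenizer) is not divisible by a world size of 8. One common workaround in this situation is to pad the vocabulary up to the next multiple before sharding; the helper below is only a hypothetical sketch of that check, not part of any particular library.

```python
# Illustration only: check whether a tokenizer's vocab size divides evenly
# across the parallel (world) size, and pad it up to the next multiple.
# pad_vocab_to_multiple is a hypothetical helper, not a library API.

def pad_vocab_to_multiple(vocab_size: int, world_size: int) -> int:
    """Return the smallest size >= vocab_size that is divisible by world_size."""
    remainder = vocab_size % world_size
    return vocab_size if remainder == 0 else vocab_size + (world_size - remainder)

vocab_size = 50257  # e.g. the GPT-2 tokenizer
world_size = 8

print(vocab_size % world_size)                        # 1 -> the condition described above
print(pad_vocab_to_multiple(vocab_size, world_size))  # 50264 -> divisible by 8
```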
Versions
Can be reproduced with torch 2.6.
Reproduce:
Use any tokenizer whose vocabulary size is not divisible by the parallel (or world) size.
Train from step 0 to step 20, then resume from a checkpoint partway through and compare the loss against the uninterrupted run (a minimal sketch of this comparison is given below):
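As a sketch of the comparison itself (not of the parallel setup that triggers the bug), one can run 20 steps uninterrupted, then rerun with a checkpoint at step 10, restore, and finish the remaining steps; in a correct setup the per-step losses match exactly. The model, data, and hyperparameters below are placeholders chosen only for illustration.

```python
# Minimal single-process sketch of the loss-consistency check: 20 uninterrupted
# steps vs. a run that checkpoints at step 10 and resumes. No tensor parallelism.
import torch
import torch.nn as nn

VOCAB_SIZE = 50257  # not divisible by a typical world size of 8
torch.manual_seed(0)
data = torch.randint(0, VOCAB_SIZE, (20, 8, 16))  # 20 steps of (batch=8, seq=16)

def make_model():
    torch.manual_seed(0)  # identical initialization for every run
    return nn.Sequential(
        nn.Embedding(VOCAB_SIZE, 32),
        nn.Flatten(1),
        nn.Linear(32 * 16, VOCAB_SIZE),
    )

def run(steps, model, opt, start=0, losses=None):
    losses = [] if losses is None else losses
    loss_fn = nn.CrossEntropyLoss()
    for step in range(start, start + steps):
        tokens = data[step]
        logits = model(tokens)
        loss = loss_fn(logits, tokens[:, 0])  # toy objective: predict the first token
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Reference run: 20 uninterrupted steps.
model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ref = run(20, model, opt)

# Interrupted run: 10 steps, checkpoint, restore, then 10 more steps.
model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
losses = run(10, model, opt)
ckpt = {"model": model.state_dict(), "opt": opt.state_dict()}

model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
resumed = run(10, model, opt, start=10, losses=losses)

# In a correct setup the two loss curves match step for step.
print(all(abs(a - b) < 1e-6 for a, b in zip(ref, resumed)))
```

With the vocabulary sharded across ranks as described in the bug description, the resumed losses reportedly stop matching the uninterrupted run.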

