[VLM] Add token-imbalance loss #1803
Conversation
Force-pushed from 0ec62b1 to 9015c60
ignore_index=IGNORE_INDEX,
)
num_tokens = (labels != IGNORE_INDEX).sum()
avg_num_tokens_per_rank = dist_mean(num_tokens, token_mesh)
For this dist_mean call, it seems it'll trigger a GPU/CPU sync:
return funcol.all_reduce(x, reduceOp=reduceOp, group=mesh).item()
I think this will potentially bring unnecessary perf issues, as without it the CPU can stay way ahead of the GPU.
I'd recommend refactoring the .item() call out of _dist_reduce and putting it into the callsites. Alternatively, you can directly call funcol.all_reduce() here (rough sketch below).
cc @fegin: this won't work with FT, as the ft_pg is not visible to this loss function.
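A minimal sketch of the suggested refactor, assuming helper names that mirror torchtitan's distributed utils (`_dist_reduce`, `dist_mean`): the reduction stays a tensor, and callsites that actually need a Python scalar call `.item()` themselves, as late as possible.

```python
import torch
import torch.distributed._functional_collectives as funcol
from torch.distributed.device_mesh import DeviceMesh


def _dist_reduce(x: torch.Tensor, reduceOp: str, mesh: DeviceMesh) -> torch.Tensor:
    # Return the reduced tensor directly; no .item(), so no GPU/CPU sync here.
    return funcol.all_reduce(x, reduceOp=reduceOp, group=mesh)


def dist_mean(x: torch.Tensor, mesh: DeviceMesh) -> torch.Tensor:
    return _dist_reduce(x, reduceOp="avg", mesh=mesh)


# Callsite example: sync only where a Python float is actually needed,
# e.g. when logging metrics.
# avg_loss = dist_mean(loss.detach(), mesh).item()
```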
I added an extra process group mimicking the distributed utils. Hope that makes sense.
)
num_tokens = (labels != IGNORE_INDEX).sum()
avg_num_tokens_per_rank = dist_mean(num_tokens, token_mesh)
return sum_loss / avg_num_tokens_per_rank
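For context, here is a self-contained sketch of what the snippet above computes (identifiers like `token_imbalance_ce_loss` and `token_mesh` are assumptions for illustration, not necessarily the exact names in the PR): sum the per-token cross entropy locally, then divide by the average token count across the data-parallel mesh so that FSDP's implicit gradient averaging recovers the global per-token mean.

```python
import torch
import torch.distributed._functional_collectives as funcol
import torch.nn.functional as F

IGNORE_INDEX = -100


def dist_mean(x: torch.Tensor, mesh) -> torch.Tensor:
    # All-reduce average across the (1-D) token mesh; result stays a tensor.
    return funcol.all_reduce(x, reduceOp="avg", group=mesh)


def token_imbalance_ce_loss(
    pred: torch.Tensor,    # [batch, seq, vocab] logits
    labels: torch.Tensor,  # [batch, seq] targets, IGNORE_INDEX for padding
    token_mesh,            # device mesh spanning the data-parallel ranks
) -> torch.Tensor:
    # Per-rank *sum* (not mean) of the per-token cross entropy.
    sum_loss = F.cross_entropy(
        pred.flatten(0, 1).float(),
        labels.flatten(0, 1),
        reduction="sum",
        ignore_index=IGNORE_INDEX,
    )
    # Tokens that actually participate in the loss on this rank
    # (cast to float so the averaged count divides cleanly).
    num_tokens = (labels != IGNORE_INDEX).sum().float()
    # Average token count across ranks: the one extra comm this PR adds.
    avg_num_tokens_per_rank = dist_mean(num_tokens, token_mesh)
    # Dividing by the average per-rank count (N_total / R) makes FSDP's
    # gradient averaging equal the global per-token mean.
    return sum_loss / avg_num_tokens_per_rank
```

The only extra communication relative to the standard mean-reduced cross entropy is the single all-reduce on the token count.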
This is a cute & "mostly correct" way to deal with the imbalanced-token loss issue. However,
- it's not the most readable one
- moreover, I don't think it is correct if gradient accumulation is enabled, as each microbatch can have a different `avg_num_tokens`

I think the best way would be to:
- not let FSDP do implicit gradient division
- always run cross entropy with `reduction="sum"`
- let the data loader / trainer count the number of tokens involved in the loss computation, e.g. by explicitly doing `num_tokens = (labels != IGNORE_INDEX).sum()` on each rank (I agree that without imbalance we wouldn't need this count or the accompanying communication)

This way we also wouldn't need this ad hoc call: https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/infra/parallelize.py#L363
We don't need to do this refactor for now, but it would be good if you could leave a TODO item here and file an issue; a rough sketch of the alternative is below.
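A rough sketch of that alternative, under the stated assumptions (FSDP configured to sum rather than average gradients, e.g. via a gradient divide factor of 1; helper names are illustrative): the loss is a plain sum, and the trainer normalizes once per optimizer step, which stays correct under gradient accumulation because the division happens after all microbatches have been accumulated.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

IGNORE_INDEX = -100


def sum_ce_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Per-rank sum of per-token losses; no division anywhere in the loss.
    return F.cross_entropy(
        pred.flatten(0, 1).float(),
        labels.flatten(0, 1),
        reduction="sum",
        ignore_index=IGNORE_INDEX,
    )


def normalize_grads_by_global_token_count(model, microbatch_labels, dp_group=None):
    # Called once per optimizer step, after all microbatch backwards.
    # Count tokens involved in the loss across this rank's microbatches.
    local_tokens = sum(
        (labels != IGNORE_INDEX).sum() for labels in microbatch_labels
    ).float()
    # Total tokens across all data-parallel ranks for this step.
    dist.all_reduce(local_tokens, group=dp_group)
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(local_tokens)
```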
I filed #1842
from torchtitan.tools.logging import logger


IGNORE_INDEX = -100
Let's avoid defining IGNORE_INDEX in two places when it actually needs to be shared. The other appearance is in torchtitan/experiments/vlm/datasets/mm_collator_nld.py
I kept the definition in loss.py, as that makes the most sense, and import it into the dataloader files.
Actually, I removed the other PR #1802 from this PR, since they are fairly independent, to make this one cleaner to merge.
I will update the other PR to import IGNORE_INDEX depending on which one lands first.
each rank computes the loss over **only its local tokens** and returns an
*average* over those tokens:

Afterwards, when Fully‑Sharded Data Parallel (FSDP) averages the gradients
Thanks for writing up the docstring, looks very good.
Force-pushed from 8eb28c1 to 31dd8d9
LGTM
In VLM interleaved training with native resolution and aspect ratio, the number of tokens participating in the loss computation differs per rank. Naive FSDP gradient averaging across data ranks then causes tokens on ranks with fewer valid tokens to contribute more to the loss than tokens on other ranks.
This PR addresses this via loss balancing, which incurs an additional comm in the loss computation.
In practice, I haven't noticed any impact from this comm.
Quick sanity check
Let each rank $i$ have a sum loss over its $N_i$ tokens, $L_i = \sum_{j=1}^{N_i}\ell_{ij}$, with gradient $g_i = \sum_{j=1}^{N_i}\nabla\ell_{ij}$.
If we multiply the *loss* on each rank by a constant factor $c$ (the same for all ranks), then after `backward()`:
$$ \tilde g_i = c \cdot g_i . $$
FSDP will *average* these gradients across ranks:
$$ g_{\text{FSDP}}=\frac{1}{R}\sum_{i=1}^{R} \tilde g_i =\frac{c}{R}\sum_{i=1}^{R} g_i . $$
We want this to equal the **global‑sample average**:
$$ g_{\text{true}} =\frac{1}{N_{\text{total}}}\sum_{i=1}^{R}\sum_{j=1}^{N_i}\nabla \ell_{ij} =\frac{1}{N_{\text{total}}}\sum_{i=1}^{R} g_i . $$
Thus, for the FSDP gradient to be correct, we need
$$ \frac{c}{R}= \frac{1}{N_{\text{total}}}\quad\Longrightarrow\quad c=\frac{R}{N_{\text{total}}} . $$
So the *right* scaling factor is $R/N_{\text{total}}$, which means dividing the per-rank sum loss by $N_{\text{total}}/R$, the **average number of tokens per rank**.
Intuitively, this is the same as the default cross-entropy loss, except that instead of dividing the sum loss on a rank by the number of tokens **on that rank**, we divide by the **average number of tokens across all ranks**.
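As a concrete check with illustrative numbers: take $R = 2$ ranks with $N_1 = 10$ and $N_2 = 30$ tokens, so $N_{\text{total}} = 40$ and the average per-rank count is $20$. Dividing each rank's sum loss by $20$ and letting FSDP average the gradients gives $\tfrac{1}{2}(g_1/20 + g_2/20) = (g_1 + g_2)/40$, exactly the per-token mean over all $40$ tokens. With the naive per-rank mean ($L_1/10$ and $L_2/30$), FSDP averaging yields $g_1/20 + g_2/60$, so each token on rank 1 would be weighted three times as heavily as a token on rank 2.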
P/s: sorry this PR is based on #1802 but I couldn't choose that as the base branch. Maybe it will be easier to review once that PR is merged.