
Conversation

githubsgi
Contributor

Incorporating input from conversation in pytorch#1761
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 7, 2025
@githubsgi
Contributor Author

@tianyu-l, please review.

@githubsgi changed the title from "Second version of degub/deterinistic configs." to "Second version of degub/deterministic configs." on Oct 7, 2025
Contributor

@tianyu-l left a comment

Thanks! Left some comments.

@tianyu-l linked an issue on Oct 9, 2025 that may be closed by this pull request
Contributor

@tianyu-l left a comment


Comment on lines 14 to 20

try:
    import intel_extension_for_pytorch as ipex  # noqa: F401
    print("IPEX found, hence using IPEX")
except ImportError:
    print("IPEX not found, hence not using IPEX")

Contributor


Sorry what's this for? Could we remove it?

Contributor


Also, please rebase to resolve the conflict.

wwwjn and others added 12 commits October 13, 2025 12:54
…ch#1804)

## Benchmarking
Step | Time | Log
-- | -- | --
to_hf() | 0.1103s | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed to_hf conversion, generated 189 keys, duration: 0.1103s
Split local GroupedExperts DTensor into individual experts' weights | 0.008s per layer per matrix (58 MoE layers * 3 weight matrices per layer) | [trainer0\|0]:[titan] 2025-10-03 17:07:45,697 - root - INFO - Completed _get_local_experts_weights for layer 6, abstract_key: model.layers.{}.mlp.experts.{}.up_proj.weight, duration: 0.0082s
dcp.load(), thread count = 4 | 193.20s | [trainer0\|0]:[titan] 2025-10-03 17:10:58,899 - root - INFO - dcp.load with HuggingFaceStorageReader completed in 193.20 seconds
from_hf() | 0.48s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,378 - root - INFO - Completed from_hf conversion, processed 189 keys, duration: 0.4787s
Concatenate individual experts' weights into GroupedExperts weight | 0.01s per layer per matrix (58 MoE layers * 3 weight matrices) | [trainer0\|0]:[titan] 2025-10-03 17:10:59,120 - root - INFO - Completed _concatenate_expert_weights_dtensor for layer 5, abstract_key: layers.{}.moe.experts.w2, duration: 0.0142s
Total | 193.87s | [trainer0\|0]:[titan] 2025-10-03 17:10:59,458 - root - INFO - Finished loading the checkpoint in 193.87 seconds.
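
For context, the sketch below shows roughly how the timed steps fit together, assuming the `HuggingFaceStorageReader` from recent `torch.distributed.checkpoint`; the `to_hf` / `from_hf` key-mapping callables are stand-ins for the torchtitan conversions benchmarked above, not the actual implementation.

```python
# Rough sketch only; to_hf/from_hf are passed in as stand-ins for the
# torchtitan key-mapping steps, and HuggingFaceStorageReader is assumed to
# be the reader shipped with recent torch.distributed.checkpoint.
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import HuggingFaceStorageReader

def load_hf_checkpoint(model, hf_checkpoint_dir, to_hf, from_hf):
    # 1. map the titan state dict to HF keys (timed as to_hf() above)
    hf_state_dict = to_hf(model.state_dict())
    # 2. stream the HF shards into those tensors (timed as dcp.load() above)
    dcp.load(hf_state_dict,
             storage_reader=HuggingFaceStorageReader(path=hf_checkpoint_dir))
    # 3. map back to titan keys and load into the model (timed as from_hf())
    model.load_state_dict(from_hf(hf_state_dict))
```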

## End-to-End verification for 671B model
Parallelism: FSDP=32, PP=8, 1F1B, EP=32

<img width="393" height="421" alt="Screenshot 2025-10-06 at 8 32 37 PM"
src="https://github.com/user-attachments/assets/6d8dab00-a188-4c57-8348-02bae1d21d03"
/>
<img width="393" height="421" alt="Screenshot 2025-10-06 at 8 32 54 PM"
src="https://github.com/user-attachments/assets/a730f71b-3dc8-45e0-8d3e-b21080884f8d"
/>
…h#1808)

With max-autotune, FlexAttention is not deterministic even if
torch.use_deterministic_algorithms is True. When deterministic mode is
set, we should also remove the usage of `max-autotune`.
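
In practice this amounts to something like the following minimal sketch; the flag name and wiring are assumptions, not torchtitan's actual config.

```python
# Sketch only: gate FlexAttention's compile mode on a deterministic flag.
# `deterministic` stands in for whatever config drives
# torch.use_deterministic_algorithms; it is not torchtitan's real option name.
import torch
from torch.nn.attention.flex_attention import flex_attention

def build_flex_attention(deterministic: bool):
    if deterministic:
        torch.use_deterministic_algorithms(True)
        compile_mode = "default"  # max-autotune can pick non-deterministic kernels
    else:
        compile_mode = "max-autotune-no-cudagraphs"
    return torch.compile(flex_attention, mode=compile_mode)
```
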
Summary:
allow users to specify the profiler schedule

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1809).
* pytorch#1811
* pytorch#1810
* pytorch#1812
* __->__ pytorch#1809

Co-authored-by: Tushar Jain <[email protected]>
This PR is a follow-up to the SimpleFSDP+EP
[PR](pytorch#1529). Here, we add a
`gradient_divide_factor` following FSDP2 to ensure modules wrapped by
(FSDP+EP) have the correct gradient reduction value.

- The original FSDP2 implementation is in this
[PR](pytorch#1551).
- The `gradient_divide_factor` logic is
[here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py#L688)

We have two ways of handling `gradient_divide_factor` in
`reduce_scatter`:

1. The first one is to use `ReduceOp.PREMUL_SUM` to handle the
`gradient_divide_factor`. However, DTensor's `_reduce_shard_value` only
accepts `reduce_op` as a str input
([here](https://github.com/pytorch/pytorch/blob/8f705d019a64b1ca882e043b3eb98559273a9e59/torch/distributed/tensor/placement_types.py#L177-L210)).

To make `_reduce_shard_value` work correctly with ReduceOp.PREMUL_SUM,
we need to update the DTensor `_reduce_shard_tensor` and
`torch.distributed._functional_collectives.reduce_scatter_tensor` so
that it can pass the factor associated with ReduceOp.PREMUL_SUM as an
input.



2. Another way is to simulate `ReduceOp.PREMUL_SUM` with `ReduceOp.SUM`.
The logic is in this [Diff](https://www.internalfb.com/diff/D76546536).
It does a `div_` over the gradient before performing `ReduceOp.SUM`.

Currently I'm following option 2 since it requires fewer changes to
`_functional_collectives`.
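
A minimal sketch of option 2, under assumed tensor shapes and names (not the actual SimpleFSDP code): pre-divide the local gradient by the factor, then reduce-scatter with a plain SUM, which matches what ReduceOp.PREMUL_SUM would compute.

```python
# Sketch of option 2 (names and shapes are assumptions, not SimpleFSDP's code):
# pre-divide locally, then SUM-reduce-scatter, emulating ReduceOp.PREMUL_SUM.
import torch
import torch.distributed as dist

def reduce_scatter_with_divide_factor(grad: torch.Tensor,
                                      gradient_divide_factor: float,
                                      group=None) -> torch.Tensor:
    world_size = dist.get_world_size(group)
    # the div_ step described above, done before the collective
    grad = grad / gradient_divide_factor
    # assumes grad.shape[0] is divisible by world_size (FSDP pads shards)
    out = torch.empty(grad.shape[0] // world_size, *grad.shape[1:],
                      dtype=grad.dtype, device=grad.device)
    dist.reduce_scatter_tensor(out, grad, op=dist.ReduceOp.SUM, group=group)
    return out
```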


After enabling `reduction_divide_factor`, we see that FSDP (=2) + EP (=4)
has identical loss:

<img width="1194" height="780" alt="Screenshot 2025-10-08 at 5 27 24 PM"
src="https://github.com/user-attachments/assets/aaf83109-8db8-4051-973d-c7b6950513de"
/>
Llama 3.1 models use scaled RoPE by default, and Llama 4 17B x 16E uses
scaled RoPE while 17B x 128E does not.

1. Verified forward parity between Titan Llama 3.1 8B and HuggingFace
Llama 3.1 8B. The KL divergence of outputs from the same sample inputs
is small.
![llama 3 8b forward parity
small](https://github.com/user-attachments/assets/891df89b-006f-4ed0-a68a-36e939d2169b)

For comparison, before adding scaled RoPE support, the forward parity
check on the Llama 3.1 8B model incurred a slightly larger KL divergence
on sample inputs.
![llama 3 8b forward parity without scaled
rope](https://github.com/user-attachments/assets/9a68357a-34d4-497f-977f-27cc548d8f62)

2. Verified training of Llama 3.1 8B with tensor parallel degree = 4.
![llama 3-1 8b training
tp=4](https://github.com/user-attachments/assets/a8b1ab10-0da0-4d02-afbb-a775716beaa3)

3. Verified training of Llama 4 debug model with scaled RoPE.
![llama 4 debug model
training](https://github.com/user-attachments/assets/1fbf8939-31a5-475f-987c-d5bcf6d2376b)
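
For reference, a sketch of the scaled-RoPE frequency adjustment exercised here, following the published Llama 3.1 recipe; the constants are the Llama 3.1 defaults and the function name is illustrative, not necessarily torchtitan's.

```python
# Sketch of Llama 3.1-style RoPE frequency scaling (published recipe;
# default constants shown, not necessarily torchtitan's config surface).
import math
import torch

def apply_rope_scaling(freqs: torch.Tensor,
                       scale_factor: float = 8.0,
                       low_freq_factor: float = 1.0,
                       high_freq_factor: float = 4.0,
                       original_max_seq_len: int = 8192) -> torch.Tensor:
    low_freq_wavelen = original_max_seq_len / low_freq_factor
    high_freq_wavelen = original_max_seq_len / high_freq_factor
    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:          # high-frequency band: keep as-is
            new_freqs.append(freq)
        elif wavelen > low_freq_wavelen:         # low-frequency band: fully scaled
            new_freqs.append(freq / scale_factor)
        else:                                    # smooth interpolation in between
            smooth = (original_max_seq_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.stack(new_freqs).to(freqs)
```
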
forge's doc build is failing with formatting issues that seem
to come from the torchtitan docstrings:
```
docstring of torchtitan.config.job_config.Parallelism.fsdp_reshard_after_forward:7: ERROR: Unexpected indentation.
docstring of torchtitan.config.job_config.Parallelism.fsdp_reshard_after_forward:8: WARNING: Block quote ends without a blank line; unexpected unindent.
docstring of torchtitan.config.job_config.Parallelism.expert_parallel_degree:4: ERROR: Unexpected indentation.
docstring of torchtitan.config.job_config.Parallelism.expert_parallel_degree:7: WARNING: Block quote ends without a blank line; unexpected unindent.
docstring of torchtitan.config.job_config.Parallelism.expert_parallel_degree:11: WARNING: Bullet list ends without a blank line; unexpected unindent.
docstring of torchtitan.config.job_config.Checkpoint.async_mode:5: ERROR: Unexpected indentation.
```
Failing
[job](https://github.com/meta-pytorch/forge/actions/runs/18360538773/job/52303073438?pr=336#step:11:73).
This PR fixes those minor formatting issues.
Fix the number-of-layers issue introduced by pytorch#1804
In VLM interleaved training, with native resolution and aspect ratio,
the number of tokens participating in the loss computation differs per rank.
Naive FSDP gradient averaging across data ranks can cause tokens on
ranks with fewer valid tokens to contribute more to the loss than tokens on
other ranks.
This PR addresses this via loss balancing, which incurs an additional comm
in the loss computation.
In practice, I haven't noticed any impact from this comm.

#### Quick sanity check
Let the sum loss of all tokens on rank $i$, with $N_i$ tokens, be
$L_i = \sum_{j=1}^{N_i}\ell_{ij}$, with gradient
$g_i = \sum_{j=1}^{N_i}\nabla\ell_{ij}$.

If we multiply the *loss* on each rank by a constant factor **c** (the
same for all ranks), then after `backward()`:

$$
\tilde g_i = c \cdot g_i .
$$

FSDP will *average* these gradients across ranks:

$$
g_{\text{FSDP}}=\frac{1}{R}\sum_{i=1}^{R} \tilde g_i
                =\frac{c}{R}\sum_{i=1}^{R} g_i .
$$

We want this to equal the **global‑sample average**:

$$
g_{\text{true}}
=\frac{1}{N_{\text{total}}}\sum_{i=1}^{R}\sum_{j=1}^{N_i}\nabla
\ell_{ij}
   =\frac{1}{N_{\text{total}}}\sum_{i=1}^{R} g_i .
$$

Thus for FSDP gradient to be correct, we need

$$
\frac{c}{R}= \frac{1}{N_{\text{total}}}\quad\Longrightarrow\quad
c=\frac{R}{N_{\text{total}}}.
$$

So the *right* scaling factor is $R/N_{\text{total}}$, which means dividing
the per-rank sum loss by $N_{\text{total}}/R$, i.e. the **average
number of tokens per rank**.
Intuitively, this is the same as the default cross-entropy loss, but instead
of dividing the sum loss on a rank by the number of tokens **on that rank**,
we now divide by the **average number of tokens across all ranks**.
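
A minimal sketch of this scaling (names are illustrative, not the PR's actual API): divide the per-rank sum loss by the average number of valid tokens per rank, obtained with one extra all-reduce.

```python
# Sketch only: divide the per-rank sum loss by N_total / R, i.e. the average
# number of valid tokens per rank. Names are illustrative, not the PR's API.
import torch
import torch.distributed as dist

def balanced_loss(per_token_loss: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    # per_token_loss: [num_tokens] unreduced loss; valid_mask: [num_tokens] bool
    sum_loss = (per_token_loss * valid_mask).sum()
    num_valid = valid_mask.sum().float()
    # one extra comm: global count of valid tokens across all data ranks
    total_valid = num_valid.clone()
    dist.all_reduce(total_valid, op=dist.ReduceOp.SUM)
    avg_tokens_per_rank = total_valid / dist.get_world_size()
    # dividing by N_total / R makes FSDP's gradient averaging match the
    # global per-token average derived above
    return sum_loss / avg_tokens_per_rank
```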


P.S.: sorry, this PR is based on pytorch#1802 but I couldn't choose that as the
base branch. It may be easier to review once that PR is merged.
…ytorch#1776)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* pytorch#1797
* __->__ pytorch#1776

**Status**
1. Change all models, including the experimental ones.
2. E2E loss verification.
3. We should add a unit test for attention, but since we don't have GPU
unit tests, this can be done in a separate PR.

**Summary**
This PR aims to refactor how TorchTitan builds the attention masks and
passes them to the model. Before this PR, init_attention_masks() is called in
the Trainer, but the masks are stored as a class variable of
FlexAttentionWrapper(). We chose this shortcut to support the case where
a single model requires multiple masks.

The previous design has several issues; one in particular is
pytorch#1723.

pytorch/pytorch#164111 proves that we can let
PP split BlockMask, so this PR performs the refactor to pass masks as an
argument of model.forward().

The new design:
1. The model needs to provide `get_attention_masks()`, which accepts
`create_mask_fn`, `batch`, and `eos_id`. If the attention op is SDPA,
this API should return None as SDPA currently doesn't support
varlen. But once it does, we may have to return some tuple of ints that
represents the mask.

Justification: attention logic is technically a part of the model, but it
requires some information from the trainer/dataloader. So it's the model
author's responsibility to provide an API that lets the trainer get the
masks (a rough sketch of such a hook is shown below).

2. `get_attention_masks()` will be called from the trainer and the
resulting masks are passed to the model.forward().

Justification: this will allow us to fix
pytorch#1723 with
pytorch/pytorch#164111 and this PR.

3. Now SDPA and FlexAttention are wrapped in two different classes.
~~Note: we still have two very thin op wrappers that are used for
CP. I keep these two for CP education purposes, but this can certainly
be confusing for Titan's users. I'm open to merging them into
AttentionOp.~~

See the discussion in pytorch#1723.
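
A rough sketch of what such a hook could look like with FlexAttention, based on the description above (the document-causal mask_mod and the exact signatures are assumptions, not the merged API):

```python
# Assumed sketch, not the merged torchtitan API: a model-provided hook that
# builds a per-sample, document-causal BlockMask from the batch and eos_id.
# An SDPA-only model would simply return None from this hook.
import torch
from torch.nn.attention.flex_attention import create_block_mask

def get_attention_masks(create_mask_fn, batch: torch.Tensor, eos_id: int):
    # documents are separated by EOS tokens; tokens attend causally within a doc
    doc_ids = (batch == eos_id).cumsum(dim=-1)

    def mask_mod(b, h, q_idx, kv_idx):
        return (q_idx >= kv_idx) & (doc_ids[b, q_idx] == doc_ids[b, kv_idx])

    bsz, seq_len = batch.shape
    return create_mask_fn(mask_mod, bsz, None, seq_len, seq_len, device=batch.device)

# Trainer side, per the new design (illustrative call):
# masks = model.get_attention_masks(create_block_mask, input_ids, tokenizer.eos_id)
# output = model(input_ids, attention_masks=masks)
```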


**Verification**
*llama3*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/llama3/train_configs/debug_model.toml"
```
*llama3 flex*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/llama3/train_configs/debug_model.toml" --baseline-train-options="--model.flavor=debugmodel_flex_attn"
```
*llama4*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint 
```
*llama4 irope*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint 
```
*deepseek*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml"
```
*deepseek flex*
```
./loss_compare.sh main 9dc16675b272ffdc3ed616e3244bcf7dc2d257f2 --steps=100 --no-seed-checkpoint --config="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" --baseline-train-options="--model.flavor=debugmodel_flex_attn"
```
tushar00jain and others added 9 commits October 13, 2025 13:01
Summary:
the script adds configuration options to run training locally with ft
enabled

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1812).
* pytorch#1840
* pytorch#1811
* pytorch#1810
* __->__ pytorch#1812
* pytorch#1809

---------

Co-authored-by: Tushar Jain <[email protected]>
pytorch#1850 removed the `name` field in
`TrainSpec`. The experiments in simple_fsdp should also be updated;
otherwise they won't run.

pytorch#1776 added the `use_flex_attn`
field to `apply_non_moe_tp()`, which is missing in the simple_fsdp
experiments.

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name simple_fsdp.llama3 --compile.enable
```

```
NGPU=8 CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name simple_fsdp.deepseek_v3 --compile.enable
```
This PR:

- lets `ExpertParallel` handle indices permute / unpermute when EP is used
- moves `to_local` to model code to be more explicit
- renames the `expert_parallel` wrapper that does permute / unpermute to
`indices_permutation_wrapper` to be more accurate
Summary:
Allows disabling the storage of checkpoints related to torchft.

Users don't really have to rely on any external storage, which reduces
setup time to get things up and running, and we don't really need model
checkpoints when we have torchft. If checkpoint storage has issues, this
can also work as a killswitch to completely disable the storage so it
doesn't impact training.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1810).
* pytorch#1856
* pytorch#1811
* __->__ pytorch#1810

Co-authored-by: Tushar Jain <[email protected]>
The next step is to move `qwen3` and `llama4` to core and remove
outdated experiments.
As titled.

Added CI for testing; fixed a minor TP issue after adding attention_mask.
@githubsgi
Contributor Author

@tianyu-l, do not review yet. I'm not sure the rebase was valid.

@githubsgi
Contributor Author

@tianyu-l, please review.

Contributor

@tianyu-l left a comment


LGTM, thanks a lot!

@tianyu-l
Contributor

@githubsgi Please address lint issues.

@tianyu-l
Contributor

@githubsgi
Sorry, it seems you need to rebase.


Labels

CLA Signed (managed by the Meta Open Source bot)


Development

Successfully merging this pull request may close these issues.

Adding options to enable some determinism related configs