Commit 3b76b93

Update base for Update on "[cp][flex_attention] integration test trial"

Pull-Request-resolved: #1160
[ghstack-poisoned]

2 parents 29a67ec + 0b44d4c

File tree: 93 files changed (+2571 −591 lines)


.github/CODEOWNERS

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# This is a CODEOWNERS file.
+# Each line is a file pattern followed by one or more owners.
+
+# These owners will be the default owners for everything in
+# the repo. Unless a later match takes precedence,
+# they will be requested for review when someone opens a pull request.
+* @tianyu-l @fegin @wwwjn
+
+# Exclude the experiments directory by adding a pattern without owners
+/torchtitan/experiments/

.github/workflows/integration_test_8gpu.yaml

Lines changed: 8 additions & 1 deletion
@@ -3,10 +3,15 @@ name: 8 GPU Integration Test
 on:
   push:
     branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
   pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'
   schedule:
     # Runs every 6 hours
     - cron: '0 */6 * * *'
+
 concurrency:
   group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
   cancel-in-progress: true
@@ -17,7 +22,7 @@ defaults:

 jobs:
   build-test:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     with:
       runner: linux.g5.48xlarge.nvidia.gpu
       gpu-arch-type: cuda
@@ -38,5 +43,7 @@ jobs:

       python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126

+      USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+
       mkdir artifacts-to-be-uploaded
       python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
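
(Context, not part of the diff: `USE_CPP=0` is torchao's build flag for skipping its C++/CUDA extension build, which keeps the nightly install fast on CI runners.)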

.github/workflows/integration_test_8gpu_flux.yaml

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,7 @@ jobs:
       docker-image: torchtitan-ubuntu-20.04-clang12
       repository: pytorch/torchtitan
       upload-artifact: outputs
+      # delete the checkpoints in the artifacts to save CI uploading time
       script: |
         set -eux

@@ -44,3 +45,4 @@ jobs:

         mkdir artifacts-to-be-uploaded
         python -m torchtitan.experiments.flux.tests.integration_tests artifacts-to-be-uploaded --ngpu 8
+        rm -rf artifacts-to-be-uploaded/*/checkpoint

.github/workflows/integration_test_8gpu_h100.yaml

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
+name: 8 GPU Integration Test on H100
+
+on:
+  push:
+    branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
+  pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'
+  schedule:
+    # Runs every 6 hours
+    - cron: '0 */6 * * *'
+
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  build-test:
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: linux.aws.h100.8
+      gpu-arch-type: cuda
+      gpu-arch-version: "12.6"
+      # This image is faster to clone than the default, but it lacks CC needed by triton
+      # (1m25s vs 2m37s).
+      docker-image: torchtitan-ubuntu-20.04-clang12
+      repository: pytorch/torchtitan
+      upload-artifact: outputs
+      script: |
+        set -eux
+
+        # The generic Linux job chooses to use base env, not the one setup by the image
+        CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
+        conda activate "${CONDA_ENV}"
+
+        pip config --user set global.progress_bar off
+
+        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        mkdir artifacts-to-be-uploaded
+        python ./tests/integration_tests_h100.py artifacts-to-be-uploaded --ngpu 8

.github/workflows/integration_test_8gpu_simple_fsdp.yaml

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+name: SimpleFSDP 8 GPU Integration Test
+
+on:
+  push:
+    branches: [ main ]
+    paths:
+      - 'torchtitan/experiments/simple_fsdp/**'
+  pull_request:
+    paths:
+      - 'torchtitan/experiments/simple_fsdp/**'
+  schedule:
+    # Runs every 6 hours
+    - cron: '0 */6 * * *'
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  build-test:
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: linux.g5.48xlarge.nvidia.gpu
+      gpu-arch-type: cuda
+      gpu-arch-version: "12.6"
+      # This image is faster to clone than the default, but it lacks CC needed by triton
+      # (1m25s vs 2m37s).
+      docker-image: torchtitan-ubuntu-20.04-clang12
+      repository: pytorch/torchtitan
+      upload-artifact: outputs
+      script: |
+        set -eux
+
+        # The generic Linux job chooses to use base env, not the one setup by the image
+        CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
+        conda activate "${CONDA_ENV}"
+
+        pip config --user set global.progress_bar off
+
+        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        mkdir artifacts-to-be-uploaded
+        python -m torchtitan.experiments.simple_fsdp.tests.integration_tests artifacts-to-be-uploaded --ngpu 8

.github/workflows/release.yml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# mostly borrowed from https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-pypi
+
+name: Publish a Release to PyPI
+
+on:
+  release:
+    types: [published]
+
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+
+      - name: build release distributions
+        run: |
+          python -m pip install build
+          python -m build
+
+      - name: upload windows dists
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+  pypi-publish:
+    runs-on: ubuntu-latest
+    needs:
+      - release-build
+    permissions:
+      id-token: write
+    environment:
+      name: release
+      url: https://pypi.org/p/torchtitan
+
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

.github/workflows/unit_test_cpu.yaml

Lines changed: 7 additions & 0 deletions
@@ -3,7 +3,11 @@ name: CPU Unit Test
 on:
   push:
     branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
   pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'

 concurrency:
   group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
@@ -25,4 +29,7 @@ jobs:
         pip config --user set global.progress_bar off

         pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
+
+        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cpu
+
         pytest tests/unit_tests --cov=. --cov-report=xml --durations=20 -vv

README.md

Lines changed: 12 additions & 11 deletions
@@ -17,7 +17,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent P


 ## Latest News
-- [2025/04] Our paper has been accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620). The poster will be presented on Friday April 25th.
+- [2025/04] Our paper was accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620).
 - [2025/04] [Llama 4](torchtitan/experiments/llama4/) initial support is available as an experiment.
 - [2025/04] Training the diffusion model [FLUX](torchtitan/experiments/flux/) with FSDP/HSDP is available as an experiment.
 - [2025/04] The frontend implementation of [SimpleFSDP](torchtitan/experiments/simple_fsdp/), a compiler-based FSDP framework, is available as an experiment.
@@ -60,27 +60,28 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 7. DDP and HSDP
 8. [TorchFT](https://github.com/pytorch/torchft) integration
 9. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries) and support for [custom datasets](docs/datasets.md)
-10. Flexible learning rate scheduler (warmup-stable-decay)
-11. Loss, GPU memory, throughput (tokens/sec), TFLOPs, and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
-12. [Debugging tools](docs/debugging.md) including CPU/GPU profiling, memory profiling, Flight Recorder, etc.
-13. All options easily configured via [toml files](torchtitan/models/llama3/train_configs/)
-14. [Helper scripts](scripts/) to
+10. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument in configuration
+11. Flexible learning rate scheduler (warmup-stable-decay)
+12. Loss, GPU memory, throughput (tokens/sec), TFLOPs, and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
+13. [Debugging tools](docs/debugging.md) including CPU/GPU profiling, memory profiling, Flight Recorder, etc.
+14. All options easily configured via [toml files](torchtitan/models/llama3/train_configs/)
+15. [Helper scripts](scripts/) to
     - download tokenizers from Hugging Face
     - convert original Llama 3 checkpoints into the expected DCP format
     - estimate FSDP/HSDP memory usage without materializing the model
     - run distributed inference with Tensor Parallel

-We report [performance](docs/performance.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
+We report [performance](benchmarks/llama3_h100_202412_torchtitan.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.

 ### Dive into the code

 You may want to see how the model is defined or how parallelism techniques are applied. For a guided tour, see these files first:
 * [torchtitan/train.py](torchtitan/train.py) - the main training loop and high-level setup code
-* [torchtitan/models/llama3/model.py](torchtitan/models/llama3/model.py) - the Llama 3.1 model definition
-* [torchtitan/models/llama3/parallelize_llama.py](torchtitan/models/llama3/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
-* [torchtitan/models/llama3/pipeline_llama.py](torchtitan/models/llama3/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
+* [torchtitan/models/llama3/model/model.py](torchtitan/models/llama3/model/model.py) - the Llama 3.1 model definition
+* [torchtitan/models/llama3/infra/parallelize.py](torchtitan/models/llama3/infra/parallelize.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
+* [torchtitan/models/llama3/infra/pipeline.py](torchtitan/models/llama3/infra/pipeline.py) - helpers for applying Pipeline Parallel to the model
 * [torchtitan/components/checkpoint.py](torchtitan/components/checkpoint.py) - utils for saving/loading distributed checkpoints
-* [torchtitan/components/float8.py](torchtitan/components/float8.py) - utils for applying Float8 techniques
+* [torchtitan/components/quantization/float8.py](torchtitan/components/quantization/float8.py) - utils for applying Float8 techniques


 ## Installation
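
Item 10 in the feature list above (gradient accumulation) is driven by a single extra flag; a minimal sketch, assuming the `run_train.sh` wrapper, `CONFIG_FILE` variable, and config path that torchtitan uses elsewhere (none of which appear in this diff):

```
# Hedged sketch: request a global batch size larger than
# local batch size x data-parallel degree; gradients are then
# accumulated over several microbatches before each optimizer step.
# run_train.sh / CONFIG_FILE are assumed conventions, not shown in this commit.
CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml \
    ./run_train.sh --training.global_batch_size 64
```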

assets/version.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.0.2
+0.1.0

benchmarks/README.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+We welcome the community to submit reproducible benchmarking results.
+
+## Submission Guidelines
+
+A submission should be a file / files including the following information
+
+1. Entity, which could be your name, GitHub username, company, university, team, etc.
+2. The model or theme of benchmarking, e.g. Llama 3.1, Async TP.
+3. The hardware setup, including the types of GPUs, interconnections, etc.
+4. The actual performance report with training configs, e.g. via
+   - `.toml` files / commandline arguments
+   - complete configs, which can be found in the log with [`--print_args`](https://github.com/pytorch/torchtitan/blob/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc/torchtitan/config_manager.py#L47) turned on (preferred as the default value not shown in `.toml` or specified in commandline could change from time to time)
+5. The versions and date/time of `torchtitan`, `torch`, `torchao`, or any relevant dependencies.
+6. Other notes which could help reproduce the results.
+
+The name of the file should follow the format of
+```
+[model/theme]_[hardware]_[date/time]_[entity].md
+```
+For example, `llama3.1_h100_202412_pytorch.md`, `asynctp_256xh100_20250613_alice+bob.md`.
+
+An example can be found at [llama3_h100_202412_torchtitan.md](./llama3_h100_202412_torchtitan.md).
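
A minimal sketch of capturing the complete config mentioned in item 4 of the guidelines above, assuming the `run_train.sh` wrapper, `CONFIG_FILE` variable, and config path torchtitan uses (none are part of this diff; `--print_args` is the flag linked above):

```
# Hedged sketch: dump the fully resolved training config into a log
# that can accompany a benchmark submission.
CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml \
    ./run_train.sh --print_args 2>&1 | tee benchmark_run.log
```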

Lines changed: 44 additions & 0 deletions

@@ -0,0 +1,44 @@
+The following performance benchmarks were done by the PyTorch team in June 2025, to measure the performance improvements of async TP over the vanilla TP baseline.
+
+### Models
+
+Llama 3.1 8B, 70B
+
+### Hardware
+
+We ran our performance benchmarks on the [Grand Teton platform](https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/), where
+- Each host has 8 NVIDIA H100 GPUs fully connected with NVLink.
+- Each H100 GPU is equipped with 96GB HBM2e with 2.4 TB/sec peak memory bandwidth.
+- Hosts are inter-connected with backend RDMA network with 400 Gb/s per GPU.
+- We used the default 500W power limit, although tuning it up to 700W TDP can potentially provide further speedups.
+
+
+### Results
+
+Detailed performance results and training configurations can be found in the tables below:
+
+#### Llama 3 70B on 256 H100s with FSDP=32, TP=8, torch.compile, full AC, local batch size 16
+
+| Quantization | Vanilla TP tokens/sec | Async TP tokens/sec | Async TP speedup |
+| :---------------- | :---- | :---- | :--- |
+| None (bfloat16) | 597.3 | 652.4 | 1.09 |
+| float8 tensorwise | 809.8 | 942.4 | 1.16 |
+| float8 rowwise | 599.6 | 624.8 | 1.04 |
+
+#### Llama 3 8B on 64 H100s with FSDP=8, TP=8, torch.compile, per-op SAC, local batch size 12
+
+| Quantization | Vanilla TP tokens/sec | Async TP tokens/sec | Async TP speedup |
+| :---------------- | :----- | :----- | :--- |
+| None (bfloat16) | 4378 | 4809.4 | 1.10 |
+| float8 tensorwise | 5078.1 | 5570.1 | 1.10 |
+| float8 rowwise | 3708.5 | 3914.9 | 1.06 |
+
+**Note**: the low baseline performance of the vanilla TP float8 rowwise training is being addressed here: https://github.com/pytorch/torchtitan/issues/1207
+
+### Versions and Dates
+
+| repo | commit | date |
+| --- | --- | --- |
+| torch | [38410cf9](https://github.com/pytorch/pytorch/commit/38410cf9b57079f3360c1e79601973a01cb2588c) | 2025/06/14 |
+| torchao | [6243040](https://github.com/pytorch/ao/commit/6243040807b9ceee889a58cba8e68c5fc4e2ebd8) | 2025/06/13 |
+| torchtitan | [820504e](https://github.com/pytorch/torchtitan/commit/820504e20d1149fbf0b98c567af24c4b0433b22d) | 2025/06/13 |
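
For reference, the speedup column in each table above is simply the ratio of the two throughput columns, e.g. for bfloat16 at 70B scale: 652.4 / 597.3 ≈ 1.09.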

docs/performance.md renamed to benchmarks/llama3_h100_202412_torchtitan.md

Lines changed: 11 additions & 1 deletion
@@ -1,11 +1,21 @@
+The following performance benchmarks were done by the `torchtitan` team at the end of 2024 using the latest `torch`, `torchao`, and `torchtitan` versions.
+
+### Models
+
+Llama 3.1 8B, 70B, 405B
+
 We demonstrate the effectiveness of elastic distributed training using torchtitan, via experiments on Llama 3.1 8B, 70B, and 405B models, from 1D parallelism to 4D parallelism, at the scale from 8 GPUs to 512 GPUs.

+### Hardware
+
 We ran our performance benchmarks on the [Grand Teton platform](https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/), where
 - Each host has 8 NVIDIA H100 GPUs fully connected with NVLink.
 - Each H100 GPU is equipped with 96GB HBM2e with 2.4 TB/sec peak memory bandwidth.
 - Hosts are inter-connected with backend RDMA network with 400 Gb/s per GPU.
 - We used the default 500W power limit, although tuning it up to 700W TDP can potentially provide further speedups.

+### Results
+
 We note that, throughout our experimentation, memory readings are stable across the whole training process[^1], whereas throughput numbers (TPS/GPU) are calculated and logged every 10 iterations, and always read at the (arbitrarily determined) 90th iteration.

 We do not report Model FLOPS Utilization (MFU) because when Float8 is enabled (on `nn.Linear` modules), both BFLOAT16 Tensor Core and FP8 Tensor Core are involved in model training, but they have different peak FLOPS and the definition of MFU under such scenario is not well-defined. We note that the 1D Llama 3.1 8B model training on 8 or 128 H100 GPUs without Float8 achieves 33% to 39% MFU[^2] (with or without torch.compile, respectively).
@@ -58,8 +68,8 @@ We do not report Model FLOPS Utilization (MFU) because when Float8 is enabled (o
 | FSDP 2, CP 4 | 131072 | 31 | 77.1 |
 | FSDP 1, CP 8 | 262144 | 16 | 84.9 |

+### Versions and Dates

-#### Versions used for performance testing
 | repo | commit | date |
 | --- | --- | --- |
 | torch | [1963fc8](https://github.com/pytorch/pytorch/commit/1963fc83a1c32e162162e2414f78b043f0674bae) | 2024/12/23 |
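
As background for the MFU discussion in the renamed file above: MFU is conventionally the ratio of achieved to peak FLOPS,

$$ \mathrm{MFU} = \frac{\text{tokens/sec per GPU} \times \text{FLOPs per token}}{\text{peak FLOPS per GPU}}, $$

with FLOPs per token often approximated as $6N$ for an $N$-parameter decoder (plus attention terms). Once BF16 and FP8 Tensor Cores are mixed in one run, there is no single peak-FLOPS denominator, which is why MFU is omitted for Float8 runs.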
