
Commit 4ece78c

Update on "[cp][flex_attention] integration test trial"
Pull-Request-resolved: #1160 [ghstack-poisoned]
2 parents: 68e76cf + 3b76b93

93 files changed, +2578 -592 lines changed


.github/CODEOWNERS

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# This is a CODEOWNERS file.
+# Each line is a file pattern followed by one or more owners.
+
+# These owners will be the default owners for everything in
+# the repo. Unless a later match takes precedence,
+# they will be requested for review when someone opens a pull request.
+* @tianyu-l @fegin @wwwjn
+
+# Exclude the experiments directory by adding a pattern without owners
+/torchtitan/experiments/
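
The comments in the file above describe how CODEOWNERS works: the last pattern that matches a path wins, and a pattern with no owners clears ownership for matching files. As a rough illustration only (GitHub's real matcher supports more pattern syntax; the rule table below simply mirrors the two entries added here, with the leading "/" dropped), a minimal Python sketch of that last-match-wins behavior:

```python
from fnmatch import fnmatch

# Minimal sketch of the last-match-wins rule described in the CODEOWNERS comments;
# not GitHub's actual matcher, just the two rules added in this commit.
RULES = [
    ("*", ["@tianyu-l", "@fegin", "@wwwjn"]),   # default owners for everything
    ("torchtitan/experiments/*", []),           # later match with no owners -> excluded
]

def owners_for(path: str) -> list[str]:
    owners: list[str] = []
    for pattern, rule_owners in RULES:
        if fnmatch(path, pattern):              # fnmatch's "*" also spans "/" here
            owners = rule_owners                # a later matching rule overrides earlier ones
    return owners

print(owners_for("torchtitan/train.py"))                   # default owners requested
print(owners_for("torchtitan/experiments/flux/model.py"))  # [] -> no review requested
```

The practical effect is that changes under torchtitan/experiments/ no longer trigger automatic review requests to the default owners.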

.github/workflows/integration_test_8gpu.yaml

Lines changed: 8 additions & 1 deletion
@@ -3,10 +3,15 @@ name: 8 GPU Integration Test
 on:
   push:
     branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
   pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'
   schedule:
     # Runs every 6 hours
     - cron: '0 */6 * * *'
+
 concurrency:
   group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
   cancel-in-progress: true
@@ -17,7 +22,7 @@ defaults:
 
 jobs:
   build-test:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     with:
       runner: linux.g5.48xlarge.nvidia.gpu
       gpu-arch-type: cuda
@@ -38,5 +43,7 @@ jobs:
 
         python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
 
+        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+
         mkdir artifacts-to-be-uploaded
         python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
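
One detail worth noting in the concurrency block above: the group expression `${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}` acts like a ternary. On main, every run gets a unique group (so runs triggered by pushes to main are never cancelled), while on any other ref the group repeats, so a newer push cancels the in-progress run for the same branch or PR. A rough Python sketch of the same selection logic, with hypothetical example values:

```python
# Rough Python equivalent of the GitHub Actions expression
#   ${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
# used in the concurrency group above (example values are hypothetical).
def concurrency_group(workflow: str, ref: str, run_number: int) -> str:
    # '&&' / '||' in Actions expressions short-circuit much like Python's and/or.
    suffix = (ref == "refs/heads/main" and run_number) or ref
    return f"unit-test{workflow}-{suffix}"

# On main: each push gets a distinct group, so runs are never cancelled.
print(concurrency_group("8 GPU Integration Test", "refs/heads/main", 42))
# On a PR ref: the group repeats, so a newer push cancels the older in-progress run.
print(concurrency_group("8 GPU Integration Test", "refs/pull/1160/merge", 43))
```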

.github/workflows/integration_test_8gpu_flux.yaml

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,7 @@ jobs:
       docker-image: torchtitan-ubuntu-20.04-clang12
       repository: pytorch/torchtitan
       upload-artifact: outputs
+      # delete the checkpoints in the artifacts to save CI uploading time
       script: |
         set -eux
 
@@ -44,3 +45,4 @@ jobs:
 
         mkdir artifacts-to-be-uploaded
         python -m torchtitan.experiments.flux.tests.integration_tests artifacts-to-be-uploaded --ngpu 8
+        rm -rf artifacts-to-be-uploaded/*/checkpoint
.github/workflows/integration_test_8gpu_h100.yaml

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
+name: 8 GPU Integration Test on H100
+
+on:
+  push:
+    branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
+  pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'
+  schedule:
+    # Runs every 6 hours
+    - cron: '0 */6 * * *'
+
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  build-test:
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: linux.aws.h100.8
+      gpu-arch-type: cuda
+      gpu-arch-version: "12.6"
+      # This image is faster to clone than the default, but it lacks CC needed by triton
+      # (1m25s vs 2m37s).
+      docker-image: torchtitan-ubuntu-20.04-clang12
+      repository: pytorch/torchtitan
+      upload-artifact: outputs
+      script: |
+        set -eux
+
+        # The generic Linux job chooses to use base env, not the one setup by the image
+        CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
+        conda activate "${CONDA_ENV}"
+
+        pip config --user set global.progress_bar off
+
+        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        mkdir artifacts-to-be-uploaded
+        python ./tests/integration_tests_h100.py artifacts-to-be-uploaded --ngpu 8
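
The script above works around the generic Linux job activating the base conda environment rather than the one set up in the Docker image: it asks conda for its environment list as JSON and activates the last entry. A minimal Python sketch of the same selection (assuming, as the shell one-liner does, that the image-provided environment is listed last):

```python
import json
import subprocess

# Rough Python equivalent of the shell line in the workflow script:
#   CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
def last_conda_env() -> str:
    out = subprocess.run(
        ["conda", "env", "list", "--json"], capture_output=True, text=True, check=True
    )
    envs = json.loads(out.stdout)["envs"]  # list of environment paths; base comes first
    return envs[-1]                        # assume the image-provided env is listed last

if __name__ == "__main__":
    print(last_conda_env())
```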
.github/workflows/integration_test_8gpu_simple_fsdp.yaml

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+name: SimpleFSDP 8 GPU Integration Test
+
+on:
+  push:
+    branches: [ main ]
+    paths:
+      - 'torchtitan/experiments/simple_fsdp/**'
+  pull_request:
+    paths:
+      - 'torchtitan/experiments/simple_fsdp/**'
+  schedule:
+    # Runs every 6 hours
+    - cron: '0 */6 * * *'
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  build-test:
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: linux.g5.48xlarge.nvidia.gpu
+      gpu-arch-type: cuda
+      gpu-arch-version: "12.6"
+      # This image is faster to clone than the default, but it lacks CC needed by triton
+      # (1m25s vs 2m37s).
+      docker-image: torchtitan-ubuntu-20.04-clang12
+      repository: pytorch/torchtitan
+      upload-artifact: outputs
+      script: |
+        set -eux
+
+        # The generic Linux job chooses to use base env, not the one setup by the image
+        CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
+        conda activate "${CONDA_ENV}"
+
+        pip config --user set global.progress_bar off
+
+        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+
+        mkdir artifacts-to-be-uploaded
+        python -m torchtitan.experiments.simple_fsdp.tests.integration_tests artifacts-to-be-uploaded --ngpu 8

.github/workflows/release.yml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# mostly borrowed from https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-pypi
+
+name: Publish a Release to PyPI
+
+on:
+  release:
+    types: [published]
+
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+
+      - name: build release distributions
+        run: |
+          python -m pip install build
+          python -m build
+
+      - name: upload windows dists
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+  pypi-publish:
+    runs-on: ubuntu-latest
+    needs:
+      - release-build
+    permissions:
+      id-token: write
+    environment:
+      name: release
+      url: https://pypi.org/p/torchtitan
+
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

.github/workflows/unit_test_cpu.yaml

Lines changed: 7 additions & 0 deletions
@@ -3,7 +3,11 @@ name: CPU Unit Test
 on:
   push:
     branches: [ main ]
+    paths-ignore:
+      - 'torchtitan/experiments/**'
   pull_request:
+    paths-ignore:
+      - 'torchtitan/experiments/**'
 
 concurrency:
   group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
@@ -25,4 +29,7 @@ jobs:
         pip config --user set global.progress_bar off
 
         pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
+
+        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cpu
+
         pytest tests/unit_tests --cov=. --cov-report=xml --durations=20 -vv

README.md

Lines changed: 12 additions & 11 deletions
@@ -17,7 +17,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent PyTorch nightly.
 
 
 ## Latest News
-- [2025/04] Our paper has been accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620). The poster will be presented on Friday April 25th.
+- [2025/04] Our paper was accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620).
 - [2025/04] [Llama 4](torchtitan/experiments/llama4/) initial support is available as an experiment.
 - [2025/04] Training the diffusion model [FLUX](torchtitan/experiments/flux/) with FSDP/HSDP is available as an experiment.
 - [2025/04] The frontend implementation of [SimpleFSDP](torchtitan/experiments/simple_fsdp/), a compiler-based FSDP framework, is available as an experiment.
@@ -60,27 +60,28 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 7. DDP and HSDP
 8. [TorchFT](https://github.com/pytorch/torchft) integration
 9. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries) and support for [custom datasets](docs/datasets.md)
-10. Flexible learning rate scheduler (warmup-stable-decay)
-11. Loss, GPU memory, throughput (tokens/sec), TFLOPs, and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
-12. [Debugging tools](docs/debugging.md) including CPU/GPU profiling, memory profiling, Flight Recorder, etc.
-13. All options easily configured via [toml files](torchtitan/models/llama3/train_configs/)
-14. [Helper scripts](scripts/) to
+10. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument in configuration
+11. Flexible learning rate scheduler (warmup-stable-decay)
+12. Loss, GPU memory, throughput (tokens/sec), TFLOPs, and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
+13. [Debugging tools](docs/debugging.md) including CPU/GPU profiling, memory profiling, Flight Recorder, etc.
+14. All options easily configured via [toml files](torchtitan/models/llama3/train_configs/)
+15. [Helper scripts](scripts/) to
     - download tokenizers from Hugging Face
     - convert original Llama 3 checkpoints into the expected DCP format
     - estimate FSDP/HSDP memory usage without materializing the model
     - run distributed inference with Tensor Parallel
 
-We report [performance](docs/performance.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
+We report [performance](benchmarks/llama3_h100_202412_torchtitan.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
 
 ### Dive into the code
 
 You may want to see how the model is defined or how parallelism techniques are applied. For a guided tour, see these files first:
 * [torchtitan/train.py](torchtitan/train.py) - the main training loop and high-level setup code
-* [torchtitan/models/llama3/model.py](torchtitan/models/llama3/model.py) - the Llama 3.1 model definition
-* [torchtitan/models/llama3/parallelize_llama.py](torchtitan/models/llama3/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
-* [torchtitan/models/llama3/pipeline_llama.py](torchtitan/models/llama3/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
+* [torchtitan/models/llama3/model/model.py](torchtitan/models/llama3/model/model.py) - the Llama 3.1 model definition
+* [torchtitan/models/llama3/infra/parallelize.py](torchtitan/models/llama3/infra/parallelize.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
+* [torchtitan/models/llama3/infra/pipeline.py](torchtitan/models/llama3/infra/pipeline.py) - helpers for applying Pipeline Parallel to the model
 * [torchtitan/components/checkpoint.py](torchtitan/components/checkpoint.py) - utils for saving/loading distributed checkpoints
-* [torchtitan/components/float8.py](torchtitan/components/float8.py) - utils for applying Float8 techniques
+* [torchtitan/components/quantization/float8.py](torchtitan/components/quantization/float8.py) - utils for applying Float8 techniques
 
 
 ## Installation
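
Item 10 in the feature list above mentions gradient accumulation enabled via `--training.global_batch_size`. The usual convention for such a knob (and roughly how one would expect it to resolve here, though the exact behavior should be checked against torchtitan's source) is that the global batch size implies a number of accumulated micro-batches per optimizer step. A small Python sketch of that arithmetic, with hypothetical values:

```python
def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_degree: int) -> int:
    """Number of micro-batches accumulated per optimizer step.

    Follows the common convention global = local * dp_degree * accum_steps;
    an illustration, not torchtitan's exact implementation.
    """
    per_step = local_batch_size * dp_degree
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by local_batch_size * dp_degree")
    return global_batch_size // per_step

# e.g. a global batch of 512 with a local batch of 8 on 16 data-parallel ranks
# means 4 forward/backward passes are accumulated before each optimizer step.
print(grad_accum_steps(512, 8, 16))
```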

assets/version.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.0.2
+0.1.0

benchmarks/README.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+We welcome the community to submit reproducible benchmarking results.
+
+## Submission Guidelines
+
+A submission should be a file / files including the following information
+
+1. Entity, which could be your name, GitHub username, company, university, team, etc.
+2. The model or theme of benchmarking, e.g. Llama 3.1, Async TP.
+3. The hardware setup, including the types of GPUs, interconnections, etc.
+4. The actual performance report with training configs, e.g. via
+    - `.toml` files / commandline arguments
+    - complete configs, which can be found in the log with [`--print_args`](https://github.com/pytorch/torchtitan/blob/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc/torchtitan/config_manager.py#L47) turned on (preferred as the default value not shown in `.toml` or specified in commandline could change from time to time)
+5. The versions and date/time of `torchtitan`, `torch`, `torchao`, or any relevant dependencies.
+6. Other notes which could help reproduce the results.
+
+The name of the file should follow the format of
+```
+[model/theme]_[hardware]_[date/time]_[entity].md
+```
+For example, `llama3.1_h100_202412_pytorch.md`, `asynctp_256xh100_20250613_alice+bob.md`.
+
+An example can be found at [llama3_h100_202412_torchtitan.md](./llama3_h100_202412_torchtitan.md).
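
As a small, purely illustrative helper (not part of the repo), the naming convention above can be assembled like this:

```python
import datetime
from typing import Optional

def submission_filename(model_or_theme: str, hardware: str, entity: str,
                        when: Optional[datetime.date] = None) -> str:
    """Assemble [model/theme]_[hardware]_[date/time]_[entity].md (illustrative only)."""
    when = when or datetime.date.today()
    return f"{model_or_theme}_{hardware}_{when:%Y%m}_{entity}.md"

print(submission_filename("llama3.1", "h100", "pytorch", datetime.date(2024, 12, 1)))
# -> llama3.1_h100_202412_pytorch.md
```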
