
Conversation

tushar00jain
Contributor

@tushar00jain tushar00jain commented Oct 7, 2025

Summary:
Allows disabling the storage of checkpoints related to torchft.

Users don't have to rely on any external storage, which reduces the setup time needed to get things up and running, and model checkpoints aren't strictly necessary when torchft is enabled, since state can be recovered from live replicas. If checkpoint storage has issues, this option can also serve as a killswitch that completely disables storage so it doesn't impact training.
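
For illustration, here is a minimal sketch of how such a killswitch could be wired into a checkpoint manager. The flag name `disable_ft_checkpoint_storage` and the `CheckpointManager` shape are hypothetical, not the exact option introduced by this PR:

```python
from dataclasses import dataclass


@dataclass
class FaultToleranceConfig:
    # Hypothetical killswitch: when True, skip all torchft-related
    # checkpoint persistence so storage outages cannot stall training.
    disable_ft_checkpoint_storage: bool = False


class CheckpointManager:
    def __init__(self, ft_config: FaultToleranceConfig) -> None:
        self.ft_config = ft_config

    def save(self, step: int) -> None:
        if self.ft_config.disable_ft_checkpoint_storage:
            # torchft replicates state across live replicas, so
            # persisting to external storage becomes optional.
            return
        self._write_to_storage(step)

    def _write_to_storage(self, step: int) -> None:
        # Regular save path (e.g. torch.distributed.checkpoint).
        ...
```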


Stack created with Sapling. Best reviewed with ReviewStack.

This was referenced Oct 7, 2025
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 7, 2025
@tushar00jain tushar00jain force-pushed the pr1810 branch 2 times, most recently from 6ffcdf6 to 4fbe143 Compare October 7, 2025 21:54
Contributor

@fegin fegin left a comment


Same comment as on the other PRs: this one is hard to review because it contains changes from the previous PRs. I would suggest using ghstack, the standard tool most PyTorch developers use to stack PRs.

@tushar00jain tushar00jain force-pushed the pr1810 branch 2 times, most recently from 1c1c5a2 to 634d838 Compare October 8, 2025 17:44
Contributor

@fegin fegin left a comment


Let's make the option more specific, so that it allows retraining on the data, and add a warning.

@tushar00jain tushar00jain force-pushed the pr1810 branch 3 times, most recently from a013c35 to 0beadec Compare October 8, 2025 18:46
@tushar00jain tushar00jain force-pushed the pr1810 branch 2 times, most recently from 22239d9 to 9333989 Compare October 8, 2025 19:09
tushar00jain added a commit that referenced this pull request Oct 8, 2025
Summary:
allow users to specify the profiler schedule

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1809).
* #1811
* #1810
* #1812
* __->__ #1809

Co-authored-by: Tushar Jain <[email protected]>
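
The commit above lets users specify the profiler schedule. As context, a hedged sketch of what a user-specified schedule looks like with the stock `torch.profiler` API; the wait/warmup/active values below are illustrative, not the defaults this PR stack chose:

```python
import torch

# A user-specified schedule: idle 1 step, warm up for 2, record 3,
# then repeat the cycle once. Values here are illustrative only.
schedule = torch.profiler.schedule(wait=1, warmup=2, active=3, repeat=1)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=schedule,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profile_trace"),
) as prof:
    for _ in range(10):
        ...  # one training step
        prof.step()  # advance the schedule
```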
@tushar00jain tushar00jain requested a review from tianyu-l October 8, 2025 20:00
@tushar00jain tushar00jain requested a review from fegin October 8, 2025 20:06
tushar00jain added a commit that referenced this pull request Oct 10, 2025
Summary:
the script adds configuration options to run training locally with ft
enabled

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1812).
* #1840
* #1811
* #1810
* __->__ #1812
* #1809

---------

Co-authored-by: Tushar Jain <[email protected]>
Contributor

@fegin fegin left a comment


We may want to move the TorchFT logic out of the checkpointer in the future. A better design would be for TorchFT to have its own train.py that customizes Trainer to use two checkpointers, one for the regular checkpoint and another for the dataloader.
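
A rough sketch of that direction, with every name here (`FTTrainer`, the stub `Trainer` and `CheckpointManager`) hypothetical rather than existing torchtitan API:

```python
class CheckpointManager:
    """Stub standing in for torchtitan's checkpointer."""

    def __init__(self, state: dict) -> None:
        self.state = state

    def save(self, step: int) -> None: ...
    def load(self) -> None: ...


class Trainer:
    """Stub standing in for torchtitan's Trainer."""

    def __init__(self, config) -> None:
        self.config = config


class FTTrainer(Trainer):
    # Hypothetical TorchFT-specific trainer: instead of teaching one
    # checkpointer about fault tolerance, keep two independent ones.
    def __init__(self, config, model_state: dict, data_state: dict) -> None:
        super().__init__(config)
        # Regular checkpointer: model/optimizer state. With torchft,
        # replicas recover each other, so this can be saved rarely
        # or disabled outright.
        self.model_checkpointer = CheckpointManager(model_state)
        # Separate checkpointer for dataloader state, so data progress
        # persists independently of model checkpointing decisions.
        self.dataloader_checkpointer = CheckpointManager(data_state)
```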

@tianyu-l tianyu-l merged commit a82b77a into pytorch:main Oct 12, 2025
12 of 13 checks passed
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
Summary:
Allows disabling the storage of checkpoints related to torchft.

Users don't have to rely on any external storage, which reduces the
setup time needed to get things up and running, and model checkpoints
aren't strictly necessary when torchft is enabled, since state can be
recovered from live replicas. If checkpoint storage has issues, this
option can also serve as a killswitch that completely disables storage
so it doesn't impact training.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1810).
* pytorch#1856
* pytorch#1811
* __->__ pytorch#1810

Co-authored-by: Tushar Jain <[email protected]>