Conversation

tushar00jain (Contributor)
@tushar00jain commented Oct 7, 2025

Summary:
Record the profiler trace if the training process receives SIGABRT, e.g. when the ProcessGroup watchdog aborts the process.
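The PR's actual hook isn't shown in this thread; as a minimal sketch, a SIGABRT handler can flush the profiler trace before the process dies. Here `dump_trace` is a hypothetical callback, standing in for e.g. a closure over `torch.profiler`'s `export_chrome_trace`:

```python
import signal

def install_abort_trace_handler(dump_trace):
    """Install a SIGABRT handler that dumps the profiler trace.

    `dump_trace` is a hypothetical callback supplied by the caller.
    A production handler would also restore the default disposition
    and re-raise the signal so the abort still terminates the process.
    """
    def handler(signum, frame):
        dump_trace()  # best-effort: write the trace before dying
    signal.signal(signal.SIGABRT, handler)
```

This only covers signals that can be caught in-process; SIGKILL, for instance, cannot be intercepted this way.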


Stack created with Sapling. Best reviewed with ReviewStack.

@meta-cla bot added the CLA Signed label (this label is managed by the Meta Open Source bot) Oct 7, 2025
@tushar00jain marked this pull request as draft October 7, 2025 21:41
@tushar00jain force-pushed the pr1811 branch 9 times, most recently from b4e489c to e1b5016 on October 8, 2025 19:09
tushar00jain added a commit that referenced this pull request Oct 8, 2025
Summary:
Allow users to specify the profiler schedule.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1809).
* #1811
* #1810
* #1812
* __->__ #1809

Co-authored-by: Tushar Jain <[email protected]>
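The schedule being exposed here follows `torch.profiler`'s wait/warmup/active phases. A pure-Python sketch of that cycle (a stand-in to illustrate the semantics, not torchtitan's actual implementation):

```python
from enum import Enum

class Phase(Enum):
    NONE = 0    # wait: profiler off
    WARMUP = 1  # warming up, samples discarded
    RECORD = 2  # actively recording

def make_schedule(wait, warmup, active, repeat=0):
    """Mimic torch.profiler.schedule: map a step index to a phase.

    Each cycle is `wait` idle steps, then `warmup`, then `active`
    recording steps; with repeat > 0 the profiler stops after that
    many cycles.
    """
    cycle = wait + warmup + active
    def schedule(step):
        if repeat and step >= cycle * repeat:
            return Phase.NONE
        pos = step % cycle
        if pos < wait:
            return Phase.NONE
        if pos < wait + warmup:
            return Phase.WARMUP
        return Phase.RECORD
    return schedule
```

Exposing `wait`/`warmup`/`active`/`repeat` as config options lets users trade trace size against coverage without code changes.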
tushar00jain added a commit that referenced this pull request Oct 10, 2025
Summary:
the script adds configuration options to run training locally with ft
enabled

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1812).
* #1840
* #1811
* #1810
* __->__ #1812
* #1809

---------

Co-authored-by: Tushar Jain <[email protected]>
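A hypothetical sketch of what such a launch script might compose for one local replica group; the config path and the `--fault_tolerance.*` flag names are assumptions for illustration, not torchtitan's verified CLI (the block only prints the command rather than executing it):

```shell
# Assumed knobs: replica id and GPU count for one local replica group.
REPLICA_ID=0
NGPU=2
# Compose the (hypothetical) training invocation; dry-run via echo.
CMD="CONFIG_FILE=./debug_model.toml NGPU=$NGPU ./run_train.sh"
CMD="$CMD --fault_tolerance.enable --fault_tolerance.replica_id=$REPLICA_ID"
echo "$CMD"
```

In a real script one would loop over replica ids and launch each group in the background.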
tianyu-l pushed a commit that referenced this pull request Oct 12, 2025
Summary:
Allow disabling the storage of checkpoints related to torchft.

Users then don't have to rely on any external storage, which reduces the setup time needed to get things up and running, and model checkpoints aren't strictly needed once torchft is enabled. If checkpoint storage has issues, this option can also act as a killswitch that completely disables storage so it doesn't impact training.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1810).
* #1856
* #1811
* __->__ #1810

Co-authored-by: Tushar Jain <[email protected]>
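The killswitch described above amounts to gating every checkpoint write behind a single flag. A minimal sketch of the pattern (hypothetical names, not torchtitan's actual config surface):

```python
def maybe_save_checkpoint(step, save_fn, enable_storage=True):
    """Write a checkpoint unless the storage killswitch is off.

    `save_fn` stands in for the real storage backend; with
    enable_storage=False every write becomes a no-op, so flaky
    storage can't stall or crash training.
    """
    if not enable_storage:
        return False  # killswitch engaged: skip storage entirely
    save_fn(step)
    return True
```

The flag short-circuits before any storage code runs, which is what makes it safe to flip when the backend itself is misbehaving.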
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 13, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 15, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 16, 2025
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request Oct 16, 2025
