
Adding logic for cleaning up FT checkpoints #1528


Open · bentherien wants to merge 3 commits into main from ft_checkpoint_cleanup

Conversation

bentherien (Author) commented:

When using semi-sync training, FT checkpoints can take up a considerable amount of storage, and there is currently no mechanism to clean them up. This PR adds:

  • An argument to specify the number of FT checkpoints to keep
  • Clean up functionality within CheckpointManager

For this initial PR, I decided to disable logging for these deletions since it creates too much output, but this is up for discussion.
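
For readers skimming the conversation below, here is a minimal sketch of what keep-latest-k cleanup for FT checkpoints can look like. The free-function form, the `step-<n>` folder layout, and the queue plumbing are assumptions for illustration; in the PR the logic lives on CheckpointManager as `_purge_stale_ft_checkpoints` (see the diff hunks below).

```python
import os
import re
from queue import Queue


def purge_stale_ft_checkpoints(ft_folder: str, keep_latest_k: int, purge_queue: Queue) -> None:
    """Enqueue all but the newest keep_latest_k FT checkpoints for deletion."""
    if keep_latest_k <= 0:  # assume the default of 0 means "never delete"
        return
    if not os.path.isdir(ft_folder):
        return

    # Collect (step, path) pairs for every step-<n> checkpoint directory.
    steps = sorted(
        (int(m.group(1)), os.path.join(ft_folder, name))
        for name in os.listdir(ft_folder)
        if (m := re.fullmatch(r"step-(\d+)", name))
    )

    # Hand everything except the newest keep_latest_k checkpoints to the
    # purge thread, which performs the actual (unlogged) deletion.
    for _, path in steps[:-keep_latest_k]:
        purge_queue.put(path)
```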

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Aug 5, 2025.
bentherien changed the title from "added logic for cleaning up FT checkpoints" to "Adding logic for cleaning up FT checkpoints" on Aug 5, 2025.
tianyu-l (Contributor) left a comment:


This PR seems to be submitted against a very old branch.
We've removed JobConfig from the CheckpointManager signature to make it a generally useful util. Please rebase and respect that.

@@ -680,6 +680,11 @@ class FaultTolerance:
    This is only used when "semi_sync_method" is set.
    """

+   checkpoint_keep_latest_k: int = 0
Contributor:

why can't you use checkpoint.keep_latest_k?

Member:

When torchft is enabled there are two types of checkpoints:

  1. the full checkpoint (this can also be enabled without torchft)
  2. the per-replica checkpoint (specific to torchft's per-step fault tolerance)

I believe this change should only affect 2, so it makes sense to keep it under the FaultTolerance dataclass.
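
For illustration, the two layouts might look like this; the exact folder names are assumptions based on the `_ft_folder` logic quoted below, and the `fault_tolerance.*` config path just mirrors the dataclass name:

```python
import os

folder, ft_replica_id, step = "outputs/checkpoint", 0, 1000

# 1. full checkpoint, governed by the existing checkpoint.keep_latest_k
full_ckpt = os.path.join(folder, f"step-{step}")

# 2. per-replica torchft checkpoint, governed by the new
#    fault_tolerance.checkpoint_keep_latest_k knob
ft_ckpt = os.path.join(folder, f"ft-replica-{ft_replica_id}", f"step-{step}")

print(full_ckpt)  # outputs/checkpoint/step-1000
print(ft_ckpt)    # outputs/checkpoint/ft-replica-0/step-1000
```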

@@ -112,14 +112,19 @@ def purge_thread(purge_queue: queue.Queue):
            if isinstance(path, Terminate):
                return
            assert isinstance(path, str)
            logger.info("Checkpointer is deleting %s.", path)

            if "ft-replica" not in path:
Member:

Can "ft-replica" be a variable instead and use that across the checks? Also it looks like there is a mispelling for the folder name since its "replicat" currently lol, can you update that?

return os.path.join(self.folder, f"ft-replicat-{self.ft_replica_id}")

Author:

Sure, I can do this
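
A sketch of the refactor being requested, with the prefix shared between the folder builder and the purge check (the constant name and the stripped-down class shell are placeholders, not the final patch):

```python
import os

# Single source of truth for the prefix, so the folder name and the purge
# check cannot drift apart; also fixes the "ft-replicat" misspelling.
FT_REPLICA_PREFIX = "ft-replica"


class CheckpointManager:  # placeholder shell for the two methods touched here
    def __init__(self, folder: str, ft_replica_id: int) -> None:
        self.folder = folder
        self.ft_replica_id = ft_replica_id

    def _ft_folder(self) -> str:
        return os.path.join(self.folder, f"{FT_REPLICA_PREFIX}-{self.ft_replica_id}")

    @staticmethod
    def _is_ft_checkpoint(path: str) -> bool:
        # Used by the purge thread instead of a hard-coded string literal.
        return FT_REPLICA_PREFIX in path
```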

@@ -641,6 +647,7 @@ def _ft_save(self, step: int) -> None:
        self.save_future = self.dcp_save(
            self.ft_states, checkpoint_id=checkpoint_id, async_mode=AsyncMode.ASYNC
        )
+       self._purge_stale_ft_checkpoints()
Member:

Do we need to call this here? I thought the purge thread does the deletion.

Author:

Yes, but the directories to purge need to be added to the queue.
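
To make the division of labor concrete: `_ft_save` (via `_purge_stale_ft_checkpoints`) only puts stale directories on the queue, and the purge thread pops and deletes them. A self-contained sketch of the consumer side, assuming `shutil.rmtree` as the deletion mechanism:

```python
import logging
import queue
import shutil

logger = logging.getLogger(__name__)


class Terminate:
    """Sentinel placed on the queue to stop the purge thread (as in the diff)."""


def purge_thread(purge_queue: queue.Queue) -> None:
    while True:
        path = purge_queue.get()
        if isinstance(path, Terminate):
            return
        assert isinstance(path, str)
        if "ft-replica" not in path:
            # Per the PR description, FT-checkpoint deletions are not logged
            # to avoid flooding the output.
            logger.info("Checkpointer is deleting %s.", path)
        shutil.rmtree(path, ignore_errors=True)  # assumed deletion mechanism
```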

H-Huang requested a review from tushar00jain on August 5, 2025.
bentherien force-pushed the ft_checkpoint_cleanup branch from c19edff to 2818959 on August 5, 2025.
bentherien requested a review from tianyu-l on August 5, 2025.