
Conversation

@ruisizhang123 (Contributor) commented on Oct 7, 2025

This PR adds an aten-level autobucketing pass to SimpleFSDP. It runs autobucketing with the aot_eager backend, without Inductor. The aten FX autobucketing pass can be found in pytorch/pytorch#163960.

Key updates are:

  1. Support a customized `aot_eager_autobucketing` backend that performs the autobucketing optimization.
  2. In SimpleFSDP, the model backend can be replaced with a user's customized passes via `compile.model_backend_override` (see the sketch below).
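
A minimal, hedged sketch of how these two pieces could fit together at the compile call site; `get_compile_backend` and the config field names mirror the description above, but the exact torchtitan code may differ:

```python
import torch

def get_compile_backend(backend_name):
    # In the PR this returns either a built-in backend name ("inductor",
    # "aot_eager", ...) or a custom callable backend; shown as a pass-through
    # here for brevity.
    return backend_name

def compile_model(model, compile_config):
    # The per-model override (e.g. "aot_eager_autobucketing") wins over the
    # global backend when it is set.
    backend = get_compile_backend(
        compile_config.model_backend_override or compile_config.backend
    )
    return torch.compile(model, backend=backend, fullgraph=True)
```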

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 7, 2025
@ruisizhang123 force-pushed the ruisi/aot_eager_pass branch 4 times, most recently from 7be92c7 to 56049a8 on October 9, 2025 22:02
if job_config.compile.enable and "model" in job_config.compile.components:
torch._inductor.config.reorder_for_peak_memory = False
model = torch.compile(model, backend=job_config.compile.backend, fullgraph=True)
from torch._dynamo.backends.common import aot_autograd as auto_autograd_backend

nit: auto_autograd_backend -> aot_autograd_backend

@ruisizhang123 changed the title from "[wip] add auto_eager_graph_pass" to "add auto_eager_graph_pass" on Oct 9, 2025
@tianyu-l (Contributor) left a comment

Should we make this configurable, instead of always turning it on? The default could be "on", but I think researchers may want to be able to play with the non-optimized version as well.

Also, more documentation is needed in the code; the flags look mysterious.

@ruisizhang123 (Contributor, Author) replied:

> Should we make this configurable, instead of always turning it on? The default could be "on", but I think researchers may want to be able to play with the non-optimized version as well.
>
> Also, more documentation is needed in the code; the flags look mysterious.

yesss, it's not ready for review yet...

@ruisizhang123 force-pushed the ruisi/aot_eager_pass branch 3 times, most recently from 2f7415e to 0d70c22 on October 10, 2025 01:09
return model


def get_compile_backend(backend_name: str) -> Union[str, callable]:
Contributor

Given the complexity, I think we should start putting things into separate files. E.g., this function can go to compile_utils.py or backend.py. The original file can stay as simple_fsdp.py for now. WDYT?

Contributor Author

Yes, this makes sense to me.
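
A hedged sketch of the shape such a backend.py helper could take once split out; the `aot_eager_autobucketing` branch is illustrative, and the pass wiring follows the `schedule_overlap_bucketing` call that appears in the traceback later in this thread, not necessarily the final code:

```python
from typing import Callable, Union

def get_compile_backend(backend_name: str) -> Union[str, Callable]:
    if backend_name == "aot_eager_autobucketing":
        from torch._dynamo.backends.common import aot_autograd as aot_autograd_backend
        from torch._inductor.fx_passes.overlap_scheduling import (
            schedule_overlap_bucketing,
        )

        def autobucketing_pass(gm, example_inputs):
            # Bucket and reorder collectives on the aten-level fw/bw graphs,
            # then hand the transformed graph back to AOTAutograd.
            schedule_overlap_bucketing(gm)
            gm.recompile()
            return gm

        return aot_autograd_backend(
            fw_compiler=autobucketing_pass, bw_compiler=autobucketing_pass
        )

    # Built-in torch.compile backend names are passed through unchanged.
    return backend_name
```

Since torch.compile accepts either a registered backend name or a callable, the caller does not need to care which of the two cases it gets back.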

Comment on lines 629 to 630
model_backend: str = "inductor"
loss_backend: str = "inductor"
Contributor

I somehow feel it's not worth changing the global config yet, as people who don't use SimpleFSDP won't have the motivation to separate them.

For now you could extend JobConfig following https://github.com/pytorch/torchtitan/blob/main/docs/extension.md#extending-jobconfig
and define a new config, e.g. compile.model_backend_override, which defaults to None, so that you could set

backend=compile_config.model_backend_override or compile_config.backend,
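
A hedged sketch of what that JobConfig extension could look like (class and field names are illustrative; the actual extension mechanism is the one described in docs/extension.md):

```python
from dataclasses import dataclass, field

@dataclass
class Compile:
    enable: bool = False
    backend: str = "inductor"
    # None means "use `backend` as-is"; set to e.g. "aot_eager_autobucketing"
    # to swap in the custom SimpleFSDP backend for the model only.
    model_backend_override: str | None = None

@dataclass
class JobConfig:
    compile: Compile = field(default_factory=Compile)
```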

"""Which components to compile"""
backend: str = "inductor"

simplefsdp_backend_override: str | None = None
Contributor Author

my bad, updated

@ruisizhang123 force-pushed the ruisi/aot_eager_pass branch 4 times, most recently from 24a1e8b to 9717183 on October 10, 2025 05:07
Contributor

Should we add a test for aot_eager_autobucketing?

Contributor Author

I can add this test. It depends on pytorch/pytorch#165063; I will add it once that PyTorch PR is merged.

@ruisizhang123 force-pushed the ruisi/aot_eager_pass branch 2 times, most recently from 0c8187d to b9d3cf7 on October 13, 2025 16:37
@tianyu-l (Contributor)

Could you rebase onto #1871 before merging?

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Oct 14, 2025:

When the autobucketing pass is registered as the aot_eager backend's `fw_compiler` and `bw_compiler`, this PR ensures that tensors are all-gathered on the "cpu"/"cuda" device instead of the "meta" device.

When we call `dist.all_gather_object`, it creates a new byte storage outside `no_dispatch` [here](https://github.com/pytorch/pytorch/blob/a2e2e1d8c026951baa345f0dd17668bd1718eda5/torch/distributed/distributed_c10d.py#L3303), which ends up on the meta device. Thus, I updated the code to use `unset_fake_temporarily`, which gathers real tensors from the other ranks.
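
A hedged illustration of that pattern (the helper name and the `unset_fake_temporarily` import path are assumptions based on this description, not the exact upstream diff):

```python
import torch.distributed as dist
from torch._subclasses.fake_tensor import unset_fake_temporarily

def gather_runtime_estimations(runtime_estimations, pg):
    gathered = [None] * dist.get_world_size(pg)
    # Temporarily exit fake-tensor tracing so dist.all_gather_object builds
    # real cpu/cuda byte storage rather than meta storage during compilation.
    with unset_fake_temporarily():
        dist.all_gather_object(gathered, runtime_estimations, pg)
    return gathered
```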

This is needed to unblock the aot_eager + autobucketing pass in this [PR](pytorch/torchtitan#1813).

Otherwise, I hit the following error:

```bash
  traceback : Traceback (most recent call last):
    File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 358, in wrapper
      return f(*args, **kwargs)
    File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 607, in train
      self.train_step(data_iterator)
      ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 507, in train_step
      loss = self.forward_backward_step(input_dict, labels)
    File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 483, in forward_backward_step
      pred = model_parts[0](inputs, **extra_inputs, **extra_args)
    File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 418, in __call__
      return super().__call__(*args, **kwargs)
             ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1795, in _call_impl
      return forward_call(*args, **kwargs)
    File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 901, in compile_wrapper
      raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2359, in _call_user_compiler
      raise BackendCompilerFailed(
          self.compiler_fn, e, inspect.currentframe()
      ).with_traceback(e.__traceback__) from None
    File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2334, in _call_user_compiler
      compiled_fn = compiler_fn(gm, example_inputs)
    File "/home/ruisizhang123/pytorch/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
      compiled_gm = compiler_fn(gm, example_inputs)
    File "/home/ruisizhang123/pytorch/torch/__init__.py", line 2441, in __call__
      return self.compiler_fn(model_, inputs_, **self.kwargs)
             ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__
      cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
    File "/home/ruisizhang123/pytorch/torch/_functorch/aot_autograd.py", line 1100, in aot_module_simplified
      compiled_fn, _ = aot_stage2_compile(
                       ~~~~~~~~~~~~~~~~~~^
          aot_state,
          ^^^^^^^^^^
      ...<4 lines>...
          inference_compiler,
          ^^^^^^^^^^^^^^^^^^^
      )
      ^
    File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 257, in aot_stage2_compile
      return aot_stage2_autograd(aot_state, aot_graph_capture)
    File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 1696, in aot_stage2_autograd
      compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
    File "/home/ruisizhang123/torchtitan/torchtitan/experiments/simple_fsdp/backend.py", line 35, in aten_autobucketing_reordering_pass
      schedule_overlap_bucketing(gm)
      ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
    File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 755, in schedule_overlap_bucketing
      ).run()
        ~~~^^
    File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 358, in run
      self._align_compute_nodes_runtime_estimations_across_all_distributed_ranks()
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
    File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 337, in _align_compute_nodes_runtime_estimations_across_all_distributed_ranks
      dist.all_gather_object(
      ~~~~~~~~~~~~~~~~~~~~~~^
          gathered_runtime_estimations, runtime_estimations, pg
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      )
      ^
    File "/home/ruisizhang123/pytorch/torch/distributed/c10d_logger.py", line 82, in wrapper
      return func(*args, **kwargs)
    File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3170, in all_gather_object
      input_tensor, local_size = _object_to_tensor(obj, current_device, group)
                                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3079, in _object_to_tensor
      byte_tensor = torch.ByteTensor(byte_storage).to(device)
                    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  torch._dynamo.exc.BackendCompilerFailed: backend='compiler_fn' raised:
  RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "meta".  This is no longer allowed; the devices must match.

  Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

```

Pull Request resolved: #165063
Approved by: https://github.com/eellison
@ruisizhang123 merged commit d0e2545 into main on Oct 14, 2025 (4 of 5 checks passed)
@ruisizhang123 deleted the ruisi/aot_eager_pass branch on October 14, 2025 21:16
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request on Oct 15, 2025.
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request on Oct 15, 2025.
githubsgi pushed a commit to githubsgi/torchtitan that referenced this pull request on Oct 16, 2025.