
[torch.compile] Make HiDream torch.compile ready #11477


Draft
sayakpaul wants to merge 8 commits into main

Conversation

sayakpaul
Member

@sayakpaul sayakpaul commented May 1, 2025

What does this PR do?

Part of #11430

Trying to make the HiDream model fully compatible with torch.compile(), but it currently fails with:
https://pastebin.com/EbCFqBvw

To reproduce, run the following on a GPU machine:

RUN_COMPILE=1 RUN_SLOW=1 pytest tests/models/transformers/test_models_transformer_hidream.py -k "test_torch_compile_recompilation_and_graph_break"

I am on the following environment:

- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-6.8.0-55-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.7.0+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.51.3
- Accelerate version: 1.6.0.dev0
- PEFT version: 0.15.2.dev0
- Bitsandbytes version: 0.45.3
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 4090, 24564 MiB
NVIDIA GeForce RTX 4090, 24564 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

@anijain2305 @StrongerXi would you have any pointers?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -392 to +394
-tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0)
+count_freq = torch.bincount(flat_expert_indices, minlength=self.num_activated_experts)
+tokens_per_expert = count_freq.cumsum(dim=0)

Member Author


Just reimplemented it to eliminate the numpy() dependency.
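
For reference, a tiny standalone sketch (with made-up expert indices) of why this helps: the old form round-trips through CPU/NumPy, which torch.compile cannot trace, while the new form stays on-device in pure torch ops and yields the same cumulative counts.

import torch

flat_expert_indices = torch.tensor([0, 2, 2, 1, 3, 0])
num_activated_experts = 4

# Old path: host sync plus a NumPy round trip, opaque to torch.compile.
old = flat_expert_indices.bincount().cpu().numpy().cumsum(0)

# New path: traceable torch ops; minlength also guarantees a fixed-length
# result even when trailing experts receive no tokens.
count_freq = torch.bincount(flat_expert_indices, minlength=num_activated_experts)
new = count_freq.cumsum(dim=0)

print(old, new.tolist())  # [2 3 5 6] [2, 3, 5, 6]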

@require_torch_2
@is_torch_compile
@slow
def test_torch_compile_recompilation_and_graph_break(self):
Member Author


Relevant test for this PR.

@StrongerXi

The graph break seems to be induced by @torch.no_grad:

@anijain2305 is this known?

@sayakpaul
Member Author

The graph break seems to be induced by @torch.no_grad:

@anijain2305 is this known?

Even if we remove the decorator, it still fails with the same error.

Contributor

@anijain2305 anijain2305 left a comment


LGTM

Edit: I checked the messages and missed that there is still a graph break. I can take a look today.

@sayakpaul
Member Author

sayakpaul commented May 8, 2025 via email

@anijain2305
Contributor

Not a useful update, but there seems to be a dynamic-shapes graph break here, coming from the moe_infer function.

cc @laithsakka

@sayakpaul
Member Author

@anijain2305

Okay, I think I know why this is happening. The line that primarily causes this shape change is:

hidden_states = torch.cat([hidden_states, initial_encoder_hidden_states], dim=1)

This is why the moe_infer() function, when called with single_stream_blocks, complains about the shape changes.
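
A toy illustration (made-up sizes, not the real HiDream dimensions) of the effect: concatenating the encoder states onto the image tokens along dim=1 changes the sequence length, so the single-stream blocks, and the MoE inside them, see a different token count than the double-stream blocks did.

import torch

hidden_states = torch.randn(2, 16, 8)                 # (batch, image_tokens, dim)
initial_encoder_hidden_states = torch.randn(2, 4, 8)  # (batch, text_tokens, dim)
hidden_states = torch.cat([hidden_states, initial_encoder_hidden_states], dim=1)
print(hidden_states.shape)  # torch.Size([2, 20, 8]): the token dimension grew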

So, I tried with dynamic=True along with

torch._dynamo.config.capture_dynamic_output_shape_ops = True
torch.fx.experimental._config.use_duck_shape = False

It then complains:

msg = 'dynamic shape operator: aten.bincount.default; Operator does not have a meta kernel that supports dynamic output shapes, please report an issue to PyTorch'

Keeping this open for better tracking.

@sayakpaul
Member Author

Cc: @StrongerXi for the above observation too.

@StrongerXi

On it.

@sayakpaul sayakpaul added the performance (Anything related to performance improvements, profiling and benchmarking) and torch.compile labels on Jun 10, 2025
@StrongerXi

Okay, I spent some time digging into the MoE stuff; here's what I learned:

  1. HiDream has 2 branches in the MoE FFN layer, and it looks like the moe_infer branch is meant to speed up inference by explicitly skipping the experts that receive no tokens. However, that's really bad for torch.compile because (a) it creates a hard-to-resolve data dependency (the branching depends on the output of torch.bincount, which in turn depends on the data of flat_expert_indices), and (b) even if we solve (a), we'd face lots of recompilations, because torch.compile would compile each possible execution path (e.g., experts 1 & 3 firing, or experts 1, 2, 4 firing, etc.).

    for i, end_idx in enumerate(tokens_per_expert):
        start_idx = 0 if i == 0 else tokens_per_expert[i - 1]
        if start_idx == end_idx:
            continue

  2. Then I did some benchmarking, and it turns out moe_infer isn't faster than the "training branch"; they produce identical output images, and torch.compile gives much lower e2e latency with the "training branch" (a toy equivalence sketch follows this list):

    if self.training and not self._force_inference_output:
        x = x.repeat_interleave(self.num_activated_experts, dim=0)
        y = torch.empty_like(x, dtype=wtype)
        for i, expert in enumerate(self.experts):
            y[flat_topk_idx == i] = expert(x[flat_topk_idx == i]).to(dtype=wtype)
        y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
        y = y.view(*orig_shape).to(dtype=wtype)
        # y = AddAuxiliaryLoss.apply(y, aux_loss)
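
For concreteness, a self-contained toy version of the two branches (made-up sizes, random routing, not the actual HiDream module or weights), showing they compute the same per-token weighted sum of expert outputs:

import torch

torch.manual_seed(0)
num_experts, top_k, tokens, dim = 4, 2, 6, 8
experts = torch.nn.ModuleList(torch.nn.Linear(dim, dim, bias=False) for _ in range(num_experts))

x = torch.randn(tokens, dim)
topk_idx = torch.randint(0, num_experts, (tokens, top_k))  # toy routing
topk_weight = torch.rand(tokens, top_k)
flat_topk_idx = topk_idx.view(-1)

# "Training" branch: run every (token, expert) pair densely; shapes stay static.
x_rep = x.repeat_interleave(top_k, dim=0)
y_train = torch.empty_like(x_rep)
for i, expert in enumerate(experts):
    mask = flat_topk_idx == i
    y_train[mask] = expert(x_rep[mask])
y_train = (y_train.view(tokens, top_k, dim) * topk_weight.unsqueeze(-1)).sum(dim=1)

# "moe_infer"-style branch: sort token slots by expert and skip empty experts.
# The .tolist() / skip logic is exactly the data-dependent control flow that
# torch.compile struggles with.
idxs = flat_topk_idx.argsort()
tokens_per_expert = torch.bincount(flat_topk_idx, minlength=num_experts).cumsum(dim=0)
token_idxs = idxs // top_k
y_infer = torch.zeros_like(x)
start = 0
for i, end in enumerate(tokens_per_expert.tolist()):
    if start != end:
        sel, tok = idxs[start:end], token_idxs[start:end]
        out = experts[i](x[tok]) * topk_weight.view(-1)[sel].unsqueeze(-1)
        y_infer.index_add_(0, tok, out)
    start = end

print(torch.allclose(y_train, y_infer, atol=1e-6))  # True: both branches agree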

Then I just had to fix a small graph break here, where img_sizes is supposed to be a List[Tuple[int, int]] but was being computed as a tensor:

def unpatchify(self, x: torch.Tensor, img_sizes: List[Tuple[int, int]], is_training: bool) -> List[torch.Tensor]:

        # create img_sizes
        img_sizes = torch.tensor([patch_height, patch_width], dtype=torch.int64, device=device).reshape(-1)
        img_sizes = img_sizes.unsqueeze(0).repeat(batch_size, 1)

The fix is simple:

        # create img_sizes
        #img_sizes = torch.tensor([patch_height, patch_width], dtype=torch.int64, device=device).reshape(-1)
        #img_sizes = img_sizes.unsqueeze(0).repeat(batch_size, 1)
        img_sizes = [[patch_height, patch_width]] * batch_size

Here are the e2e pipeline benchmark results using the HiDream demo script and compiling the transformer:

# pytorch 2.7.1
#
# original eager:     26.6s, compiled 24.8s (fullgraph=False)
# train-branch eager: 25.9s, compiled 19.5s (fullgraph=True)

I also saw that ComfyUI uses the training branch too. So maybe we should just use the training branch in eager mode as well? Or we could add a torch.compiler.is_compiling() check to use the training branch only under compile. What do you think, @sayakpaul?
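
A minimal sketch of that second option (a toy module, not the actual HiDream MoE; the two branch methods are trivial placeholders standing in for the existing code paths):

import torch

class MoEDispatchSketch(torch.nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.ff = torch.nn.Linear(dim, dim)

    def _train_like_branch(self, x):  # placeholder for the dense "training" branch
        return self.ff(x)

    def _moe_infer_branch(self, x):   # placeholder for the token-skipping branch
        return self.ff(x)

    def forward(self, x):
        # Route to the shape-static branch whenever we are training or being
        # traced by torch.compile; keep the eager fast path otherwise.
        if self.training or torch.compiler.is_compiling():
            return self._train_like_branch(x)
        return self._moe_infer_branch(x)

module = MoEDispatchSketch().eval()
compiled = torch.compile(module, fullgraph=True)  # tracing takes the dense branch
out = compiled(torch.randn(2, 8))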

@sayakpaul
Member Author

Wow, this is terrific KT. Thanks, Ryan!

Or we could add a torch.compiler.is_compiling() to use the training branch under compile only.

This is a good approach and is worth adding. @yiyixuxu what are your thoughts?

Labels
performance, torch.compile