
Commit 867a2b0

sayakpaul and stevhliu authored
[Hunyuan] add optimization related sections to the hunyuan dit docs. (huggingface#8402)
* optimizations to the hunyuan dit docs.
* Apply suggestions from code review
* Update docs/source/en/api/pipelines/hunyuandit.md

Co-authored-by: Steven Liu <[email protected]>
1 parent 98730c5 commit 867a2b0

File tree

1 file changed: +55 -1 lines changed


docs/source/en/api/pipelines/hunyuandit.md

Lines changed: 55 additions & 1 deletion
@@ -28,11 +28,65 @@ HunyuanDiT has the following components:
* It uses a diffusion transformer as the backbone
* It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

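As a quick, minimal sketch of the scheduler tradeoff mentioned in the tip, you can rebuild a different scheduler from the current one's config with `from_config`; `DPMSolverMultistepScheduler` is used here only as an illustration, and whether it improves speed or quality for this checkpoint is something to verify with the guide above.

```python
import torch
from diffusers import DPMSolverMultistepScheduler, HunyuanDiTPipeline

pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

# Swap the default scheduler for another one built from the same config.
# DPMSolverMultistepScheduler is only an example; compare speed and quality yourself.
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
```
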
## Optimization
You can optimize the pipeline's runtime and memory consumption with `torch.compile` and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides.

### Inference
Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
from diffusers import HunyuanDiTPipeline
import torch

pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)

# "An astronaut riding a horse"
image = pipeline(prompt="一个宇航员在骑马").images[0]
```

The [benchmark](https://gist.github.com/sayakpaul/29d3a14905cfcbf611fe71ebd22e9b23) results on an 80GB A100 machine are:

```bash
With torch.compile(): Average inference time: 12.470 seconds.
Without torch.compile(): Average inference time: 20.570 seconds.
```
### Memory optimization

By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GB of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details.

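The script is the reference for the full recipe (it also pre-computes the prompt embeddings and frees the text encoders to stay under that budget). As a rough sketch of the core idea, assuming `bitsandbytes` is installed and that the repository stores the T5 encoder under the `text_encoder_2` subfolder, quantize the encoder before handing it to the pipeline:

```python
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import HunyuanDiTPipeline

# Load only the large T5 text encoder in 8-bit (requires bitsandbytes).
text_encoder_2 = T5EncoderModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    subfolder="text_encoder_2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

# Reuse the quantized encoder; the remaining components are loaded in fp16.
pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers",
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.float16,
)
# Refer to the script above for the device placement and prompt pre-encoding
# steps that keep peak VRAM low end to end.
```
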
Furthermore, you can use the [`~HunyuanDiT2DModel.enable_forward_chunking`] method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime.
```diff
+ pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1)
```
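
For context, applied to the pipeline loaded earlier, that could look like the following (a sketch; the chunk size that works best is workload-dependent):

```python
# Run the transformer's feed-forward layers in chunks to lower peak memory,
# at the cost of some inference speed.
pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1)
image = pipeline(prompt="一个宇航员在骑马").images[0]  # "An astronaut riding a horse"
```
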
## HunyuanDiTPipeline

[[autodoc]] HunyuanDiTPipeline
