Describe the bug
I create the pipeline once and reuse it to generate multiple images with the same settings. The first inference takes about 8 minutes, but the next one takes about 30 minutes; the logs below show the per-step time jumping from ≈12 s/it to ≈54 s/it. VRAM usage remains the same across runs.
Tested on 8 GB + 8 GB.
P.S. I have used AuraFlow, Sana, Hunyuan, LTX, Cog, and several other pipelines, but didn't encounter this issue with any of them.
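
For reference, the per-call slowdown can be quantified with a small wall-clock wrapper (a sketch of mine, not part of the original report; `timed_generate` is a hypothetical helper and assumes the `pipe` built in the Reproduction section below):

import time
import torch

def timed_generate(pipe, **kwargs):
    # Synchronize before and after so queued GPU work is included
    # in the wall-clock measurement.
    torch.cuda.synchronize()
    start = time.perf_counter()
    image = pipe(**kwargs).images[0]
    torch.cuda.synchronize()
    print(f"generation took {time.perf_counter() - start:.1f} s")
    return image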
Reproduction
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
from transformers import T5EncoderModel

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=dtype,
)

# Load the transformer and the T5 text encoder in 4-bit NF4.
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    bfl_repo,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
text_encoder_2 = T5EncoderModel.from_pretrained(
    bfl_repo,
    subfolder="text_encoder_2",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)

# Build the pipeline without these two components, then attach the quantized ones.
pipe = FluxPipeline.from_pretrained(
    bfl_repo,
    transformer=None,
    text_encoder_2=None,
    torch_dtype=dtype,
)
pipe.transformer = transformer_4bit
pipe.text_encoder_2 = text_encoder_2

# LoRA from https://civitai.com/models/1111989/majicflus-beauty
pipe.load_lora_weights(
    "./models/lora/flux_dev/majicbeauty1.safetensors",
    adapter_name="majicbeauty1",
)
pipe.set_adapters("majicbeauty1", adapter_weights=0.8)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = (
    "Photograph capturing a woman seated in a car, looking straight ahead. Her face is "
    "partially obscured, making her expression hard to read, adding an air of mystery. "
    "Natural light filters through the car window, casting subtle reflections and shadows "
    "on her face and the interior. The colors are muted yet realistic, with a slight grain "
    "that evokes a 1970s film quality. The scene feels intimate and contemplative, "
    "capturing a quiet, introspective moment, mj"
)

# First generation: ~8 minutes (≈12 s/it, see logs).
image = pipe(
    prompt=prompt,
    width=1072,
    height=1920,
    max_sequence_length=512,
    num_inference_steps=40,
    guidance_scale=50,
    generator=torch.Generator().manual_seed(1349562290),
).images[0]
image.save("out_majicbeauty5.png")

torch.cuda.empty_cache()

# Second generation with the same pipe: drastically slower (≈54 s/it, see logs).
image = pipe(
    prompt=prompt,
    width=1072,
    height=1920,
    max_sequence_length=512,
    num_inference_steps=50,
    guidance_scale=40,
    generator=torch.Generator().manual_seed(1349562290),
).images[0]
image.save("out_majicbeauty6.png")
Logs
Fetching 3 files: 100%|█████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 440.05it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 2/2 [00:27<00:00, 13.90s/it]
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████| 5/5 [00:00<00:00, 5.12it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (95 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['. the scene feels intimate and contemplative, capturing a quiet, introspective moment, mj']
100%|█████████████████████████████████████████████████████████████| 40/40 [08:10<00:00, 12.25s/it]
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['. the scene feels intimate and contemplative, capturing a quiet, introspective moment, mj']
4%|██▍ | 2/50 [01:52<43:27, 54.32s/it]
System Info
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Windows-10-10.0.26100-SP0
- Running on Google Colab?: No
- Python version: 3.10.11
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.27.1
- Transformers version: 4.48.1
- Accelerate version: 1.4.0.dev0
- PEFT version: 0.14.1.dev0
- Bitsandbytes version: 0.45.1
- Safetensors version: 0.5.2
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 4060 Laptop GPU, 8188 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?: