
Reusing the same pipeline (FluxPipeline) increases the inference duration #10705

Closed
@nitinmukesh

Description

Describe the bug

I create the pipeline once and reuse it to generate multiple images with the same settings. The first inference takes about 8 minutes; the next takes about 30 minutes. VRAM usage remains the same throughout.

Tested on 8 GB VRAM + 8 GB shared GPU memory.

P.S. I have used the AuraFlow, Sana, Hunyuan, LTX, Cog, and several other pipelines, but didn't encounter this issue with any of them.
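To quantify the per-call slowdown, each pipe(...) call can be wrapped in a wall-clock timer, as in the minimal sketch below (timed_generation is a hypothetical diagnostic helper, not part of the repro script):

import time

def timed_generation(pipe, label, **kwargs):
    # Hypothetical helper: wall-clock timing around a single pipeline call.
    start = time.perf_counter()
    image = pipe(**kwargs).images[0]
    print(f"{label}: {time.perf_counter() - start:.1f} s")
    return image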

Reproduction

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
from huggingface_hub import hf_hub_download
from transformers import T5EncoderModel

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16
# Shared 4-bit NF4 quantization config for the transformer and T5 text encoder
quantization_config = DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

# Flux transformer quantized to 4-bit NF4
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    bfl_repo,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
# T5 text encoder, also quantized to 4-bit
text_encoder_2 = T5EncoderModel.from_pretrained(
    bfl_repo, 
    subfolder="text_encoder_2",
    quantization_config=quantization_config,
    torch_dtype=dtype
)
# Assemble the pipeline without the two large components, then attach the quantized ones
pipe = FluxPipeline.from_pretrained(
    bfl_repo, 
    transformer=None, 
    text_encoder_2=None, 
    torch_dtype=dtype
)
pipe.transformer = transformer_4bit
pipe.text_encoder_2 = text_encoder_2

# https://civitai.com/models/1111989/majicflus-beauty
pipe.load_lora_weights(
    "./models/lora/flux_dev/majicbeauty1.safetensors", 
    adapter_name="majicbeauty1"
)

pipe.set_adapters("majicbeauty1", adapter_weights=0.8)
# CPU offload plus VAE tiling/slicing to fit in 8 GB of VRAM
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "Photograph capturing a woman seated in a car, looking straight ahead. Her face is partially obscured, making her expression hard to read, adding an air of mystery. Natural light filters through the car window, casting subtle reflections and shadows on her face and the interior. The colors are muted yet realistic, with a slight grain that evokes a 1970s film quality. The scene feels intimate and contemplative, capturing a quiet, introspective moment, mj"
# First generation (40 steps): runs at ~12 s/it per the logs below
image = pipe(
    prompt=prompt,
    width=1072,
    height=1920,
    max_sequence_length=512,
    num_inference_steps=40,
    guidance_scale=50,
    generator=torch.Generator().manual_seed(1349562290),
).images[0]
image.save("out_majicbeauty5.png")
torch.cuda.empty_cache()

# Second generation with the same pipeline: drops to ~54 s/it per the logs below
image = pipe(
    prompt=prompt,
    width=1072,
    height=1920,
    max_sequence_length=512,
    num_inference_steps=50,
    guidance_scale=40,
    generator=torch.Generator().manual_seed(1349562290),
).images[0]
image.save("out_majicbeauty6.png")

Logs

Fetching 3 files: 100%|█████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 440.05it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 2/2 [00:27<00:00, 13.90s/it]
Loading pipeline components...:   0%|                                       | 0/5 [00:00<?, ?it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████| 5/5 [00:00<00:00,  5.12it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (95 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['. the scene feels intimate and contemplative, capturing a quiet, introspective moment, mj']
100%|█████████████████████████████████████████████████████████████| 40/40 [08:10<00:00, 12.25s/it]
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['. the scene feels intimate and contemplative, capturing a quiet, introspective moment, mj']
  4%|██▍                                                           | 2/50 [01:52<43:27, 54.32s/it]
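For scale: the first run's 40 steps at 12.25 s/it come to about 8.2 minutes, matching the reported 8 min, while the second run's 54.32 s/it projects to roughly 45 minutes over 50 steps, consistent with the reported slowdown.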

System Info

  • 🤗 Diffusers version: 0.33.0.dev0
  • Platform: Windows-10-10.0.26100-SP0
  • Running on Google Colab?: No
  • Python version: 3.10.11
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.27.1
  • Transformers version: 4.48.1
  • Accelerate version: 1.4.0.dev0
  • PEFT version: 0.14.1.dev0
  • Bitsandbytes version: 0.45.1
  • Safetensors version: 0.5.2
  • xFormers version: not installed
  • Accelerator: NVIDIA GeForce RTX 4060 Laptop GPU, 8188 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@yiyixuxu @DN6
