
follow-up refactor on lumina2 #10776

Merged: 9 commits merged into main on Feb 15, 2025
Conversation

yiyixuxu (Collaborator) commented on Feb 12, 2025:

This PR:

  1. Refactors and simplifies RoPE: removed all the logic related to different image sizes (we do not need to support this for inference).
  2. For now, switches the default for use_mask_in_transformer to False because:
    • for a single prompt (the most common use case), the outputs are identical and we get a performance gain with use_mask_in_transformer=False (see details in #10776 (comment));
    • for a list of prompts, the mask should be used (see context in Add support for lumina2 #10642 (comment)). I think we could remove this argument from the pipeline entirely and automatically use the mask when a list of prompts is passed (and otherwise set it to False); that idea is sketched below.
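A minimal sketch of that auto-selection idea (the helper name is hypothetical, not actual pipeline code):

# Hypothetical helper illustrating the proposed default (not the actual
# diffusers implementation): a single prompt never needs padding, so the
# attention mask only matters when a batch of prompts is passed.
def should_use_mask(prompt) -> bool:
    return isinstance(prompt, (list, tuple)) and len(prompt) > 1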

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- encoder_hidden_states = layer(
-     encoder_hidden_states, attention_mask if use_mask_in_transformer else None, encoder_rotary_emb
- )
+ encoder_hidden_states = layer(encoder_hidden_states, encoder_attention_mask, context_rotary_emb)
yiyixuxu (Collaborator, Author) commented:

The slight difference we see in the output without the mask actually comes from here. I didn't see any effect on speed, so I set it to always use encoder_attention_mask for the context_refiner layers.

With this, for a single prompt we get identical output for use_mask_in_transformer=True and use_mask_in_transformer=False.
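As a sanity check on why a single prompt is unaffected: with no padding the attention mask is all True, so masked and unmasked attention coincide. A standalone PyTorch sketch (illustrative shapes, not diffusers code):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# With a single prompt there is no padding, so the mask is all True.
full_mask = torch.ones(1, 1, 16, 16, dtype=torch.bool)

out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=full_mask)
out_unmasked = F.scaled_dot_product_attention(q, k, v, attn_mask=None)
print(torch.allclose(out_masked, out_unmasked))  # True: all-True mask is a no-op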

Testing script:
# test lumina2
import torch
from diffusers import Lumina2Text2ImgPipeline
import itertools
from pathlib import Path
import shutil


device = "cuda:1"

branch = "refactor_lumina2"
# branch = "main"
params = {
    'use_mask_in_transformer': [True, False],  
}

# Generate all combinations
param_combinations = list(itertools.product(*params.values()))

# Create output directory (remove if exists)
output_dir = Path(f"yiyi_test_6_outputs_{branch}")
if output_dir.exists():
    shutil.rmtree(output_dir)
output_dir.mkdir(exist_ok=True)

pipe = Lumina2Text2ImgPipeline.from_pretrained("Alpha-VLLM/Lumina-Image-2.0", torch_dtype=torch.bfloat16).to(device)

prompt = [
    "focused exterior view on a living room of a limestone rock regular blocks shaped villa with sliding windows and timber screens in Provence along the cliff facing the sea, with waterfalls from the roof to the pool, designed by Zaha Hadid, with rocky textures and form, made of regular giant rock blocks stacked each other with infinity edge pool in front of it, blends in with the surrounding nature. Regular rock blocks. Giant rock blocks shaping the space. The image to capture the infinity edge profile of the pool and the flow of water going down creating a waterfall effect. Adriatic Sea. The design is sustainable and semi prefab. The photo is shot on a canon 5D mark 4",
    # "A capybara holding a sign that reads Hello World"
]

# Run test for each combination
for (mask,) in param_combinations:
    print(f"\nTesting combination:")
    print(f"  use_mask_in_transformer: {mask}")
    
    # Generate image
    generator = torch.Generator(device=device).manual_seed(0)
    images = pipe(
        prompt=prompt,
        num_inference_steps=25,
        use_mask_in_transformer=mask,
        generator=generator,
    ).images
    
    # Save images
    for i, image in enumerate(images):
        output_path = output_dir / f"output_mask{int(mask)}_prompt{i}.png"
        image.save(output_path)
        print(f"Saved to: {output_path}")

asomoza (Member) commented on Feb 13, 2025:

Nice! I can confirm that I get the same image, which in turn is better for text generation, since it sometimes failed before without a mask.

yiyixuxu (Collaborator, Author) commented:

@asomoza @a-r-r-o-w @hlky
Are we OK with removing the use_mask_in_transformer argument from the pipeline (so users can no longer set it) and only using the mask in the transformer when we need to (multiple prompts -> multiple values for cap_seq_len in the transformer)?
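For context on the multiple-prompts case, a generic illustration (using an arbitrary public tokenizer, not necessarily Lumina2's) of why a prompt list produces different caption lengths and therefore needs padding plus a mask:

from transformers import AutoTokenizer

# Illustrative only: any tokenizer shows the effect.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
prompts = ["a cat", "a capybara holding a sign that reads Hello World"]
enc = tok(prompts, padding=True, return_tensors="pt")
print(enc["input_ids"].shape)        # batch padded to the longest prompt
print(enc["attention_mask"].sum(1))  # per-prompt real lengths differ -> mask needed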

a-r-r-o-w (Member) commented:

Sounds good @yiyixuxu

yiyixuxu merged commit 69f919d into main on Feb 15, 2025 (15 checks passed) and deleted the lumina2-refactor branch on February 15, 2025 at 00:57.