
Stable Diffusion XL for Gaudi #619

Merged: 19 commits merged into huggingface:main on Jan 9, 2024

Conversation

@dsocek (Contributor) commented Dec 29, 2023

What does this PR do?

  • Implements Stable Diffusion XL (SDXL) pipeline for Gaudi
  • Implements Euler Discrete and Euler Ancestral Discrete schedulers for Gaudi
  • Adds examples (documentation and script) for running SDXL base and SDXL Turbo inference on Gaudi (see the usage sketch below)
  • Adds SDXL related tests for CI
  • Adds SDXL to the mdx documentation
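
Below, for context, is a minimal usage sketch of the pipeline this PR adds. The class names and arguments assume optimum-habana's existing Gaudi naming conventions (GaudiStableDiffusionXLPipeline, GaudiEulerDiscreteScheduler); treat the details as illustrative rather than the PR's final API.

```python
# Hedged usage sketch; names assume optimum-habana's Gaudi conventions.
# Verify against the merged code before relying on this.
from optimum.habana.diffusers import (
    GaudiEulerDiscreteScheduler,
    GaudiStableDiffusionXLPipeline,
)

model_name = "stabilityai/stable-diffusion-xl-base-1.0"

# The Euler Discrete scheduler this PR ports to Gaudi
scheduler = GaudiEulerDiscreteScheduler.from_pretrained(model_name, subfolder="scheduler")

pipeline = GaudiStableDiffusionXLPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,        # run on HPU
    use_hpu_graphs=True,    # capture HPU graphs for faster inference
    gaudi_config="Habana/stable-diffusion",
)

images = pipeline(prompt="Sailing ship painting by Van Gogh", num_images_per_prompt=4).images
```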

@dsocek dsocek requested a review from regisss as a code owner December 29, 2023 19:38
@libinta (Collaborator) left a comment


Hi @dsocek, can you start to check the comments? Thanks.

@dsocek dsocek requested a review from mandy-li as a code owner January 4, 2024 00:58
@dsocek dsocek requested a review from a user January 4, 2024 00:58
@dsocek (Contributor, Author) commented Jan 4, 2024

> Hi @dsocek, can you start to check the comments? Thanks.

@libinta we addressed your concerns, can you help check the update?

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dsocek (Contributor, Author) commented Jan 5, 2024

@regisss could you also please provide your insights on this PR?

Signed-off-by: Daniel Socek <[email protected]>
@regisss regisss added the run-test Run CI for PRs from external contributors label Jan 5, 2024
@regisss (Collaborator) left a comment


Very clean PR! I left a few minor comments.

Have you benchmarked it a bit and got some throughput numbers?

Comment on lines 120 to 126:

```python
model_cpu_offload_seq = "text_encoder->text_encoder_2->unet->vae"
_optional_components = ["tokenizer", "tokenizer_2", "text_encoder", "text_encoder_2"]
_callback_tensor_inputs = [
    "latents_batch",
    "text_embeddings_batch",
    "add_text_embeddings_batch",
    "add_time_ids_batch",
]
```
@regisss (Collaborator) commented Jan 5, 2024


We can remove these lines as the class already inherits them from StableDiffusionXLPipeline
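
For anyone unfamiliar with why these lines are redundant: Python resolves class attributes on the parent when a subclass does not redefine them, so the Gaudi pipeline picks up StableDiffusionXLPipeline's values automatically. A generic, self-contained illustration (not the actual pipeline code):

```python
# Generic Python illustration of class-attribute inheritance, not actual
# pipeline code: attributes a subclass does not redefine resolve on the parent.
class StableDiffusionXLPipelineStub:
    model_cpu_offload_seq = "text_encoder->text_encoder_2->unet->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds"]

class GaudiPipelineStub(StableDiffusionXLPipelineStub):
    pass  # no need to repeat the parent's class attributes

print(GaudiPipelineStub.model_cpu_offload_seq)    # inherited from the parent
print(GaudiPipelineStub._callback_tensor_inputs)  # ['latents', 'prompt_embeds']
```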

@dsocek (Contributor, Author) replied

Fixed. I also tried to handle the other callback inputs more correctly (maybe a similar fix is needed for the SD pipeline? There is some pairing that takes place which is not addressed in the original Gaudi SD pipeline).

@regisss (Collaborator) replied

Hmm, not sure I got it: it doesn't recognize the tensors to pass to the callback properly?
And so you modified the default value in `__call__`, right?

@dsocek (Contributor, Author) replied

I am not 100% confident how this should be handled. Initially, I tried defining `_callback_tensor_inputs` directly in the current derived class:

```python
_callback_tensor_inputs = [
    "latents_batch",
    "text_embeddings_batch",
    "add_text_embeddings_batch",
    "add_time_ids_batch",
]
```

so that these inputs can be popped from the callback kwargs when a callback occurs.
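
As a minimal, self-contained sketch of the step-end callback pattern under discussion (mirroring the upstream diffusers convention; the function and variable names here are illustrative, not the PR's actual code):

```python
# Illustrative sketch, not the PR's code: the pipeline gathers the tensors
# named in _callback_tensor_inputs, hands them to the user callback, then
# pops possibly modified values back, keeping the old tensor as fallback.
import torch

def run_denoising_step(latents, prompt_embeds, callback_on_step_end, tensor_inputs):
    state = {"latents": latents, "prompt_embeds": prompt_embeds}
    callback_kwargs = {k: state[k] for k in tensor_inputs}
    callback_outputs = callback_on_step_end(callback_kwargs)
    # Pop each tensor back out of the callback's returned dict
    latents = callback_outputs.pop("latents", latents)
    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
    return latents, prompt_embeds

def my_callback(callback_kwargs):
    callback_kwargs["latents"] = callback_kwargs["latents"] * 0.5  # e.g. rescale latents
    return callback_kwargs

latents, embeds = run_denoising_step(
    torch.ones(2, 4), torch.ones(2, 8), my_callback, tensor_inputs=["latents"]
)
```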

Now, in the current PR version, instead of that approach I pass the callback input tensors from the base class and then adjust them by pairing (via torch.cat) to align with the batched structures created by `_split_inputs_into_batches` inside the Gaudi class:

Callback input tensors from the base class (optimum/habana/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py, line 310):

```python
callback_on_step_end_tensor_inputs: List[str] = [
    "latents",
    "prompt_embeds",
    "negative_prompt_embeds",
    "add_text_embeds",
    "add_time_ids",
    "negative_pooled_prompt_embeds",
    "negative_add_time_ids",
],
```

and then pairing them manually when popped from the callback stack (optimum/habana/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py, lines 722-734).
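
A hypothetical sketch of the pairing described above, assuming the negative and positive embeddings are concatenated along the batch dimension for classifier-free guidance (the helper name _pair_callback_outputs is invented for illustration; the real logic lives in the lines 722-734 referenced above):

```python
# Hypothetical illustration; the helper name is invented. After the callback
# returns per-tensor values, negative and positive embeddings are paired via
# torch.cat so they match the batched tensors produced by
# _split_inputs_into_batches.
import torch

def _pair_callback_outputs(callback_outputs, prompt_embeds, negative_prompt_embeds):
    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
    if negative_prompt_embeds is not None:
        # Classifier-free guidance expects [negative, positive] on the batch dim
        return torch.cat([negative_prompt_embeds, prompt_embeds])
    return prompt_embeds

paired = _pair_callback_outputs({"prompt_embeds": torch.ones(1, 4)}, torch.zeros(1, 4), torch.zeros(1, 4))
print(paired.shape)  # torch.Size([2, 4]): negative then positive
```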

I am not sure what the best/correct implementation for this part would be...

I am also confused about the original Stable Diffusion pipeline for Gaudi. If you look at how it is implemented, there are also inherited callback input tensors:

https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L222

```python
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
```

However, in lines 474-476 it pops both latents and prompt_embeds. Here, prompt_embeds are popped into text_embeddings_batch, which expects a concatenated version (see https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L193):
https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L474

That looks kind of fishy, unless I am completely missing something obvious :)

@dsocek (Contributor, Author) commented Jan 5, 2024

> Very clean PR! I left a few minor comments.
>
> Have you benchmarked it a bit and got some throughput numbers?

Yes, indeed we benchmarked it on a single Gaudi2 HPU. Here is a snapshot:

```
1 prompt(s) received, 20 generation(s) per prompt, 4 sample(s) per batch, 5 total batch(es).
Speed metrics: {'generation_runtime': 180.7683, 'generation_samples_per_second': 0.205, 'generation_steps_per_second': 0.086}
```
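
For readers interpreting these metrics: naively dividing total samples by total runtime does not reproduce the reported 0.205 samples/s, presumably because the measurement excludes warm-up batches (an assumption; the measurement code is not shown here):

```python
# Assumed arithmetic only; optimum-habana's metric likely excludes warm-up
# batches, which would explain the gap with this naive value.
n_samples = 1 * 20                        # 1 prompt x 20 generations per prompt
runtime_s = 180.7683                      # reported generation_runtime
print(round(n_samples / runtime_s, 3))    # 0.111 naive vs 0.205 reported
```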

dsocek and others added 2 commits January 5, 2024 23:07
Add negative_prompt_embeds and negative_pooled_prompt_embeds check for
sdxl turbo.
@dsocek dsocek requested review from skavulya and regisss January 5, 2024 23:34
@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 8, 2024
Co-authored-by: regisss <[email protected]>
@dsocek dsocek requested a review from regisss January 8, 2024 15:50
@regisss regisss added run-test Run CI for PRs from external contributors and removed run-test Run CI for PRs from external contributors labels Jan 9, 2024
@regisss (Collaborator) left a comment


LGTM!

@regisss regisss merged commit db3f0a0 into huggingface:main Jan 9, 2024
11 of 12 checks passed
@regisss regisss mentioned this pull request Jan 23, 2024
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024