Kolors pipelines produce black images on ROCm #10715

Closed · Teriks opened this issue Feb 4, 2025 · 16 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

Teriks (Contributor) commented Feb 4, 2025

Describe the bug

Generating images with Kolors pipelines produces black images on the torch ROCm backend.

The fp16 VAE fix model does not appear to solve the issue.

The normal SDXL pipeline works fine, and its VAE decoding code is essentially identical, so I am not sure whether this is related to the VAE.

I would like to contribute some community pipelines related to Kolors, mainly ControlNet and inpainting variants, and discovered this issue while testing them.

This issue also affects the pipelines I have built here: https://github.com/Teriks/dgenerate/tree/version_5.0.0/dgenerate/extras/kolors/pipelines

Any idea where to start looking? Could it be related to the text encoder?

Reproduction

import diffusers
import torch

pipe = diffusers.KolorsPipeline.from_pretrained(
    'Kwai-Kolors/Kolors-diffusers', variant='fp16', torch_dtype=torch.float16)


pipe.to('cuda')


result = pipe(prompt='hello world')

result.images[0].save('test.png')

Logs

System Info

diffusers 0.32.2
transformers 4.48.1
torch 2.5.1

Who can help?

No response

Teriks added the bug label on Feb 4, 2025
asomoza (Member) commented Feb 4, 2025

Hi @Teriks

Any idea where to start looking? Could it be related to the text encoder?

Yes, the only major difference from the normal SDXL model is the text encoder; in this case it's a custom implementation by the model authors.

Probably the best approach here is to try to integrate it into transformers, but that's not an easy task.

hlky (Member) commented Feb 4, 2025

Hi @Teriks

1. Check VAE precision
pipe.vae = pipe.vae.to(torch.float32)
pipe.vae.enable_tiling() # to reduce memory usage because of higher precision

The Kwai-Kolors/Kolors-diffusers VAE config does not specify force_upcast, so this isn't happening automatically.
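
An untested sketch for isolating the VAE: generate latents only, then decode them with the VAE upcast to float32. This assumes the Kolors pipeline accepts output_type='latent' and follows the usual scaling_factor convention like the SDXL pipeline:

import torch

# Sketch: get raw latents, check them for NaN, then decode with a
# float32 VAE to see whether the NaN appears before or during decoding.
# Assumes output_type='latent' is supported, as in the SDXL pipeline.
latents = pipe(prompt='hello world', output_type='latent').images
print('NaN in latents:', torch.isnan(latents).any().item())

pipe.vae = pipe.vae.to(torch.float32)
with torch.no_grad():
    decoded = pipe.vae.decode(
        latents.to(torch.float32) / pipe.vae.config.scaling_factor,
        return_dict=False,
    )[0]
print('NaN after decode:', torch.isnan(decoded).any().item())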

2. Check prompt embeds

Either manually call pipe.encode_prompt and check the output, or print inside the pipeline:

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = self.encode_prompt(
    prompt=prompt,
    device=device,
    num_images_per_prompt=num_images_per_prompt,
    do_classifier_free_guidance=self.do_classifier_free_guidance,
    negative_prompt=negative_prompt,
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
)
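
An untested sketch of the manual route, with keyword arguments mirroring the call above (treat the exact signature as an assumption):

import torch

# Sketch: call encode_prompt directly and check every returned tensor
# for NaN. Arguments follow the pipeline call shown above.
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt='hello world',
    device=pipe.device,
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

for name, tensor in [
    ('prompt_embeds', prompt_embeds),
    ('negative_prompt_embeds', negative_prompt_embeds),
    ('pooled_prompt_embeds', pooled_prompt_embeds),
    ('negative_pooled_prompt_embeds', negative_pooled_prompt_embeds),
]:
    if tensor is not None:
        print(name, 'contains NaN:', torch.isnan(tensor).any().item())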

3. Check noise_pred

It's easiest to print inside the pipeline:

noise_pred = self.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    timestep_cond=timestep_cond,
    cross_attention_kwargs=self.cross_attention_kwargs,
    added_cond_kwargs=added_cond_kwargs,
    return_dict=False,
)[0]

If it's noise_pred that contains NaN, then we'd need to check further to find the layer/module with the first NaN.

Once we locate the source of the problem we can look at possible solutions; this could mean forcing a cast somewhere, or we may need to raise the issue with PyTorch.
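
A rough, untested sketch for locating the first NaN module with forward hooks (plain PyTorch, not a diffusers API; the same approach works on pipe.text_encoder):

import torch

# Sketch: flag the first submodule whose output contains NaN.
seen_nan = False

def make_hook(name):
    def hook(module, inputs, output):
        global seen_nan
        if seen_nan:
            return
        tensors = output if isinstance(output, (tuple, list)) else (output,)
        for t in tensors:
            if torch.is_tensor(t) and torch.isnan(t).any():
                print(f'first NaN in: {name} ({module.__class__.__name__})')
                seen_nan = True
                return
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in pipe.unet.named_modules()]
pipe(prompt='hello world', num_inference_steps=2)
for h in handles:
    h.remove()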

Teriks (Contributor, Author) commented Feb 4, 2025

@hlky thanks, I didn't think to check whether the config was actually forcing the upcast on the VAE.

I will first try torch 2.6.0 to see if that resolves it. Then, when I get time today, I'll try forcing the upcast from code (regardless of config) to see if there is still an overflow on ROCm.

I can get the outputs of the text_encoder and unet on ROCm if that does not resolve anything.

asomoza (Member) commented Feb 4, 2025

Just my thoughts again; it's been a while since I added Kolors, but the VAE upcast problem is an SDXL 1.0 base problem. The reason this pipeline doesn't force the upcast is that the Kolors VAE doesn't have that problem, same as other pipelines.

It's still worth checking just in case, but if the problem were the VAE, we would see it on every type of GPU.

The UNet is also worth checking, but it's the exact same architecture as SDXL, so it's a similar story: if there were a problem with the UNet, AMD GPUs would have this problem with all SDXL models.

Just as a time-saving recommendation, I would suggest starting with the text encoder before the VAE and the UNet.

Teriks (Contributor, Author) commented Feb 5, 2025

@asomoza

A quick test reveals that torch 2.6.0 + rocm6.2.4 is able to generate images without any code modifications.

Previously I was on torch 2.5.1 + rocm6.2.

My guess is that you're correct that it isn't the VAE. It is probably a layer in their custom model that is poorly supported by the torch + ROCm combination I was seeing this on.

I'll try to narrow it down to either the torch version or the backend version. I am guessing torch 2.5.1 + rocm6.2.4 would work if it is possible to build an environment with that, but we will see.

I'll be able to compare all the module outputs between versions, but yeah, it's probably the encoder.
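
For illustration, comparing the saved outputs between the two environments could look like this (file names below are placeholders):

import torch

# Sketch: load the text encoder outputs saved with torch.save from the
# working and broken environments and compare them.
good = torch.load('text_encoder_out_torch2.6.0.pt', map_location='cpu').float()
bad = torch.load('text_encoder_out_torch2.5.1.pt', map_location='cpu').float()

print('NaN in broken output:', torch.isnan(bad).any().item())
print('max abs difference:', (good - bad).abs().max().item())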

asomoza (Member) commented Feb 5, 2025

@Teriks thanks for investigating this; it will help us answer issues in the future. I'll try to test with an AMD GPU too.

Teriks (Contributor, Author) commented Feb 5, 2025

@asomoza FYI, the only environment where it works is torch 2.6.0 + rocm6.2.4, as there is no wheel for torch 2.5.1 + rocm6.2.4:

pip install diffusers accelerate sentencepiece transformers torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/rocm6.2.4/

And the non-working environment:

pip install diffusers accelerate sentencepiece transformers torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/rocm6.2/

Teriks (Contributor, Author) commented Feb 5, 2025

A hint from the broken environment on decode; VAE upcasting does not fix it, so it is definitely not the VAE:

/workspace/venv_broke/lib/python3.10/site-packages/diffusers/image_processor.py:147: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")

Here is a test script that saves the module outputs to .pt files with torch.save:

https://gist.github.com/Teriks/08d19127db30b69cf09d76858a698a08

The outputs from the text encoder are NaN, which is indeed where the problem is.

The text encoder is not working correctly on torch 2.5.1 + rocm6.2.
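
A follow-up sketch that would confirm it is a half-precision issue on that backend: upcast only the text encoder and re-run the embedding step (this assumes encode_prompt routes through pipe.text_encoder and that there is memory headroom for float32):

import torch

# Sketch: run only the text encoder in float32 and re-check the
# embeddings. If the NaNs disappear, the fp16 text encoder kernels on
# this torch/ROCm combination are the likely culprit.
pipe.text_encoder = pipe.text_encoder.to(torch.float32)

prompt_embeds, _, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt='hello world',
    device=pipe.device,
    num_images_per_prompt=1,
    do_classifier_free_guidance=False,
)
print('NaN in prompt_embeds:', torch.isnan(prompt_embeds).any().item())
print('NaN in pooled_prompt_embeds:', torch.isnan(pooled_prompt_embeds).any().item())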

github-actions bot (Contributor) commented Mar 6, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Mar 6, 2025
iwr-redmond commented Mar 8, 2025

Bumping. I reckon that ROCm support is going to be increasingly important with the advent of Strix Halo.

hlky (Member) commented Mar 8, 2025

@iwr-redmond This is an issue with torch 2.5.1 and rocm6.2; the latest versions work, so future ROCm GPUs will not be affected.

iwr-redmond commented

@Teriks there are builds of pytorch 2.5.1 for ROCm 6.3.2 here that might help to tease this out.

As an aside, the referenced build should include support for gfx1151 (Strix Halo).

hlky (Member) commented Mar 8, 2025

@iwr-redmond Current stable pytorch is 2.6. Generally we recommend using the latest stable version.

Closing as this issue can be considered resolved.

hlky closed this as completed on Mar 8, 2025
iwr-redmond commented

Generally we recommend using the latest stable version.

That may be, but the current version of diffusers is supported back to pytorch 1.4. Surely current -1 isn't too much of a stretch?

hlky (Member) commented Mar 8, 2025

A specific build of torch, 2.5.1 with rocm6.2, is broken. Users are free to try the builds provided by AMD, which may or may not resolve the issue, depending on whether the problem was in torch's kernels rather than in rocm itself. Official builds of 2.6 are known to work, so this issue can be considered resolved. There isn't anything else we can do from our end.

iwr-redmond commented

There isn't anything else we can do from our end.

Key issue. Thank you!
