Kolors pipelines produce black images on ROCm #10715

Closed · Teriks opened this issue Feb 4, 2025 · 16 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

Teriks (Contributor) commented Feb 4, 2025

Describe the bug

Generating images with Kolors pipelines produces black images on the torch ROCm backend.

The fp16 VAE fix model does not appear to solve the issue.

The normal SDXL pipeline works fine, and its VAE decoding code is essentially identical, so I am not sure whether this is related to the VAE.

I would like to contribute some community pipelines related to Kolors, mainly ControlNet and inpainting variants, and discovered this issue while testing them.

This issue also affects the pipelines I have built here: https://github.com/Teriks/dgenerate/tree/version_5.0.0/dgenerate/extras/kolors/pipelines

Any idea where to start looking? Could it be related to the text encoder?

Reproduction

import diffusers
import torch

pipe = diffusers.KolorsPipeline.from_pretrained(
    'Kwai-Kolors/Kolors-diffusers', variant='fp16', torch_dtype=torch.float16)


pipe.to('cuda')


result = pipe(prompt='hello world')

result.images[0].save('test.png')

Logs

System Info

diffusers 0.32.2
transformers 4.48.1
torch 2.5.1

Who can help?

No response

Teriks added the bug label on Feb 4, 2025
asomoza (Member) commented Feb 4, 2025

Hi @Teriks

Any idea where to start looking? Could it be related to the text encoder?

Yes, the only major difference from the normal SDXL model is the text encoder; in this case it's a custom implementation by the model authors.

Probably the best approach here is to try to integrate it into transformers, but that's not an easy task.

hlky (Member) commented Feb 4, 2025

Hi @Teriks

1. Check VAE precision
pipe.vae = pipe.vae.to(torch.float32)
pipe.vae.enable_tiling() # to reduce memory usage because of higher precision

The Kwai-Kolors/Kolors-diffusers VAE config does not specify force_upcast, so this isn't happening automatically.
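
An untested sketch for isolating the VAE: generate latents only, then decode them with the VAE upcast to float32. This assumes the Kolors pipeline accepts output_type='latent' and follows the usual scaling_factor convention like the SDXL pipeline:

import torch

# Sketch: get raw latents, check them for NaN, then decode with a
# float32 VAE to see whether the NaN appears before or during decoding.
# Assumes output_type='latent' is supported, as in the SDXL pipeline.
latents = pipe(prompt='hello world', output_type='latent').images
print('NaN in latents:', torch.isnan(latents).any().item())

pipe.vae = pipe.vae.to(torch.float32)
with torch.no_grad():
    decoded = pipe.vae.decode(
        latents.to(torch.float32) / pipe.vae.config.scaling_factor,
        return_dict=False,
    )[0]
print('NaN after decode:', torch.isnan(decoded).any().item())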

2. Check prompt embeds

Either manually call pipe.encode_prompt and check the output, or print inside the pipeline:

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = self.encode_prompt(
    prompt=prompt,
    device=device,
    num_images_per_prompt=num_images_per_prompt,
    do_classifier_free_guidance=self.do_classifier_free_guidance,
    negative_prompt=negative_prompt,
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
)
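
An untested sketch of the manual route, with keyword arguments mirroring the call above (treat the exact signature as an assumption):

import torch

# Sketch: call encode_prompt directly and check every returned tensor
# for NaN. Arguments follow the pipeline call shown above.
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt='hello world',
    device=pipe.device,
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

for name, tensor in [
    ('prompt_embeds', prompt_embeds),
    ('negative_prompt_embeds', negative_prompt_embeds),
    ('pooled_prompt_embeds', pooled_prompt_embeds),
    ('negative_pooled_prompt_embeds', negative_pooled_prompt_embeds),
]:
    if tensor is not None:
        print(name, 'contains NaN:', torch.isnan(tensor).any().item())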

3. Check noise_pred

It's easiest to print inside the pipeline:

noise_pred = self.unet(
    latent_model_input,
    t,
    encoder_hidden_states=prompt_embeds,
    timestep_cond=timestep_cond,
    cross_attention_kwargs=self.cross_attention_kwargs,
    added_cond_kwargs=added_cond_kwargs,
    return_dict=False,
)[0]

If it's noise_pred that contains NaN, then we'd need to check further to find the layer/module with the first NaN.

Once we locate the source of the problem we can look at possible solutions; this could mean forcing a cast somewhere, or we may need to raise the issue with PyTorch.
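
A rough, untested sketch for locating the first NaN module with forward hooks (plain PyTorch, not a diffusers API; the same approach works on pipe.text_encoder):

import torch

# Sketch: flag the first submodule whose output contains NaN.
seen_nan = False

def make_hook(name):
    def hook(module, inputs, output):
        global seen_nan
        if seen_nan:
            return
        tensors = output if isinstance(output, (tuple, list)) else (output,)
        for t in tensors:
            if torch.is_tensor(t) and torch.isnan(t).any():
                print(f'first NaN in: {name} ({module.__class__.__name__})')
                seen_nan = True
                return
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in pipe.unet.named_modules()]
pipe(prompt='hello world', num_inference_steps=2)
for h in handles:
    h.remove()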

Teriks (Contributor, Author) commented Feb 4, 2025

@hlky thanks, I didn't think to check whether the config was actually forcing the upcast on the VAE.

I will first try torch 2.6.0 to see if that resolves it. Then, when I get time today, I'll try forcing the upcast from code (regardless of config) to see if there is still an overflow on ROCm.

I can get the outputs of the text_encoder and unet on ROCm if that does not resolve anything.

asomoza (Member) commented Feb 4, 2025

Just my thoughts again; it's been a while since I added Kolors, but the VAE upcast problem is an SDXL 1.0 base problem. The reason this pipeline doesn't force the upcast is that the Kolors VAE doesn't have that problem, same as other pipelines.

It's still worth checking just in case, but if the problem were the VAE, we would see it on every type of GPU.

The UNet is also worth checking, but it's the exact same architecture as SDXL, so it's a similar story: if there were a problem with the UNet, AMD GPUs would have this problem with all SDXL models.

Just as a time-saving recommendation, I would suggest starting with the text encoder before the VAE and the UNet.

Teriks (Contributor, Author) commented Feb 5, 2025

@asomoza

A quick test reveals that torch 2.6.0 + rocm6.2.4 is able to generate images without any code modifications.

Previously I was on torch 2.5.1 + rocm6.2.

My guess is that you're correct that it isn't the VAE. It is probably a layer in their custom model that is poorly supported by the torch + ROCm combination I was seeing this on.

I'll try to narrow it down to either the torch version or the backend version. I am guessing torch 2.5.1 + rocm6.2.4 would work if it is possible to build an environment with that, but we will see.

I'll be able to compare all the module outputs between versions, but yeah, it's probably the encoder.
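
For illustration, comparing the saved outputs between the two environments could look like this (file names below are placeholders):

import torch

# Sketch: load the text encoder outputs saved with torch.save from the
# working and broken environments and compare them.
good = torch.load('text_encoder_out_torch2.6.0.pt', map_location='cpu').float()
bad = torch.load('text_encoder_out_torch2.5.1.pt', map_location='cpu').float()

print('NaN in broken output:', torch.isnan(bad).any().item())
print('max abs difference:', (good - bad).abs().max().item())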

asomoza (Member) commented Feb 5, 2025

@Teriks thanks for investigating this; it will help us answer issues in the future. I'll try to test with an AMD GPU too.

Teriks (Contributor, Author) commented Feb 5, 2025

@asomoza FYI, the only environment where it works is torch 2.6.0 + rocm6.2.4, as there is no wheel for torch 2.5.1 + rocm6.2.4:

pip install diffusers accelerate sentencepiece transformers torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/rocm6.2.4/

And the non-working environment:

pip install diffusers accelerate sentencepiece transformers torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/rocm6.2/

Teriks (Contributor, Author) commented Feb 5, 2025

A hint from the broken environment on decode; VAE upcasting does not fix it, so it is definitely not the VAE:

/workspace/venv_broke/lib/python3.10/site-packages/diffusers/image_processor.py:147: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")

Here is a test script that saves the module outputs to .pt files with torch.save:

https://gist.github.com/Teriks/08d19127db30b69cf09d76858a698a08

The outputs from the text encoder are NaN, which is indeed where the problem is.

The text encoder is not working correctly on torch 2.5.1 + rocm6.2.
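
A follow-up sketch that would confirm it is a half-precision issue on that backend: upcast only the text encoder and re-run the embedding step (this assumes encode_prompt routes through pipe.text_encoder and that there is memory headroom for float32):

import torch

# Sketch: run only the text encoder in float32 and re-check the
# embeddings. If the NaNs disappear, the fp16 text encoder kernels on
# this torch/ROCm combination are the likely culprit.
pipe.text_encoder = pipe.text_encoder.to(torch.float32)

prompt_embeds, _, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt='hello world',
    device=pipe.device,
    num_images_per_prompt=1,
    do_classifier_free_guidance=False,
)
print('NaN in prompt_embeds:', torch.isnan(prompt_embeds).any().item())
print('NaN in pooled_prompt_embeds:', torch.isnan(pooled_prompt_embeds).any().item())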

github-actions bot (Contributor) commented Mar 6, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Mar 6, 2025
iwr-redmond commented Mar 8, 2025

Bumping. I reckon that ROCm support is going to be increasingly important with the advent of Strix Halo.

hlky (Member) commented Mar 8, 2025

@iwr-redmond This is an issue with torch 2.5.1 and rocm6.2; the latest versions work, so future ROCm GPUs will not be affected.

iwr-redmond commented

@Teriks there are builds of pytorch 2.5.1 for ROCm 6.3.2 here that might help to tease this out.

As an aside, the referenced build should include support for gfx1151 (Strix Halo).

hlky (Member) commented Mar 8, 2025

@iwr-redmond Current stable pytorch is 2.6. Generally we recommend using the latest stable version.

Closing as this issue can be considered resolved.

hlky closed this as completed on Mar 8, 2025
iwr-redmond commented

Generally we recommend using the latest stable version.

That may be, but the current version of diffusers is supported back to pytorch 1.4. Surely current -1 isn't too much of a stretch?

hlky (Member) commented Mar 8, 2025

A specific build of torch, 2.5.1 with rocm6.2, is broken. Users are free to try the builds provided by AMD, which may or may not resolve the issue, depending on whether the problem was in torch's kernels rather than in rocm itself. Official builds of 2.6 are known to work, so this issue can be considered resolved. There isn't anything else we can do from our end.

iwr-redmond commented

There isn't anything else we can do from our end.

Key issue. Thank you!
