Kolors pipelines produce black images on ROCm #10715
Comments
Hi @Teriks
Yes, the only major difference from the normal SDXL model is the text encoder; in this case, it's a custom implementation made by the model authors. The best approach here is probably to try to integrate it with transformers, but that's not an easy task.
Hi @Teriks
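```python
import torch
# `pipe` is the already-loaded Kolors pipeline; run its VAE in float32
pipe.vae = pipe.vae.to(torch.float32)
pipe.vae.enable_tiling()  # to reduce memory usage because of the higher precision
```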
Either manually call the code at diffusers/src/diffusers/pipelines/kolors/pipeline_kolors.py, lines 873 to 889 in f63d322.
It's easiest to add prints in the pipeline at diffusers/src/diffusers/pipelines/kolors/pipeline_kolors.py, lines 995 to 1004 in f63d322.
If it's … Once we locate the source of the problem, we can look at possible solutions: this could be forcing a cast somewhere, or we may need to raise the issue with PyTorch.
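One way to locate the problem without editing the pipeline file is to attach forward hooks and record which submodules emit NaN or inf. This is a minimal sketch, assuming a plain `torch.nn.Module` such as `pipe.text_encoder`, `pipe.unet`, or `pipe.vae`; the helper names `find_nonfinite_modules` and `_tensors` are hypothetical and not part of diffusers.

```python
import torch
from torch import nn


def _tensors(output):
    """Yield the floating-point tensors contained in a module output."""
    if isinstance(output, dict):  # e.g. a transformers ModelOutput
        items = output.values()
    elif isinstance(output, (tuple, list)):
        items = output
    else:
        items = (output,)
    for item in items:
        if torch.is_tensor(item) and item.is_floating_point():
            yield item


def find_nonfinite_modules(model: nn.Module, *inputs, **kwargs):
    """Run `model` once; return names of submodules whose outputs contain NaN/inf,
    in the order their forward passes finished."""
    bad, handles = [], []

    def make_hook(name):
        def hook(module, args, output):
            if any(not torch.isfinite(t).all() for t in _tensors(output)):
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs, **kwargs)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks, even if forward fails
    return bad
```

Because NaN propagates through every later layer, the first name in the returned list is usually the one worth inspecting; later entries are often just downstream fallout.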
@hlky thanks, I didn't think to check whether the config was actually forcing the upcast on the VAE. I will first try torch 2.6.0 to see if that resolves it. Then I'll try forcing the upcast from code (regardless of config) to see whether there is still an overflow on ROCm. When I get time today, I can get the outputs of the text_encoder and unet on ROCm if that does not resolve anything.
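For context on the overflow being checked here: float16 can only represent finite values up to 65504, so anything larger becomes inf, and arithmetic on inf (for example inf - inf) yields NaN, which often ends up as a black image. Upcasting to float32 avoids this. A small sketch; `overflows_in_fp16` is a hypothetical helper, not a diffusers API.

```python
import torch

# Largest finite float16 value (65504.0); float32 tops out around 3.4e38.
FP16_MAX = torch.finfo(torch.float16).max


def overflows_in_fp16(values) -> bool:
    """Return True if casting `values` to float16 produces inf or NaN."""
    as_fp32 = torch.as_tensor(values, dtype=torch.float32)
    return not bool(torch.isfinite(as_fp32.to(torch.float16)).all())


print(overflows_in_fp16([1.0, 60000.0]))  # False: both values fit in float16
print(overflows_in_fp16([70000.0]))       # True: 70000 rounds to inf in float16
```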
Just my thoughts again; it's been a while since I added Kolors, but the VAE upcast problem is an SDXL 1.0 base problem. The reason this pipeline doesn't have it is that the Kolors VAE doesn't have this problem, same as other pipelines. It's still worth checking just in case, but if the problem were the VAE, we would see it on every type of GPU. Also worth checking, but the UNet has exactly the same architecture as SDXL, so if there were a problem with the UNet, AMD GPUs would have it for all SDXL models. As a time-saving recommendation, I would suggest starting with the text encoder before the VAE and the UNet.
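The reasoning for checking the text encoder first can be written as a small triage helper: NaN values propagate forward, so the earliest component in the data flow whose output is non-finite is the likely source. This is a minimal sketch; `first_nonfinite_stage` and the stage names are illustrative assumptions, and the tensors would be outputs you capture yourself (prompt embeddings, UNet prediction, decoded image).

```python
import torch

# Components in the order data flows through an SDXL-style pipeline.
STAGE_ORDER = ("text_encoder", "unet", "vae")


def first_nonfinite_stage(outputs):
    """Given {stage_name: tensor}, return the earliest stage (in STAGE_ORDER)
    whose tensor contains NaN or inf, or None if every captured tensor is finite."""
    for stage in STAGE_ORDER:
        tensor = outputs.get(stage)
        if tensor is not None and not torch.isfinite(tensor).all():
            return stage
    return None
```

If the text encoder is already non-finite, the UNet and VAE outputs will be too, so they cannot be judged on their own until the first stage is fixed.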
A quick test reveals that torch 2.6.0 + rocm6.2.4 is able to generate images without any modifications to the code. Previously I was on torch 2.5.1 + rocm6.2. My guess is that you're correct about it not being the VAE; it is probably a layer in their custom model that is poorly supported by the ROCm backend + torch in the versions where I was experiencing this. I'll try to narrow it down to the torch version or the backend version. I am guessing torch 2.5.1 + rocm6.2.4 would work, if it is possible to build an environment with that, but we will see. I'll be able to compare all the module outputs between versions, but yeah, it's probably the encoder.
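When separating the torch version from the ROCm backend version, it helps to record exactly what each environment reports. A minimal sketch using standard torch attributes (`torch.__version__`, `torch.version.hip`, `torch.version.cuda`); the function name `backend_report` is hypothetical.

```python
import platform

import torch


def backend_report() -> dict:
    """Collect the version info needed to tell torch and ROCm/CUDA builds apart."""
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip": torch.version.hip,
        "cuda": torch.version.cuda,
        # ROCm builds reuse the torch.cuda API, so this is True on a working AMD GPU too.
        "gpu_available": torch.cuda.is_available(),
    }


for key, value in backend_report().items():
    print(f"{key}: {value}")
```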
@Teriks thanks for investigating this; it will help us answer issues in the future. I'll try to test with an AMD GPU too.
@asomoza FYI, the only environment in which it works is:
And non-working:
Hint: in the broken environment, upcasting the VAE on decode does not fix it, so it is definitely not the VAE.
Here is a test script that saves the module outputs to .pt files: https://gist.github.com/Teriks/08d19127db30b69cf09d76858a698a08 The outputs from the text encoder are NaN, which is indeed where the problem is. The text encoder is not working correctly on torch 2.5.1 and rocm6.2.
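The linked gist is the author's actual script. As an independent, hypothetical sketch of the same idea, forward hooks can capture each submodule's output, the resulting dict can be saved per environment, and two captures can then be compared. `capture_outputs`, `compare`, and `_first_tensor` are illustrative names, and the default tolerance is an arbitrary assumption.

```python
import torch
from torch import nn


def _first_tensor(output):
    """Return the first tensor in a module output (tensor, tuple, or list), else None."""
    if torch.is_tensor(output):
        return output
    if isinstance(output, (tuple, list)):
        for item in output:
            if torch.is_tensor(item):
                return item
    return None


def capture_outputs(model: nn.Module, *inputs) -> dict:
    """Run `model` once and return {submodule_name: output as float32 on CPU}.

    If a submodule runs more than once, only its last output is kept.
    """
    captured, handles = {}, []
    for name, module in model.named_modules():
        if not name:  # skip the root module
            continue

        def hook(mod, args, output, name=name):
            tensor = _first_tensor(output)
            if tensor is not None:
                captured[name] = tensor.detach().float().cpu()

        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()
    return captured


def compare(reference: dict, candidate: dict, atol: float = 1e-3) -> list:
    """Return names whose candidate output is missing, non-finite, or differs from the reference."""
    mismatched = []
    for name, ref in reference.items():
        cand = candidate.get(name)
        if (
            cand is None
            or cand.shape != ref.shape
            or not torch.isfinite(cand).all()
            or not torch.allclose(ref, cand, atol=atol)
        ):
            mismatched.append(name)
    return mismatched
```

In practice the captured dict would be written with `torch.save` in both the working and the broken environment and reloaded with `torch.load` before calling `compare`; since it only holds tensors keyed by strings, it also loads with `weights_only=True`.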
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Bumping. I reckon that ROCm support is going to be increasingly important with the advent of Strix Halo.
@iwr-redmond This is an issue with torch 2.5.1 and rocm6.2; the latest versions work, so future ROCm GPUs will not be affected.
@Teriks there are builds of PyTorch 2.5.1 for ROCm 6.3.2 here that might help tease this out. As an aside, the referenced build should include support for gfx1151 (Strix Halo).
@iwr-redmond The current stable PyTorch is 2.6. Generally we recommend using the latest stable version. Closing, as this issue can be considered resolved.
That may be, but the current version of diffusers is supported back to PyTorch 1.4. Surely current-minus-one isn't too much of a stretch?
A specific build of torch, …
Key issue. Thank you!
Describe the bug
Generating images with Kolors pipelines produces black images on the torch ROCm backend.
The fp16 VAE fix model does not appear to solve the issue.
The normal SDXL pipeline works fine, and its VAE decoding code is basically identical, so I am not sure whether this is related to the VAE.
I would like to contribute some community pipelines related to Kolors, mainly ControlNet and Inpainting variants, and discovered this while testing.
This issue also affects the pipelines I have built here: https://github.com/Teriks/dgenerate/tree/version_5.0.0/dgenerate/extras/kolors/pipelines
Any idea where to start looking? Could it be related to the text encoder?
Reproduction
Logs
System Info
diffusers 0.32.2
transformers 4.48.1
torch 2.5.1
Who can help?
No response