MultiControlNetUnionModel on SDXL #10747

Merged
merged 8 commits
Feb 12, 2025

Conversation

guiyrt
Contributor

@guiyrt guiyrt commented Feb 7, 2025

What does this PR do?

New MultiControlNetUnionModel wrapper class to handle multiple ControlNetUnionModels, similar to MultiControlNetModel. Addresses #10656, which asks for control over the start, end and scale of each condition image.

Input
Segmentation Pose
Inference code
import torch

from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_id = "brad-twinkl/controlnet-union-sdxl-1.0-promax"

controlnet = ControlNetUnionModel.from_pretrained(controlnet_id, torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
    model_id,
    controlnet=[controlnet, controlnet],
    vae=AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    ),
    torch_dtype=torch.float16,
    variant="fp16",
)

room_seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")


pipe.enable_model_cpu_offload()

image = pipe(
    prompt="an astronaut in space",
    width=1024,
    height=1024,
    negative_prompt="lowres, low quality, worst quality",
    generator=torch.manual_seed(42),
    guidance_scale=5,
    num_inference_steps=50,
    control_image=[[pose_img], [room_seg_img]],
    control_mode=[[0], [5]]
).images[0]

image.save("result.jpg")

First, I ran the pipeline as before, using a single ControlNetUnionModel with pose, segmentation and pose+segmentation conditions, to have outputs to compare with.

ControlNetUnionModel

Segmentation Pose Segmentation + Pose

MultiControlNetUnionModel

Two instances of ControlNetUnionModel: one received the segmentation conditioning and the other the pose. I set controlnet_conditioning_scale to [0.0, 1.0] and [0.0, 1.0] so the outputs could be compared against the single-conditioning ControlNetUnionModel outputs above. As we see (and expect), these outputs are the same. The output is different when using both segmentation and pose conditioning, which I think is expected.

Segmentation Pose Segmentation + Pose

Before submitting

Who can review?

@hlky @yiyixuxu @vladmandic @asomoza

@guiyrt guiyrt marked this pull request as draft February 7, 2025 20:43
@guiyrt guiyrt marked this pull request as ready for review February 7, 2025 21:12
@guiyrt
Contributor Author

guiyrt commented Feb 7, 2025

Some points I had in my mind:

  1. What's the standing on control_image / image parameter naming? Is this something relevant to change here?
  2. Should I write new tests for MultiControlNetUnionModel?
  3. In the forward() of ControlNetUnionModel, we have the argument control_type_idx which is the multi-hot encoding of the conditions used. Moving this to inside the function would remove some code from the pipelines, as it is derived from control_type. And we wouldn't need to handle it differently for multi or single controlnet_union, as we do now. As we are updating the controlnet_union pipelines, we could easily change this, if you find it relevant :) (does not change pipeline public interface)
# Example, [2,5] -> (0,0,1,0,0,1,0,0)
if isinstance(controlnet, ControlNetUnionModel):
    control_type = torch.zeros(controlnet.config.num_control_type).scatter(0, torch.tensor(control_mode), 1)
elif isinstance(controlnet, MultiControlNetUnionModel):
    control_type = [
        torch.zeros(controlnet_.config.num_control_type).scatter(0, torch.tensor(control_mode_), 1)
        for control_mode_, controlnet_ in zip(control_mode, self.controlnet.nets)
    ]

@guiyrt guiyrt changed the title [WIP] MultiControlNetUnionModel on SDXL MultiControlNetUnionModel on SDXL Feb 7, 2025
Member

@hlky hlky left a comment


  1. control_image / image is the standard naming, we'd like to keep it consistent across pipelines
  2. That would be great
  3. We have that in the pipeline to avoid re-computing it for every sampling step. The original had it in the model and used nonzero which caused a cuda sync every step. We changed it to avoid the sync, using scatter here is nice though, looks like we can use scatter_ to be in-place.

@hlky hlky requested a review from yiyixuxu February 7, 2025 22:33
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@guiyrt
Contributor Author

guiyrt commented Feb 8, 2025

  1. control_image / image is the standard naming, we'd like to keep it consistent across pipelines

👍, was checking because of #10131 (comment)

  1. That would be great

👍, will add a new test class

  1. We have that in the pipeline to avoid re-computing it for every sampling step. The original had it in the model and used nonzero which caused a cuda sync every step. We changed it to avoid the sync, using scatter here is nice though, looks like we can use scatter_ to be in-place.

That makes sense, keeping it as is and updating to use scatter_ instead.
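For reference, a minimal sketch of the in-place variant (matching the snippet in the PR description, illustrative only):

# Same multi-hot encoding as above, but using the in-place scatter_
control_type = torch.zeros(controlnet.config.num_control_type)
control_type.scatter_(0, torch.tensor(control_mode), 1)  # e.g. [2, 5] -> (0,0,1,0,0,1,0,0)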

@john09282922

@guiyrt, Hi, thanks for the awesome work. How can I use multi-controlnet-union? Is there example code? And can you give me an example of setting the scale for each condition?

@yiyixuxu
Collaborator

@hlky feel free to merge once it looks good to you!

@hlky
Member

hlky commented Feb 10, 2025

@guiyrt Can you check the output of ControlNetUnionModel with Segmentation + Pose against main?

@guiyrt
Contributor Author

guiyrt commented Feb 10, 2025

@guiyrt Can you check the output of ControlNetUnionModel with Segmentation + Pose against main?

Yep, I'll post it here in a sec.

I'm working on the tests. One of the issues is related to enable_sequential_cpu_offload: we have the same problem we had with the SigLIP image encoder we used for the SD3 IP-Adapter, not sure if you remember (
#9987 (comment)). ControlNetUnionModel uses nn.MultiheadAttention, which only works with enable_sequential_cpu_offload if you exclude it from offloading. Only then does it pass the test_sequential_cpu_offload_forward_pass test.

@guiyrt
Contributor Author

guiyrt commented Feb 10, 2025

@guiyrt, Hi, thanks for the awesome work. How can I use multi-controlnet-union? Is there example code? And can you give me an example of setting the scale for each condition?

Hi @john09282922, example inference code is above in a dropdown of the PR description, but I'll paste it here again :). To change the condition scale, start and end, you just pass them as you normally would, but now in a list. For example, you would pass control_guidance_start as [0.1, 0.0, 0.5] if you have 3 controlnet_unions: 0.1 goes to the first, 0.0 to the second and 0.5 to the third. Same for control_guidance_end and controlnet_conditioning_scale. There is a short illustrative snippet of these parameters after the inference code below.

Inference code
import torch

from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_id = "brad-twinkl/controlnet-union-sdxl-1.0-promax"

controlnet = ControlNetUnionModel.from_pretrained(controlnet_id, torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
    model_id,
    controlnet=[controlnet, controlnet],
    vae=AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    ),
    torch_dtype=torch.float16,
    variant="fp16",
)

room_seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")


pipe.enable_model_cpu_offload()

image = pipe(
    prompt="an astronaut in space",
    width=1024,
    height=1024,
    negative_prompt="lowres, low quality, worst quality",
    generator=torch.manual_seed(42),
    guidance_scale=5,
    num_inference_steps=50,
    control_image=[[pose_img], [room_seg_img]],
    control_mode=[[0], [5]]
).images[0]

image.save("result.jpg")

@guiyrt
Contributor Author

guiyrt commented Feb 10, 2025

@guiyrt Can you check the output of ControlNetUnionModel with Segmentation + Pose against main?

@hlky Corporate needs you to find the differences between this picture and this picture

Main 34ab1af
Inference code
import torch

from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image

model_id = "stabilityai/stable-diffusion-xl-base-1.0"

controlnet = ControlNetUnionModel.from_pretrained(
    "brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)

pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
    model_id,
    controlnet=controlnet,
    vae=AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    ),
    torch_dtype=torch.float16,
    variant="fp16",
)

room_seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")


pipe.enable_model_cpu_offload()

image = pipe(
    prompt="an astronaut in a space station",
    width=1024,
    height=1024,
    negative_prompt="lowres, low quality, worst quality",
    generator=torch.manual_seed(42),
    guidance_scale=5,
    num_inference_steps=50,
    control_image=[pose_img, room_seg_img],
    control_mode=[0, 5],
).images[0]

image.save("result_main.jpg")

@hlky
Member

hlky commented Feb 10, 2025

Thanks @guiyrt. Just checking the output hasn't changed. I think we should expect the output to be the same between ControlNetUnionModel and MultiControlNetUnionModel, but both outputs for Segmentation + Pose look a little off to me.

cc @asomoza we're using control images of different resolutions here, and not resizing to the generation resolution. In your testing of ControlNetUnion, does this affect the result? Do you have any recommendations for two control images that are known to work together?

@guiyrt
Contributor Author

guiyrt commented Feb 10, 2025

Thanks @guiyrt. Just checking output hasn't changed, I think we should be expecting the output to be the same between ControlNetUnionModel and MultiControlNetUnionModel, but both outputs for Segmentation + Pose look a little off to me.

It works well for each condition, but getting the best results when using both might require trying different controlnet_conditioning_scale values, for example. Also, the best results I saw had both conditions complementing each other, which is not the case here.

cc @asomoza we're using control images of different resolutions here, and not resizing to the generation resolution, in your testing of ControlNetUnion does this affect the result?

The control images are passed through the VaeImageProcessor, which resizes them, but the aspect ratio is not preserved (of the pose image, in this case). A manual resize that keeps the aspect ratio is sketched after the images below.

Original Processed
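If you want to avoid the aspect-ratio distortion, one workaround is to resize and center-crop the control image yourself before passing it to the pipeline (a sketch with PIL, not something the pipeline does for you):

from PIL import Image

def resize_keep_aspect(img: Image.Image, width: int = 1024, height: int = 1024) -> Image.Image:
    # Scale so the image covers the target size, then center-crop the excess,
    # keeping the original aspect ratio instead of stretching.
    scale = max(width / img.width, height / img.height)
    resized = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left = (resized.width - width) // 2
    top = (resized.height - height) // 2
    return resized.crop((left, top, left + width, top + height))

pose_img = resize_keep_aspect(pose_img)  # then pass it to the pipeline as usual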

@asomoza
Member

asomoza commented Feb 11, 2025

I'm not particularly fond of the auto resizing of the images for controlnet. The good controlnets are really affected by bad resizing or resolutions, so I would prefer it to throw an error instead of auto-scaling the images, especially if it messes up the aspect ratio, but that's another issue and not something for this PR.

@hlky, using a real scenario with real images at the correct sizes, and also with the correct resolution for the conditioning images (the preprocessors have their own resolutions too), one common scenario is to use a depth map together with one of the edge or line conditions. I like to use teed with a special combination I learned from anyline.

This combination makes it really easy to test, because if you want a good image you'll need to lower the scales and the guidance ends, so you can tell when they're not working well together.

This is my test with it:

depth teed both

In this case, the depth map is the one that gives the overall scene composition and lighting, and the teed one is the one that adds the details, especially for the river waves and the background.

How does it compare to a single controlnet in main? I can test it with specific conditioning scales and ends for each one, so here I'm testing with both conditions at 1.0:

single controlnet multi conditions multi controlnet
single both_full

They're different but in theory they should produce the same result. To test this, I tested each condition separately with the single controlnet union:

single controlnet depth single controlnet teed
single_depth single_teed

There's definitely something going on here, but if you ask me, I like the result from the multicontrolnet in this PR more than the original single one. Also, I mostly use it with multicontrolnets because I like to control each one with a different guidance scale and end.

To me it seems that the single controlnet with multiple conditions is not applying the depth condition as strongly and takes the teed one more into account.

Member

@hlky hlky left a comment


Thanks @guiyrt

@guiyrt
Contributor Author

guiyrt commented Feb 12, 2025

I'm not particularly fond of the auto resizing of the images for controlnet, the good controlnets are really affected by bad resizings or resolutions, so I would like it more if it throws an error instead of auto scaling the images specially if it messes up the aspect ratio, but that's another issue and not something for this PR.

I agree with this. Is there something in place for the other controlnet pipelines regarding preserving aspect ratio? Otherwise, if you see value in it, I could work on something.

They're different but in theory they should produce the same result. To test this, I tested each condition separately with the single controlnet union:

I don't think an equal result is expected from "single controlnet multi conditions" and "multi controlnet". From the standpoint of controlnet inputs, the control embedding is different, and the fused control condition is also different, right? The output should be similar, but not equal.

@asomoza
Member

asomoza commented Feb 12, 2025

I don't think an equal result is expected from "single controlnet multi conditions" and "multi controlnet". From the standpoint of controlnet inputs, the control embedding is different, and the fused control condition is also different, right? The output should be similar, but not equal.

Yeah, I probably worded it badly, that's what I meant: they're different when comparing them, and as I wrote before, I don't use multiple conditions in the same controlnet union except for testing. I'll do some more testing later.

Also, I wouldn't make this something that important or a blocker. ControlNet union is really weird and cool at the same time: you can even mix the conditions in one image and it will still work, and you can use conditions with other control types to get some interesting results; most of the time they will still work.

@yiyixuxu yiyixuxu merged commit 5105b5a into huggingface:main Feb 12, 2025
12 checks passed
@vladmandic
Contributor

    control_image=[[pose_img], [room_seg_img]],
    control_mode=[[0], [5]]

Why are those now lists of lists?
This is not how any other multicontrolnet works, where control_image is a list of images.

@asomoza
Member

asomoza commented Feb 12, 2025

Isn't it because this is MultiControlNetUnion, where each controlnet (union) accepts a list of images?

If I understand correctly what you're saying, you're suggesting that we make the controlnet union a single-image controlnet when used with multi controlnets?

@vladmandic
Contributor

vladmandic commented Feb 12, 2025

yes, exactly.

  • it's either a single controlnet union with multiple inputs (and then we have no independent scale/start/end), or
  • it's a multi controlnet union, each with a single input

Doing multi-of-multi is not something that can be effectively used, and it makes assembling the correct params a complete nightmare and non-standard compared with any other controlnet.

@elismasilva
Contributor

@guiyrt If you apply a control mask to the pose control, you can improve the segmentation controlnet; see this:

Without mask With mask
image image

I am using conditioning scale, see:
image

I changed the prompt to "an astronaut in space, inside spaceship" and used only 30 steps.
If the image quality is bad, that's expected, as I am using float8 inference.

PS: I am not using your pipeline, because I had already implemented the use of controlnet union in my pipeline before the official publication, but I believe the result will be the same in your case; what should help is the control mask.

@guiyrt
Contributor Author

guiyrt commented Feb 13, 2025

  • its either single controlnet union with multi inputs (and then we deal with no independent scale/start/end) or
  • its multi controlnet union, each with single input

I looked into the Flux implementation for context, and it is indeed implemented differently. From what I gathered, there is no FluxControlNetUnionModel; there is only FluxControlNetModel, whose forward() has the argument controlnet_mode, which is None for normal controlnets and a single value for controlnet unions. This means you cannot process multi-condition input in a single controlnet union execution. When you have multiple conditions, you instead use FluxMultiControlNetModel, but that calls your controlnet union for each condition input separately. In this case, you have reduced memory usage, as a single controlnet is loaded, but you don't benefit from reduced execution time compared to running two single-purpose controlnets.
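Roughly, the Flux-style approach works like the following sketch (not the actual diffusers code; argument names like controlnet_cond and control_mode are placeholders): one controlnet call per condition, with the residuals summed.

import torch

class MultiControlNetSketch(torch.nn.Module):
    def __init__(self, nets):
        super().__init__()
        self.nets = torch.nn.ModuleList(nets)

    def forward(self, sample, timestep, cond_images, cond_modes, scales, **kwargs):
        down_sum, mid_sum = None, None
        # One forward pass per condition, even if every entry in self.nets
        # is the same controlnet union instance.
        for net, image, mode, scale in zip(self.nets, cond_images, cond_modes, scales):
            down, mid = net(sample, timestep, controlnet_cond=image,
                            control_mode=mode, conditioning_scale=scale, **kwargs)
            if down_sum is None:
                down_sum, mid_sum = list(down), mid
            else:
                down_sum = [d + d_new for d, d_new in zip(down_sum, down)]
                mid_sum = mid_sum + mid
        return down_sum, mid_sum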

doing multi-of-multi is not something that can be effectively used - and makes assembling correct params a complete nightmare and non-standard with any other controlnet.

I agree with you that in the current state, the list-of-lists input is confusing if you have many conditions or many controlnets, but I also don't think we should disregard the usage of multi-condition single execution if a controlnet union supports it. After giving some thought, maybe the ideal scenario would be the following:

  • ControlNetModel: regular controlnets and controlnet unions in single-condition mode, like for Flux. Input is a single condition image and a single condition mode (which can be None for normal controlnets).
  • ControlNetUnionModel: exclusively for controlnet unions that operate in multi-condition mode, like in the current version. Input is a list of condition images and a list of condition modes. Executes the controlnet union once, independently of the number of condition images.
  • MultiControlNetModel: multiple ControlNetModels, like in Flux. Input is a list of condition images and a list of condition modes. Executes the controlnets multiple times, as many as there are condition images. Nets can be (like for Flux):
    • a single ControlNetModel union operating in single-condition mode. This way, you can control condition scale/start/end individually.
    • a mix of normal ControlNetModel and ControlNetModel union in single-condition mode.
    • CANNOT BE a ControlNetUnionModel.

Compared to the current version, this would have the benefit of a simplified interface for MultiControlNetModel, as in FluxMultiControlNetModel, and you could still benefit from multi-condition single execution of controlnet unions via ControlNetUnionModel. The drawback is that you cannot use ControlNetUnionModel in MultiControlNetModel. If you want to use normal controlnets and controlnet unions together, it can still be done, but with the controlnet unions in single-condition mode.

The more complex interface comes from MultiControlNetUnionModel, where you can pass a list of conditions for each of your controlnets. In this proposal that would be removed, and with that you couldn't run multiple controlnet unions in multi-condition mode. Not sure how common of a use case that is, but if you want this functionality, you always need to pass a list of lists for condition images and modes.

And I also think we wouldn't need separate controlnet_union pipelines, such as StableDiffusionXLControlNetUnionPipeline, as the interface would be the same (same as Flux currently).

@john09282922

Hi, can you also merge this into the SDXL inpaint pipeline, like pipeline_controlnet_union_inpaint_sd_xl.py?
@yiyixuxu @hlky

Thanks,

@hlky
Member

hlky commented Feb 13, 2025

@vladmandic You requested this use case, and @guiyrt has very kindly taken up MultiControlNetUnion. I've also done experiments to use scale per condition without MultiControlNetUnion, #10723. Yes, it is a different interface, but it is a unique model; all we need to do to handle this in any integration is a simple if/else.

@vladmandic
Contributor

vladmandic commented Feb 13, 2025

@hlky @guiyrt I'll make it work, and I definitely do appreciate the work; I'm just worried about the lack of standardization, which significantly increases complexity for the end user.
This could be avoided by non-structural changes: instead of throwing a runtime error, it could be as simple as

   if isinstance(control_mode, list) and isinstance(control_mode[0], int):
       control_mode = [[x] for x in control_mode]
       control_image = [[x] for x in control_image]

@guiyrt
Contributor Author

guiyrt commented Feb 13, 2025

@hlky @guiyrt I'll make it work, and I definitely do appreciate the work; I'm just worried about the lack of standardization, which significantly increases complexity for the end user. This could be avoided by non-structural changes: instead of throwing a runtime error, it could be as simple as

   if isinstance(control_mode, list) and isinstance(control_mode[0], int):
       control_mode = [[x] for x in control_mode]
       control_image = [[x] for x in control_image]

If you have a single controlnet union, you can still pass a list of control images and a list of control modes. But if you have more than one controlnet, there is no way around a list of lists, otherwise we couldn't infer which input would go to which controlnet (assuming you want to pass multiple conditions to a single controlnet union).

I'll put here an example using three conditions, and the differences in using one and two controlnets.

Single controlnet union

StableDiffusionXLControlNetUnionPipeline is instantiated with a single controlnet. All conditions go to the single controlnet; control_image is a list of images and control_mode is a list of ints.

import torch

from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image

controlnet = ControlNetUnionModel.from_pretrained(
    "brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)

pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    ),
    torch_dtype=torch.float16,
    variant="fp16",
)

seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
seg_mode = 5

pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")
pose_mode = 0

depth_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png")
depth_mode = 1

pipe.enable_model_cpu_offload()

image = pipe(
    prompt="an astronaut in space",
    width=1024,
    height=1024,
    negative_prompt="lowres, low quality, worst quality",
    generator=torch.manual_seed(42),
    guidance_scale=5,
    num_inference_steps=50,
    control_image=[pose_img, seg_img, depth_img],
    control_mode=[pose_mode, seg_mode, depth_mode]
).images[0]

image.save("result.jpg")

Multiple controlnet unions

Now, StableDiffusionXLControlNetUnionPipeline is instantiated with a list of two controlnets. Pose and segmentation conditions go to the first controlnet and depth goes to the second controlnet. control_image is a list of lists of images and control_mode is a list of lists of ints.

import torch

from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image

controlnet = ControlNetUnionModel.from_pretrained(
    "brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)

pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=[controlnet, controlnet],
    vae=AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    ),
    torch_dtype=torch.float16,
    variant="fp16",
)

seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
seg_mode = 5

pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")
pose_mode = 0

depth_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png")
depth_mode = 1

pipe.enable_model_cpu_offload()

image = pipe(
    prompt="an astronaut in space",
    width=1024,
    height=1024,
    negative_prompt="lowres, low quality, worst quality",
    generator=torch.manual_seed(42),
    guidance_scale=5,
    num_inference_steps=50,
    control_image=[[pose_img, seg_img], [depth_img]],
    control_mode=[[pose_mode, seg_mode], [depth_mode]]
).images[0]

image.save("result_multi.jpg")

This is a dummy experiment, but I'll still post the outputs here.

Single `ControlNetUnionModel` `MultiControlNetUnionModel`

What we could easily change is to assume that, if a single value is passed and the controlnet is a MultiControlNetUnionModel, that single element is converted to a list. It's a bit different than what you suggested before, but the effect would be that [[pose_img, seg_img], depth_img] transforms into [[pose_img, seg_img], [depth_img]]. The same for control modes. Is this closer to what you had in mind? @vladmandic

This would produce the effect you mentioned if you intend to pass a single condition to each controlnet union. If you passed [pose_img, seg_img, depth_img] to a single ControlNetUnionModel, all the conditions would go to that controlnet. But if you had a MultiControlNetUnionModel, this input assumes you have 3 ControlNetUnionModels, and each would get one condition, equivalent to [[pose_img], [seg_img], [depth_img]], which I think was what you were aiming for. A minimal sketch of this normalization is below.
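A minimal sketch of that normalization, assuming it lives in the pipeline's input checks (hypothetical placement):

# Wrap bare entries in single-element lists when using MultiControlNetUnionModel.
if isinstance(controlnet, MultiControlNetUnionModel):
    control_image = [img if isinstance(img, list) else [img] for img in control_image]
    control_mode = [mode if isinstance(mode, list) else [mode] for mode in control_mode]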

I'm happy to iterate on this, let's just define what is expected and then execute on that :)

@vladmandic
Contributor

I'm ok with leaving this as-is; I'll add a special-case handler on my side.

@elismasilva
Contributor

Hi @guiyrt, I know it's late to share this now, but I just want to show how I did it in my pipeline. Inside the denoising loop:

 # controlnet(s) inference
if guess_mode and do_classifier_free_guidance:
    # Infer ControlNet only for the conditional batch.
    control_model_input = latents
    control_model_input = self.scheduler.scale_model_input(control_model_input, t)
    controlnet_prompt_embeds = prompt_embeds.chunk(2)[1]
    controlnet_added_cond_kwargs = {
        "text_embeds": add_text_embeds.chunk(2)[1],
        "time_ids": add_time_ids.chunk(2)[1],
    }
else:
    control_model_input = latent_model_input
    controlnet_prompt_embeds = prompt_embeds
    controlnet_added_cond_kwargs = added_cond_kwargs

if union_control_type is not None:
    # controlnet union index
    # 0 -- openpose
    # 1 -- depth
    # 2 -- hed/pidi/scribble/ted
    # 3 -- canny/lineart/anime_lineart/mlsd
    # 4 -- normal
    # 5 -- segment
    # 6 -- tile
    # 7 -- repaint
    union_controlnets = {k: v for k, v in enumerate(union_control_type)}

#reset blocks variable
down_block_res_samples = None
mid_block_res_sample = None
down_block_res_samples_list, mid_block_res_sample_list = [], [] 

if isinstance(self.controlnet, MultiControlNetModel): 
    total_controlnet=len(self.controlnet.nets)                           
    for control_index in range(total_controlnet):
        #set conditioning_scale
        if isinstance(controlnet_keep[i], list):
            cond_scale = [float(c * s) for c, s in zip(controlnet_conditioning_scale, [controlnet_keep[i][control_index]] * len(controlnet_conditioning_scale))]
        else:
            controlnet_cond_scale = controlnet_conditioning_scale
            if isinstance(controlnet_cond_scale, list):
                controlnet_cond_scale = controlnet_cond_scale[0]
            cond_scale = controlnet_cond_scale * controlnet_keep[i][control_index]

        if(isinstance(self.controlnet.nets[control_index], ControlNetModel)):             
            self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to(_device) 
            down_block_res_samples, mid_block_res_sample = self.controlnet.nets[control_index](
                control_model_input,
                t,
                encoder_hidden_states=controlnet_prompt_embeds,
                controlnet_cond=control_image[control_index],
                conditioning_scale=cond_scale[control_index],
                guess_mode=guess_mode,
                added_cond_kwargs=controlnet_added_cond_kwargs,
                return_dict=False)

            # controlnet mask
            if (apply_control_masks[control_index]):                                                        
                if control_mask is not None and len(control_mask) > 0:
                    down_block_res_samples, mid_block_res_sample = self.apply_mask(control_mask[control_index], _device, dtype, down_block_res_samples, mid_block_res_sample)

            down_block_res_samples_list.append(down_block_res_samples)
            mid_block_res_sample_list.append(mid_block_res_sample)
            
            if self.controlnet.nets[control_index].device != "cpu": #release memory to next controlnet
                self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to("cpu") 
            
        elif (isinstance(self.controlnet.nets[control_index], ControlNetModel_Union)):
            for k, v in union_controlnets.items():                                        
                if self.controlnet.nets[control_index].device != "cuda":
                    self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to(_device)                                                                                                     
                new_control_type = [0] * 8
                controlnet_cond_list = [0] * 8
                new_control_type[k] = 1
                control_type = torch.Tensor(new_control_type)
                control_type = control_type.reshape(1, -1).to(_device, dtype=dtype).repeat(batch_size * num_images_per_prompt * (3 if self.do_perturbed_attention_guidance else 2), 1)                
                controlnet_cond_list[k] = control_image[k]

                added_cond_kwargs["control_type"]=control_type
                down_block_res_samples, mid_block_res_sample = self.controlnet.nets[control_index](
                    control_model_input,
                    t,
                    encoder_hidden_states=controlnet_prompt_embeds,
                    controlnet_cond_list=controlnet_cond_list,
                    conditioning_scale=cond_scale[k],
                    guess_mode=guess_mode,
                    added_cond_kwargs=controlnet_added_cond_kwargs,
                    return_dict=False,
                )
                # controlnet mask
                control_net_union_index = k
                if (apply_control_masks[control_net_union_index]):                     
                    if control_mask is not None and len(control_mask) > 0 and control_mask[control_net_union_index] is not None:
                        down_block_res_samples, mid_block_res_sample = self.apply_mask(control_mask[control_net_union_index], _device, dtype, down_block_res_samples, mid_block_res_sample)
                down_block_res_samples_list.append(down_block_res_samples)
                mid_block_res_sample_list.append(mid_block_res_sample)
                
            if self.controlnet.nets[control_index].device != "cpu": #release memory to next controlnet
                self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to("cpu") 

    if mid_block_res_sample_list:
        mid_block_res_sample = torch.stack(mid_block_res_sample_list).sum(dim=0)

    if down_block_res_samples_list:
        down_block_res_samples = [torch.stack(down_block_res_samples).sum(dim=0) 
                                for down_block_res_samples in zip(*down_block_res_samples_list)]
    
else:
    #set conditioning_scale
    if isinstance(controlnet_keep[i], list):
        cond_scale = [float(c * s) for c, s in zip(controlnet_conditioning_scale, [controlnet_keep[i][control_index]] * len(controlnet_conditioning_scale))]
    else:
        controlnet_cond_scale = controlnet_conditioning_scale
        if isinstance(controlnet_cond_scale, list):
            controlnet_cond_scale = controlnet_cond_scale[0]
        cond_scale = controlnet_cond_scale * controlnet_keep[i]

    self.controlnet.to(_device)   
    controlnet_prompt_embeds = prompt_embeds

    down_block_res_samples, mid_block_res_sample = self.controlnet(
        control_model_input,
        t,
        encoder_hidden_states=controlnet_prompt_embeds,
        controlnet_cond=control_image[0],
        conditioning_scale=cond_scale,
        guess_mode=guess_mode,
        added_cond_kwargs=controlnet_added_cond_kwargs,
        return_dict=False,
    )

    # controlnet mask
    if apply_control_masks[0]:                
        if control_mask is not None and len(control_mask) > 0:
            down_block_res_samples, mid_block_res_sample = self.apply_mask(control_mask[0], _device, dtype, down_block_res_samples, mid_block_res_sample)

if guess_mode and do_classifier_free_guidance:
    # Inferred ControlNet only for the conditional batch.
    # To apply the output of ControlNet to both the unconditional and conditional batches,
    # add 0 to the unconditional batch to keep it unchanged.
    down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
    mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])

if prompt_image_emb_ip is not None and len(prompt_image_emb_ip) > 0: 
    added_cond_kwargs["image_embeds"] = prompt_image_emb_ip                         

#noise_predict code ....

I receive these parameters for the controlnet in the call method:

control_image: PipelineImageInput = None,  # fixed list of 8 images, one per control type position
control_mask = None,  # optional, but if you use apply_control_masks you need to send a fixed list of 8 mask images
union_control_type = None,  # optional, receives a List[int] of 8 values, each 1 or 0
apply_control_masks: List[bool] = [],
controlnet_conditioning_scale: Union[float, List[float]] = 1.0,
control_guidance_start: Union[float, List[float]] = 0.0,
control_guidance_end: Union[float, List[float]] = 1.0,
guess_mode: bool = False,

My implementation is based on the original code from controlnet_union and on the InstantID pipeline with multicontrolnet,
and I only changed this in the original MultiControlNetModel class:

def __init__(self, controlnets: Union[List[ControlNetModel], Tuple[ControlNetModel], List[ControlNetModel_Union], Tuple[ControlNetModel_Union]]):
    super().__init__()
    self.nets = nn.ModuleList(controlnets)

This way I can use ControlNetUnion together with ControlNetModel. In practice I only pass the MultiControlNet class, even if I am only setting a single ControlNet model on it, but you can pass a MultiControlNet class, ControlNetUnion or ControlNetModel to the pipeline's controlnet attribute, like this:

self.controlnet: Union[ControlNetModel, List[ControlNetModel], Tuple[ControlNetModel], MultiControlNetModel] = None

If you are not using the MultiControlNet class, the params don't need to be fixed lists; just append to the list in the same sequence that the models were loaded.

@john09282922

Hi, I have an issue when using the IP-Adapter: MultiControlNetUnionModel does not have set_attn_processor.

thanks,

@guiyrt @hlky

@hlky
Member

hlky commented Feb 13, 2025

@john09282922 Issues should be raised here; include a reproduction and the traceback.

@john09282922

Is it possible to use the IP-Adapter with multicontrolnet_union? It might be a similar structure to multicontrolnet... I'm not sure how to run it with the IP-Adapter.

@guiyrt @hlky @yiyixuxu

@asomoza
Member

asomoza commented Feb 18, 2025

What do you mean? The pipeline has the IPAdapterMixin, so you can just use it like any other SDXL pipeline with IP Adapters; there's nothing that changes in how you use it.

@john09282922

john09282922 commented Feb 18, 2025

What do you mean? The pipeline has the IPAdapterMixin, so you can just use it like any other SDXL pipeline with IP Adapters; there's nothing that changes in how you use it.

Hi, I got an issue when using the inpainting model. I just changed the current pipeline to the inpaint pipeline, and when using multi-controlnet_union there is a set_attn_processor issue in the MultiControlNetUnion class: it doesn't have that method.

@asomoza
Member

asomoza commented Feb 18, 2025

Hi, I got an issue when using the inpainting model. I just changed the current pipeline to the inpaint pipeline, and when using multi-controlnet_union there is a set_attn_processor issue in the MultiControlNetUnion class: it doesn't have that method.

I see, the problem is not with IP Adapters. This PR only introduces MultiControlNetUnion to StableDiffusionXLControlNetUnionPipeline, so it won't work with any other controlnet pipeline.

@john09282922

Hi, I got an issue when using the inpainting model. I just changed the current pipeline to the inpaint pipeline, and when using multi-controlnet_union there is a set_attn_processor issue in the MultiControlNetUnion class: it doesn't have that method.

I see, the problem is not with IP Adapters. This PR only introduces MultiControlNetUnion to StableDiffusionXLControlNetUnionPipeline, so it won't work with any other controlnet pipeline.

Okay, can you merge it into the other controlnet pipelines?

@asomoza
Member

asomoza commented Feb 18, 2025

Okay, can you merge it into the other controlnet pipelines?

We leave those kinds of tasks to the community, if they want to do it and there's popular demand for it. It should be relatively easy, and you can open a feature request for it, but I don't think people will take it up, for two reasons:

  • It's redundant, since controlnet union has an inpainting mode.
  • You get a lot better results with it than with an inpainting pipeline and model.
