MultiControlNetUnionModel on SDXL #10747

Conversation
Some points I had in mind:
# Example, [2,5] -> (0,0,1,0,0,1,0,0)
if isinstance(controlnet, ControlNetUnionModel):
control_type = torch.zeros(controlnet.config.num_control_type).scatter(0, torch.tensor(control_mode), 1)
elif isinstance(controlnet, MultiControlNetUnionModel):
control_type = [
torch.zeros(controlnet_.config.num_control_type).scatter(0, torch.tensor(control_mode_), 1)
for control_mode_, controlnet_ in zip(control_mode, self.controlnet.nets)
]
- control_image/image is the standard naming, we'd like to keep it consistent across pipelines
- That would be great
- We have that in the pipeline to avoid re-computing it for every sampling step. The original had it in the model and used nonzero, which caused a CUDA sync every step. We changed it to avoid the sync. Using scatter here is nice though, looks like we can use scatter_ to be in-place.
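A minimal sketch of that in-place variant (the values here are made up, just to illustrate the idea):

import torch

control_mode = [2, 5]   # hypothetical control modes
num_control_type = 8    # e.g. controlnet.config.num_control_type

# Build the one-hot control_type vector without nonzero(), avoiding the
# per-step CUDA sync; scatter_ writes into the zeros tensor in place.
control_type = torch.zeros(num_control_type)
control_type.scatter_(0, torch.tensor(control_mode), 1.0)
# control_type is now tensor([0., 0., 1., 0., 0., 1., 0., 0.])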
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: hlky <[email protected]>
👍, was checking because of #10131 (comment)
👍, will add new test class
That makes sense, keeping it as is and updating to use scatter_ instead
Hi @guiyrt, thanks for the awesome work. How can I use multi-controlnet-union? Is there example code? And can you also give an example of setting each condition scale?
@hlky feel free to merge once it looks good to you!
@guiyrt Can you check the output of
Yep, I'll post it here in a sec. I'm working on the tests, one of the issues is related to
Hi @john09282922, example inference code is above in a dropdown of the PR description, but I'll paste it here again :). To change condition scale, start and end, you just need to pass them as you would normally, but now in a list (see the sketch after the code below).

Inference code

import torch
from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_id = "brad-twinkl/controlnet-union-sdxl-1.0-promax"
controlnet = ControlNetUnionModel.from_pretrained(
"brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
controlnet=[controlnet, controlnet],
vae = AutoencoderKL.from_pretrained(
"madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
),
torch_dtype=torch.float16,
variant="fp16",
)
room_seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")
pipe.enable_model_cpu_offload()
image = pipe(
prompt="an astronaut in space",
width=1024,
height=1024,
negative_prompt="lowres, low quality, worst quality",
generator=torch.manual_seed(42),
guidance_scale=5,
num_inference_steps=50,
control_image=[[pose_img], [room_seg_img]],
control_mode=[[0], [5]]
).images[0]
image.save("result.jpg") |
@hlky Corporate needs you to find the differences between this picture and this picture
Inference code

import torch
from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet = ControlNetUnionModel.from_pretrained(
"brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
controlnet=controlnet,
vae = AutoencoderKL.from_pretrained(
"madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
),
torch_dtype=torch.float16,
variant="fp16",
)
room_seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")
pipe.enable_model_cpu_offload()
image = pipe(
prompt="an astronaut in a space station",
width=1024,
height=1024,
negative_prompt="lowres, low quality, worst quality",
generator=torch.manual_seed(42),
guidance_scale=5,
num_inference_steps=50,
control_image=[pose_img, room_seg_img],
control_mode=[0, 5],
).images[0]
image.save("result_main.jpg") |
Thanks @guiyrt. Just checking the output hasn't changed, I think we should be expecting the output to be the same between

cc @asomoza we're using control images of different resolutions here, and not resizing to the generation resolution. In your testing of ControlNetUnion, does this affect the result? Do you have any recommendations for two control images that are known to work together?
It works well for each condition, but to get the best results using both might require trying different
The control images are passed by the
I'm not particularly fond of the auto resizing of the images for controlnet, the good controlnets are really affected by bad resizings or resolutions, so I would prefer if it threw an error instead of auto scaling the images, especially if it messes up the aspect ratio, but that's another issue and not something for this PR.

@hlky, using a real scenario with real images of the correct sizes, and also with the correct resolution for the conditioning images (they have resolutions with the preprocessors too), one common scenario is to use a depth map together with one of the edges or lines conditions. I like to use teed with a special combination I learned from anyline. This combination makes it really easy to test because if you want a good image you'll need to lower the scales and the guidance ends, so you can tell when they're not working well together. This is my test with it:
In this case, the depth map is the one that gives the overall scene composition and lighting, and the teed one is the one that adds the details, especially for the river waves and the background. How does it compare to a single controlnet in main? I can test it with specific conditioning scales and ends for each one, so testing with both conditions at 1.0:
They're different but in theory they should produce the same result. To test this, I tested each condition separately with the single controlnet union:
There's definitely something going on here, but if you ask me, I prefer the result from the multicontrolnet in this PR over the original single one. Also, I mostly use it with multicontrolnets because I like to control each one with a different guidance scale and end. To me it seems that the single controlnet with multiple conditions doesn't apply the depth condition as much and takes the teed one more into account.
Thanks @guiyrt
I agree on this, is there something in place for the other controlnet pipelines regarding preserving aspect ratio? Otherwise, if you find value in this, I could work on something.
I don't think an equal result is expected from "single controlnet multi conditions" and "multi controlnet". From the standpoint of controlnet inputs, the control embedding is different, and the fused control condition is also different, right? The output should be similar, but not equal.
yeah, I probably worded it badly, that's what I meant, they're different when comparing them, and as I wrote before, I don't use multiple conditions in the same controlnet union except for testing. I'll do some more testing later. Also, I wouldn't make this something that important or a blocker, controlnet union is really weird and cool at the same time, you can even mix the conditions in one image and it will still work, also you can use conditions with other control types to get some interesting results and most of the time they will still work.
control_image=[[pose_img], [room_seg_img]],
control_mode=[[0], [5]]

why are those now list-of-lists?
isn't it because this is MultiControlNetUnion where each controlnet (union) accepts a list of images? If I understand correctly what you're saying, you're suggesting that we make the controlnet union a single-image controlnet when used with multi controlnets?
yes, exactly.
doing multi-of-multi is not something that can be effectively used - and it makes assembling correct params a complete nightmare and non-standard compared with any other controlnet.
@guiyrt If you apply a control mask to the pose control you can improve the segmentation controlnet, see this:
I am using conditioning scale, see: I changed the prompt to "an astronaut in space, inside spaceship" and only 30 steps. PS: I am not using your pipeline because I had already implemented controlnet union in my own pipeline before the official release, but I believe the result will be the same in your case; what should help is the control mask.
I looked into the Flux implementation for context, and it is indeed implemented differently. From what I got, there is no
I agree with you that in the current state, the list-of-lists input is confusing if you have many conditions or many controlnets, but I also don't think we should disregard the usage of multi-condition single execution if a controlnet union supports it. After giving some thought, maybe the ideal scenario would be the following:
Compared to the current version, this would have the benefit of a simplified interface for
The more complex interface comes from
And I also think we wouldn't need separate controlnet_union pipelines, such as
@vladmandic You requested this use case, @guiyrt has very kindly taken up MultiControlNetUnion, and I've done experiments to use scale per condition without MultiControlNetUnion, #10723. Yes, it is a different interface, it is a unique model, and all we need to do to handle this in any integration is a simple if/else.
@hlky @guiyrt i'll make it work and i definitely do appreciate the work, i'm just worried about the lack of standardization which significantly increases complexity for the end user.

if isinstance(control_mode, list) and isinstance(control_mode[0], float):
    control_mode = [[x] for x in control_mode]
    control_image = [[x] for x in control_image]
If you have a single controlnet union, you can still pass a flat list of control images and a flat list of control modes. But if you have more than one controlnet, there is no way around lists of lists, otherwise we couldn't infer which input goes to which controlnet (assuming you want to pass multiple conditions to a single controlnet union). I'll put here an example using three conditions, and the differences between using one and two controlnets.

Single controlnet union
import torch
from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image
controlnet = ControlNetUnionModel.from_pretrained(
"brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
controlnet=controlnet,
vae = AutoencoderKL.from_pretrained(
"madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
),
torch_dtype=torch.float16,
variant="fp16",
)
seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
seg_mode = 5
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")
pose_mode = 0
depth_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png")
depth_mode = 1
pipe.enable_model_cpu_offload()
image = pipe(
prompt="an astronaut in space",
width=1024,
height=1024,
negative_prompt="lowres, low quality, worst quality",
generator=torch.manual_seed(42),
guidance_scale=5,
num_inference_steps=50,
control_image=[pose_img, seg_img, depth_img],
control_mode=[pose_mode, seg_mode, depth_mode]
).images[0]
image.save("result.jpg") Multiple controlnet unionsNow, import torch
from diffusers import StableDiffusionXLControlNetUnionPipeline
from diffusers.models import ControlNetUnionModel, AutoencoderKL
from diffusers.utils import load_image
controlnet = ControlNetUnionModel.from_pretrained(
"brad-twinkl/controlnet-union-sdxl-1.0-promax", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
controlnet=[controlnet, controlnet],
vae = AutoencoderKL.from_pretrained(
"madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
),
torch_dtype=torch.float16,
variant="fp16",
)
seg_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/room_seg.png")
seg_mode = 5
pose_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/pose.png")
pose_mode = 0
depth_img = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/stormtrooper_depth.png")
depth_mode = 1
pipe.enable_model_cpu_offload()
image = pipe(
prompt="an astronaut in space",
width=1024,
height=1024,
negative_prompt="lowres, low quality, worst quality",
generator=torch.manual_seed(42),
guidance_scale=5,
num_inference_steps=50,
control_image=[[pose_img, seg_img], [depth_img]],
control_mode=[[pose_mode, seg_mode], [depth_mode]]
).images[0]
image.save("result_multi.jpg") This is a dummy experience, but I'll still post the outputs here.
What we could easily change is to assume that if a single value is passed and controlnet is
This would produce the effect you mentioned if you intend to pass a single condition to each controlnet union. If you passed
I'm happy to iterate on this, let's just define what is expected and then execute on that :)
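A rough sketch of that normalization (purely illustrative, not the final interface), assuming it runs early in the pipeline call:

if isinstance(self.controlnet, MultiControlNetUnionModel):
    # If flat lists were passed, wrap each element so every
    # ControlNetUnionModel receives exactly one condition.
    if control_image and not isinstance(control_image[0], list):
        control_image = [[image] for image in control_image]
    if control_mode and not isinstance(control_mode[0], list):
        control_mode = [[mode] for mode in control_mode]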
i'm ok with leaving this as-is - i'll add a special case handler on my side.
Hi @guiyrt I know it's late to share this now, but just to show how I did it in my pipeline. Inside the denoising loop:

# controlnet(s) inference
if guess_mode and do_classifier_free_guidance:
# Infer ControlNet only for the conditional batch.
control_model_input = latents
control_model_input = self.scheduler.scale_model_input(control_model_input, t)
controlnet_prompt_embeds = prompt_embeds.chunk(2)[1]
controlnet_added_cond_kwargs = {
"text_embeds": add_text_embeds.chunk(2)[1],
"time_ids": add_time_ids.chunk(2)[1],
}
else:
control_model_input = latent_model_input
controlnet_prompt_embeds = prompt_embeds
controlnet_added_cond_kwargs = added_cond_kwargs
if union_control_type is not None:
# controlnet union index
# 0 -- openpose
# 1 -- depth
# 2 -- hed/pidi/scribble/ted
# 3 -- canny/lineart/anime_lineart/mlsd
# 4 -- normal
# 5 -- segment
# 6 -- tile
# 7 -- repaint
union_controlnets = {k: v for k, v in enumerate(union_control_type)}
#reset blocks variable
down_block_res_samples = None
mid_block_res_sample = None
down_block_res_samples_list, mid_block_res_sample_list = [], []
if isinstance(self.controlnet, MultiControlNetModel):
total_controlnet=len(self.controlnet.nets)
for control_index in range(total_controlnet):
#set conditioning_scale
if isinstance(controlnet_keep[i], list):
cond_scale = [float(c * s) for c, s in zip(controlnet_conditioning_scale, [controlnet_keep[i][control_index]] * len(controlnet_conditioning_scale))]
else:
controlnet_cond_scale = controlnet_conditioning_scale
if isinstance(controlnet_cond_scale, list):
controlnet_cond_scale = controlnet_cond_scale[0]
cond_scale = controlnet_cond_scale * controlnet_keep[i][control_index]
if(isinstance(self.controlnet.nets[control_index], ControlNetModel)):
self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to(_device)
down_block_res_samples, mid_block_res_sample = self.controlnet.nets[control_index](
control_model_input,
t,
encoder_hidden_states=controlnet_prompt_embeds,
controlnet_cond=control_image[control_index],
conditioning_scale=cond_scale[control_index],
guess_mode=guess_mode,
added_cond_kwargs=controlnet_added_cond_kwargs,
return_dict=False)
# controlnet mask
if (apply_control_masks[control_index]):
if control_mask is not None and len(control_mask) > 0:
down_block_res_samples, mid_block_res_sample = self.apply_mask(control_mask[control_index], _device, dtype, down_block_res_samples, mid_block_res_sample)
down_block_res_samples_list.append(down_block_res_samples)
mid_block_res_sample_list.append(mid_block_res_sample)
if self.controlnet.nets[control_index].device != "cpu": #release memory to next controlnet
self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to("cpu")
elif (isinstance(self.controlnet.nets[control_index], ControlNetModel_Union)):
for k, v in union_controlnets.items():
if self.controlnet.nets[control_index].device != "cuda":
self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to(_device)
new_control_type = [0] * 8
controlnet_cond_list = [0] * 8
new_control_type[k] = 1
control_type = torch.Tensor(new_control_type)
control_type = control_type.reshape(1, -1).to(_device, dtype=dtype).repeat(batch_size * num_images_per_prompt * (3 if self.do_perturbed_attention_guidance else 2), 1)
controlnet_cond_list[k] = control_image[k]
added_cond_kwargs["control_type"]=control_type
down_block_res_samples, mid_block_res_sample = self.controlnet.nets[control_index](
control_model_input,
t,
encoder_hidden_states=controlnet_prompt_embeds,
controlnet_cond_list=controlnet_cond_list,
conditioning_scale=cond_scale[k],
guess_mode=guess_mode,
added_cond_kwargs=controlnet_added_cond_kwargs,
return_dict=False,
)
# controlnet mask
control_net_union_index = k
if (apply_control_masks[control_net_union_index]):
if control_mask is not None and len(control_mask) > 0 and control_mask[control_net_union_index] is not None:
down_block_res_samples, mid_block_res_sample = self.apply_mask(control_mask[control_net_union_index], _device, dtype, down_block_res_samples, mid_block_res_sample)
down_block_res_samples_list.append(down_block_res_samples)
mid_block_res_sample_list.append(mid_block_res_sample)
if self.controlnet.nets[control_index].device != "cpu": #release memory to next controlnet
self.controlnet.nets[control_index] = self.controlnet.nets[control_index].to("cpu")
if mid_block_res_sample_list:
mid_block_res_sample = torch.stack(mid_block_res_sample_list).sum(dim=0)
if down_block_res_samples_list:
down_block_res_samples = [torch.stack(down_block_res_samples).sum(dim=0)
for down_block_res_samples in zip(*down_block_res_samples_list)]
else:
#set conditioning_scale
if isinstance(controlnet_keep[i], list):
cond_scale = [float(c * s) for c, s in zip(controlnet_conditioning_scale, [controlnet_keep[i][control_index]] * len(controlnet_conditioning_scale))]
else:
controlnet_cond_scale = controlnet_conditioning_scale
if isinstance(controlnet_cond_scale, list):
controlnet_cond_scale = controlnet_cond_scale[0]
cond_scale = controlnet_cond_scale * controlnet_keep[i]
self.controlnet.to(_device)
controlnet_prompt_embeds = prompt_embeds
down_block_res_samples, mid_block_res_sample = self.controlnet(
control_model_input,
t,
encoder_hidden_states=controlnet_prompt_embeds,
controlnet_cond=control_image[0],
conditioning_scale=cond_scale,
guess_mode=guess_mode,
added_cond_kwargs=controlnet_added_cond_kwargs,
return_dict=False,
)
# controlnet mask
if apply_control_masks[0]:
if control_mask is not None and len(control_mask) > 0:
down_block_res_samples, mid_block_res_sample = self.apply_mask(control_mask[0], _device, dtype, down_block_res_samples, mid_block_res_sample)
if guess_mode and do_classifier_free_guidance:
# Infered ControlNet only for the conditional batch.
# To apply the output of ControlNet to both the unconditional and conditional batches,
# add 0 to the unconditional batch to keep it unchanged.
down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])
if prompt_image_emb_ip is not None and len(prompt_image_emb_ip) > 0:
added_cond_kwargs["image_embeds"] = prompt_image_emb_ip
#noise_predict code ....

I receive these parameters for controlnet on the call method:

control_image: PipelineImageInput = None,  # fixed list of 8 position images
control_mask = None,  # optional, but if you use apply_control_masks you need to send a fixed list of 8 mask images
union_control_type = None,  # optional, receives a List[int] of 8 entries, each 1 or 0
apply_control_masks: List[bool] = [],
controlnet_conditioning_scale: Union[float, List[float]] = 1.0,
control_guidance_start: Union[float, List[float]] = 0.0,
control_guidance_end: Union[float, List[float]] = 1.0,
guess_mode: bool = False,

My implementation is based on the original controlnet_union code and on the InstantId pipeline with multicontrolnet.

def __init__(self, controlnets: Union[List[ControlNetModel], Tuple[ControlNetModel], List[ControlNetModel_Union], Tuple[ControlNetModel_Union]]):
super().__init__()
self.nets = nn.ModuleList(controlnets)

This way I can use ControlNetUnion together with ControlNetModel. In practice I only send the MultiControlNet class even if I am setting a single ControlNetModel on it, but you can send a MultiControlNet class, ControlNetUnion or ControlNetModel on the pipeline controlnet attribute, like this:

self.controlnet: Union[ControlNetModel, List[ControlNetModel], Tuple[ControlNetModel], MultiControlNetModel] = None

If you are not using the MultiControlNet class, the params don't need to be fixed lists, just append to a list in the same sequence as the models were loaded.
@john09282922 Issues should be raised here, include a reproduction and the traceback.
what do you mean? the pipeline has the IPAdapterMixin, you can just use it like any other SDXL pipeline with IP Adapters, there's nothing that changes in the method to use it.
Hi, I got an issue when using an inpainting model. I just changed the current pipeline to the inpaint pipeline, and when using multi-controlnet-union, there is a set_attn_processor issue in the MultiControlNetUnion class; it doesn't have that method.
I see, the problem is not with IP Adapters, this PR only introduces
okay, can you merge it into the other controlnet pipelines?
We leave that kind of task to the community; if they want to do it and there's a popular need for it, it should be relatively easy and you can open a
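If someone picks this up, a rough, untested sketch of the missing forwarding method (assuming the wrapper keeps its sub-models in self.nets, like MultiControlNetModel does):

def set_attn_processor(self, processor):
    # Forward to every wrapped ControlNetUnionModel so pipelines that set
    # attention processors (e.g. for IP Adapters or inpainting) keep working.
    for net in self.nets:
        net.set_attn_processor(processor)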
What does this PR do?
New MultiControlNetUnionModel wrapper class to handle multiple ControlNetUnionModels, similarly to MultiControlNetModel. Addresses #10656 to control start, end and scale of each condition image.

Input
Inference code
First, I ran the pipeline as before, using a single ControlNetUnionModel with pose, segmentation and pose+segmentation conditions, to have outputs to compare with.

ControlNetUnionModel
MultiControlNetUnionModel
Two instances of ControlNetUnionModel, one got the segmentation conditioning and the other the pose. To compare the output, I set controlnet_conditioning_scale to [1.0, 0.0] and [0.0, 1.0] to compare with the output of single conditioning using ControlNetUnionModel above. As we see (and expect), these outputs are the same. The output is different when using both segmentation and pose conditioning, which I think is expected.
Who can review?
@hlky @yiyixuxu @vladmandic @asomoza