[WIP] correct the attn naming for UNet3DConditionModel
#6873
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@yiyixuxu See my comment here regarding this. I think we would want to configure everything to use … Here's a draft PR with what I'm proposing.
I think another option is:
Actually I think this PR is very nice if we never use `num_attention_heads` in the config.json of our 3D models - can we confirm this?

These configs seem to (correctly) only use `attention_head_dim`:
- https://huggingface.co/cerspense/zeroscope_v2_576w/blob/6963642a64dbefa93663d1ecebb4ceda2d9ecb28/unet/config.json#L6
- https://huggingface.co/hotshotco/Hotshot-XL/blob/300d6a691ab6d62e74348f2e0d430e6d82ee2864/unet/config.json#L44
- https://huggingface.co/ali-vilab/text-to-video-ms-1.7b/blob/8227dddca75a8561bf858d604cc5dae52b954d01/unet/config.json#L6

=> So we should be good here!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
closing in favor of #6893
Why am I opening this PR?
The motivation for this PR is that I find it impossible to reason about how to configure a 3D UNet using `CrossAttnDownBlock3D`, `CrossAttnUpBlock3D` and `UNetMidBlock3DCrossAttn` - these 3 blocks all have an argument `num_attention_heads` which actually expects the value of `attention_head_dim`.

For example, consider the `__init__` method of `CrossAttnDownBlock3D` (sketched below).
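A minimal, paraphrased sketch of that method - not the verbatim diffusers source; the resnets, the temporal transformer, and most keyword arguments are omitted, and the class name `CrossAttnDownBlock3DSketch` is only for this illustration - showing how the block's `num_attention_heads` argument lands in `Transformer2DModel`'s `attention_head_dim` slot:

```python
# Paraphrased sketch (not the verbatim diffusers source) of CrossAttnDownBlock3D.__init__:
# the block's `num_attention_heads` is forwarded positionally into
# Transformer2DModel's *attention_head_dim* slot.
import torch.nn as nn

from diffusers.models import Transformer2DModel


class CrossAttnDownBlock3DSketch(nn.Module):
    def __init__(
        self,
        out_channels: int,
        num_layers: int = 1,
        num_attention_heads: int = 1,   # in practice this receives attention_head_dim (e.g. 64)
        cross_attention_dim: int = 1024,
    ):
        super().__init__()
        self.attentions = nn.ModuleList(
            [
                Transformer2DModel(
                    out_channels // num_attention_heads,  # fills Transformer2DModel's num_attention_heads
                    num_attention_heads,                  # fills Transformer2DModel's attention_head_dim (!)
                    in_channels=out_channels,
                    cross_attention_dim=cross_attention_dim,
                )
                for _ in range(num_layers)
            ]
        )
```

With e.g. `out_channels=320` and `num_attention_heads=64`, this builds `Transformer2DModel(5, 64, ...)`, i.e. 5 heads of size 64 - the value behaves as a head dimension, not a head count.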
It is not really obvious that `num_attention_heads` is supposed to be `attention_head_dim`. Only if you look closely at how the block creates `Transformer2DModel`, and then check against `Transformer2DModel`'s signature, will you notice that the `num_attention_heads` here is passed as `attention_head_dim` to `Transformer2DModel`, and from there all the way down to `Attention` as the per-head dimension.
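One quick way to see the mismatch without digging through the call chain is to print the two signatures side by side (a small sketch; the import paths are the ones I'm aware of and may differ between diffusers versions):

```python
# Compare the signatures: Transformer2DModel's first two positional arguments are
# (num_attention_heads, attention_head_dim), and Attention likewise distinguishes
# the head count from the per-head dimension.
import inspect

from diffusers.models import Transformer2DModel
from diffusers.models.attention_processor import Attention

print(inspect.signature(Transformer2DModel.__init__))
print(inspect.signature(Attention.__init__))
```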
All our text-to-video UNets are configured with `attention_head_dim`; for example, this one has `attention_head_dim = 64`.
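For instance, a quick way to confirm this from the Hub config (assuming network access; `ali-vilab/text-to-video-ms-1.7b` is one of the checkpoints linked in the review comment above):

```python
# Inspect the unet config of a text-to-video checkpoint: it specifies
# attention_head_dim (a per-head size), not num_attention_heads.
from diffusers import UNet3DConditionModel

config = UNet3DConditionModel.load_config("ali-vilab/text-to-video-ms-1.7b", subfolder="unet")
print(config.get("attention_head_dim"))   # expected: 64
print(config.get("num_attention_heads"))  # expected: None - not set in the config
```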
So what we did in `UNet3DConditionModel` is map `attention_head_dim` to `num_attention_heads`, i.e. inside `UNet3DConditionModel.__init__` we do `num_attention_heads = attention_head_dim` and then pass the value `64` around under the name `num_attention_heads`:
- `get_down_block(num_attention_heads=num_attention_heads)`
- `get_down_block` then calls `CrossAttnDownBlock3D(num_attention_heads=num_attention_heads)`
- `CrossAttnDownBlock3D` calls `Transformer2DModel` and `TransformerTemporalModel`
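Spelled out, a minimal paraphrase of that renaming chain (not the verbatim diffusers source, just the flow as I read it today):

```python
# Hedged paraphrase of the renaming chain inside UNet3DConditionModel, not verbatim source.
attention_head_dim = 64                   # what the 3D config.json actually provides
num_attention_heads = attention_head_dim  # the rename done in UNet3DConditionModel.__init__

# From here on, the value 64 travels under the wrong name:
#   get_down_block(..., num_attention_heads=num_attention_heads)
#     -> CrossAttnDownBlock3D(num_attention_heads=64)
#       -> Transformer2DModel(out_channels // 64, 64, ...)        # 64 lands back in attention_head_dim
#       -> TransformerTemporalModel(out_channels // 64, 64, ...)  # same story for the temporal layers
print(num_attention_heads)  # 64 - but it is really a per-head dimension
```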
It took me so much effort to figure out what's going on, and I'm still confused. I know this was introduced in PR #3797, but I'm not sure why.
Unlike the UNet2D models, the 3D models never really had the "wrong configuration" problem - they are configured with `attention_head_dim` instead of `num_attention_heads`, but that wasn't "wrong": `attention_head_dim = 64` actually means `attention_head_dim = 64`. This is different from the UNet2D models. I'm aware that UNet2D has the wrong-configuration issue - the `attention_head_dim` in its config file should really be `num_attention_heads` - and I'm aware that's why we had to assign `attention_head_dim` to `num_attention_heads` for UNet2D. But that is not the case for 3D.

Did we do it this way so that all the cross-attention blocks only accept one argument, `num_attention_heads`, instead of two different arguments? If so, I would argue that even though that is not ideal, it is still better than the current arrangement: with the current arrangement I find it very difficult to reason about, and even harder to explain to other people 😭😭😭

So I tried to correct the argument names and deprecate things in this PR. I'm curious why we did it this way (I think it's very likely I missed something), and I'm very much open to any other solution that can make this confusion go away :)
Tests
I will run the slow tests, but here is a quick sanity check to make sure that we are able to configure the number of heads and the per-head dimension of the `Attention` class correctly:
- for the 2D UNet model, we configure with `num_attention_heads = 8`
- for the 3D UNet models, we configure with `attention_head_dim = 64`

The results are as expected.
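Roughly, the 3D half of that check looks like the sketch below (a scaled-down config purely so it builds quickly; the `Attention` attribute names `heads` and `inner_dim` are my assumption and may differ across diffusers versions):

```python
# Minimal sanity check sketch: build a tiny UNet3DConditionModel with
# attention_head_dim=64 and look at the per-head dimension that the inner
# Attention layers actually end up with.
from diffusers import UNet3DConditionModel
from diffusers.models.attention_processor import Attention

unet = UNet3DConditionModel(
    block_out_channels=(64, 128),
    down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
    up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
    layers_per_block=1,
    cross_attention_dim=32,
    attention_head_dim=64,
)

# The per-head dimension should come out as 64 (the configured attention_head_dim),
# while the head count varies with the block width.
for name, module in unet.named_modules():
    if isinstance(module, Attention):
        print(f"{name}: heads={module.heads}, dim_per_head={module.inner_dim // module.heads}")
```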