CogView4 (supports different length c and uc) #10649
base: main
Conversation
Implement the basic CogView4 pipeline structure with the following changes:
- Add CogView4 pipeline implementation
- Implement DDIM scheduler for CogView4
- Add CogView3Plus transformer architecture
- Update embedding models

Current limitations:
- CFG implementation uses padding for sequence length alignment
- Need to verify transformer inference alignment with Megatron

TODO:
- Consider separate forward passes for condition/uncondition instead of the padding approach
…n CogView4 pipeline

Split the forward pass for conditional and unconditional predictions in the CogView4 pipeline to match the original implementation. The noise prediction is now done separately for each case before combining them for guidance. However, the results still need improvement; this is a work in progress, as the generated images do not yet match the expected quality.
@zRzRzRzRzRzRzR @OleehyO Thanks for the PR! I'm going to take a look soon and try to help with debugging Megatron -> Diffusers. I see some additional changes made to files that are not relevant to CogView (currently more than 200 files have been changed 😅). Could you revert those changes by doing something like …
The PR looks good to merge from the modeling and pipeline perspective. For the scheduler review, I will defer to @yiyixuxu. From a quick look, I think we don't need a specific scheduler for CogView4, but YiYi has a better understanding of it.
The outputs don't match the values (numerically; they still match visually) from before I took over with the latest changes. The differences are because:
- RoPE is computed in float32.
- Latents are always kept in float32 and cast to the transformer dtype before the forward pass (sketched below). Previously, it was bfloat16 (latent creation + transformer forward) -> float32 (scheduler step) -> bfloat16 again (transformer forward).
- Changed how `multi_modulate` and `multi_gate` work. This should practically make no difference in outputs, but there seems to be a difference on the order of 1e-4 due to not performing the layer norm on the concatenated `encoder_hidden_states` and `hidden_states`.
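A minimal sketch of the dtype flow described in the second point, assuming a generic flow-matching denoising loop (the function and variable names here are illustrative, not the PR's exact code):

```python
import torch

def denoise(transformer, scheduler, timesteps, latents: torch.Tensor, cond_kwargs: dict) -> torch.Tensor:
    # Latents stay in float32 across the whole loop; only the transformer
    # forward runs in the model dtype (e.g. bfloat16).
    latents = latents.float()
    for t in timesteps:
        noise_pred = transformer(latents.to(transformer.dtype), timestep=t, **cond_kwargs)
        # Scheduler math happens in float32.
        latents = scheduler.step(noise_pred.float(), t, latents).prev_sample
    return latents
```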
The reason for two forward passes in the pipeline is that the conditional and unconditional embeddings can have different sequence lengths.
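For illustration, a hedged sketch of what that looks like in a single denoising step (argument names are assumed, not the exact pipeline code):

```python
def cfg_step(transformer, latents, t, prompt_embeds, negative_prompt_embeds, guidance_scale):
    # Two separate forward passes, because prompt_embeds and
    # negative_prompt_embeds may have different sequence lengths and
    # therefore cannot simply be concatenated into one batch.
    noise_pred_cond = transformer(latents, encoder_hidden_states=prompt_embeds, timestep=t)
    noise_pred_uncond = transformer(latents, encoder_hidden_states=negative_prompt_embeds, timestep=t)
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
```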
The model for testing is available here (private): https://huggingface.co/ZP2HF/CogView4-6B-0125. It has the weights in the same format as the model that will be released next week. BUT, these are not the final weights; based on my conversation with Yuxuan, an improved version will be released.
looks good to me! I think we still need to see if we can use an existing scheduler though
```python
from .scheduling_utils import SchedulerMixin


class CogViewScheduler(SchedulerMixin, ConfigMixin):
```
this is still not resolved, no?
@yiyixuxu I think we should be good to merge. Could you review the scheduler-related changes? I've mentioned the relevant parts.
With these changes, we match the implementation from before I took over with the latest updates. Outputs are identical.
```diff
@@ -98,13 +98,17 @@ def __init__(
         use_karras_sigmas: Optional[bool] = False,
         use_exponential_sigmas: Optional[bool] = False,
         use_beta_sigmas: Optional[bool] = False,
+        time_shift_type: str = "exponential",
```
We need this new parameter because CogView4 performs timestep shifting without first exponentiating `mu`.
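For context, a minimal sketch of the two dynamic-shift variants (the exact helper names in the scheduler may differ; `mu` is the resolution-dependent shift, `sigma` is an exponent that defaults to 1.0, and `t` is the sigma value in (0, 1]):

```python
import math

def time_shift_exponential(mu: float, sigma: float, t: float) -> float:
    # Flux-style shifting: mu is exponentiated before shifting.
    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)

def time_shift_linear(mu: float, sigma: float, t: float) -> float:
    # CogView4-style shifting: mu is used directly, without exponentiation.
    return mu / (mu + (1 / t - 1) ** sigma)
```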
cc @hlky here too maybe if you think there's a better way to handle this
This is ok IMO, will let @yiyixuxu comment as well. We'll have 3 types of shifting (non-dynamic, dynamic with exponential shift, dynamic with linear shift) after this, so it can be looked at with the scheduler refactor.
looks good to me!
```python
timesteps = (
    np.linspace(self.scheduler.config.num_train_timesteps, 1.0, num_inference_steps)
    if timesteps is None
    else np.array(timesteps)
)
timesteps = timesteps.astype(np.int64)
```
This is a bit different from what we usually do. We don't use the `self.scheduler.timesteps` returned from the call to `retrieve_timesteps`, because those timesteps are from after resolution-based timestep shifting is applied. For CogView4, it seems like we need to use the timesteps from before applying shifting, but sigmas from after applying shifting.
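In pipeline terms, that would look roughly like the following, assuming a configured `FlowMatchEulerDiscreteScheduler` as `scheduler` and a resolution-dependent `mu` (a sketch, not the PR's exact code):

```python
import numpy as np

num_train_timesteps, num_inference_steps = 1000, 50

# Sigmas are set with the resolution-based shift applied inside the scheduler ...
sigmas = np.linspace(1.0, 1.0 / num_train_timesteps, num_inference_steps)
scheduler.set_timesteps(sigmas=sigmas.tolist(), mu=mu, device=device)

# ... while the timesteps passed to the transformer come from *before* shifting.
timesteps = np.linspace(num_train_timesteps, 1.0, num_inference_steps).astype(np.int64)
```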
that's really weird - but if it's on purpose (not an oversight or something), we need to support it from `retrieve_timesteps`, and also the `set_timesteps` method of the scheduler needs to accept both custom timesteps and custom sigmas. Either that, or maybe add an option to calculate timesteps based on the pre-shifting sigmas.
otherwise I don't think it would function correctly for img2img or training, where you do not start from the first timestep and need to search against `self.scheduler.timesteps` - e.g. this function won't work:
```python
def index_for_timestep(self, timestep, schedule_timesteps=None):
```
We don't have access to the original codebase yet, so it will be hard to check whether it's an oversight. It is weird that we have to do it this way, but if we don't do it (that is, use sigmas corresponding to the timesteps), the final outputs come out with some residual noise.
Also, it seems like in my latest update I made a mistake doing `timesteps.astype(np.float32)`, from some local testing. Basically, we want integer timesteps here first (to round down the float values from `linspace`), but then need float32 timesteps for our scheduler to not raise an error:
diffusers/src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py, lines 366 to 372 at `a0c2299`:

```python
raise ValueError(
    (
        "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
        " `EulerDiscreteScheduler.step()` is not supported. Make sure to pass"
        " one of the `scheduler.timesteps` as a timestep."
    ),
)
```
So, it will have to be something like `timesteps.astype(np.int64).astype(np.float32)` to be consistent with the behaviour from when we started updating the PR, and to not error out in our scheduler.
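Concretely, a hedged sketch of that double cast (illustrative values, not the final pipeline code):

```python
import numpy as np

num_train_timesteps, num_inference_steps = 1000, 50
timesteps = np.linspace(num_train_timesteps, 1.0, num_inference_steps)
# Round the float values down to integers first, then cast back to float32 so
# that the scheduler's step() does not reject integer timesteps.
timesteps = timesteps.astype(np.int64).astype(np.float32)
```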
We could return timesteps without shifting, then apply shifting on the fly in `scheduler.step`?
> it will have to be something like `timesteps.astype(np.int64).astype(np.float32)`

ok, it seems like custom timesteps might be the way to go, because this logic here is just really custom (even if we calculate the timesteps without shifting, we also need to do this rounding first)
basically, you need to:
- remove the ValueError about passing sigmas and timesteps at the same time
- add `timesteps` to `set_timesteps`:

```python
timesteps: Optional[List[int]] = None,
```
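A hedged sketch of what `set_timesteps` could look like after those two changes (parameter names follow the diffusers convention, but this is not the merged signature):

```python
from typing import List, Optional, Union

import torch

def set_timesteps(
    self,
    num_inference_steps: Optional[int] = None,
    device: Union[str, torch.device, None] = None,
    sigmas: Optional[List[float]] = None,
    timesteps: Optional[List[float]] = None,  # new: custom timesteps, allowed together with custom sigmas
    mu: Optional[float] = None,
):
    # The previous guard that raised a ValueError when both `sigmas` and
    # `timesteps` were passed would be removed, since CogView4 needs
    # shifted sigmas paired with unshifted timesteps.
    ...
```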
I'll do some more testing with a fresh mind in the morning to verify whether we need differing timesteps vs. sigmas here (to recheck whether it is a possible oversight or not). I do think I did everything correctly when I tried earlier today, and as a result we might need to do what you mentioned, but it wouldn't hurt to delay a little longer and verify again, since it might save us a bunch of changes.
@zRzRzRzRzRzRzR If it would be possible to share just the scheduler-implementation-related files with us, it would really help us understand whether changes are required. No problem if not :) We can wait for the official release from THUDM and update our implementation.
```python
scheduler = FlowMatchEulerDiscreteScheduler(
    base_shift=0.25, max_shift=0.75, base_image_seq_len=256, use_dynamic_shifting=True, time_shift_type="linear"
)
```
@zRzRzRzRzRzRzR The converted checkpoints will also have to update the scheduler config (since we're no longer using the `CogView4DDIMScheduler`).
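For example, a hedged sketch of updating a converted checkpoint, assuming the pipeline class lands as `CogView4Pipeline` and using the private test repo mentioned above (the output path is hypothetical):

```python
from diffusers import CogView4Pipeline, FlowMatchEulerDiscreteScheduler

pipe = CogView4Pipeline.from_pretrained("ZP2HF/CogView4-6B-0125")
# Swap in the flow-match scheduler with the config values from this PR.
pipe.scheduler = FlowMatchEulerDiscreteScheduler(
    base_shift=0.25, max_shift=0.75, base_image_seq_len=256,
    use_dynamic_shifting=True, time_shift_type="linear",
)
pipe.save_pretrained("CogView4-6B-converted")  # writes the updated scheduler config
```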
I have checked, and the modification in this part makes sense.
```python
scheduler = FlowMatchEulerDiscreteScheduler(
    base_shift=0.25, max_shift=0.75, base_image_seq_len=256, use_dynamic_shifting=True, time_shift_type="linear"
)
```
@zRzRzRzRzRzRzR Same comment as above
Based on the current algorithm comparison and the images produced in practice, this change seems to be functioning properly.
Co-authored-by: YiYi Xu <[email protected]>
What does this PR do?
This PR aims to add support for CogView4 in diffusers, where the conditional (c) and unconditional (uc) embeddings can have different lengths.
We have reproduced the algorithm implementation, but this PR still requires further refinement. Currently, the output is pure green noise, so this PR remains in draft status and requires help from @a-r-r-o-w and @yiyixuxu.