Cosmos #10660
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
To match our sigmas to the original exactly, without any rounding errors, I had to use
Also, we only match the sigmas if we set our
The latest push makes it so that the Video2World models can run end-to-end with diffusers. The T2W pipeline produces good outputs, but the V2W pipeline still generates garbage -- I'm debugging it. I've matched the transformers for both T2W and V2W though (everything is matching and I've updated the description with test code), so the bug is most likely in the pipeline implementation.
```python
for i, t in enumerate(timesteps):
    if self.interrupt:
        continue

    self._current_timestep = t
    timestep = t.expand(latents.shape[0]).to(transformer_dtype)

    current_sigma = self.scheduler.sigmas[i]
    is_augment_sigma_greater = augment_sigma >= current_sigma

    current_cond_indicator = cond_indicator * 0 if is_augment_sigma_greater else cond_indicator
    cond_noise = randn_tensor(latents.shape, generator=generator, device=device, dtype=torch.float32)
    cond_latent = conditioning_latents + cond_noise * augment_sigma[:, None, None, None, None]
    cond_latent = current_cond_indicator * cond_latent + (1 - current_cond_indicator) * latents
    cond_latent = self.scheduler.scale_model_input(cond_latent, t)
    cond_latent = cond_latent.to(transformer_dtype)

    noise_pred = self.transformer(
        hidden_states=cond_latent,
        timestep=timestep,
        encoder_hidden_states=prompt_embeds,
        fps=fps,
        condition_mask=cond_mask,
        padding_mask=padding_mask,
        return_dict=False,
    )[0]

    if self.do_classifier_free_guidance:
        current_uncond_indicator = uncond_indicator * 0 if is_augment_sigma_greater else uncond_indicator
        uncond_noise = randn_tensor(latents.shape, generator=generator, device=device, dtype=torch.float32)
        uncond_latent = conditioning_latents + uncond_noise * augment_sigma[:, None, None, None, None]
        uncond_latent = current_uncond_indicator * uncond_latent + (1 - current_uncond_indicator) * latents
        uncond_latent = self.scheduler.scale_model_input(uncond_latent, t)
        uncond_latent = uncond_latent.to(transformer_dtype)

        noise_pred_uncond = self.transformer(
            hidden_states=uncond_latent,
            timestep=timestep,
            encoder_hidden_states=negative_prompt_embeds,
            fps=fps,
            condition_mask=uncond_mask,
            padding_mask=padding_mask,
            return_dict=False,
        )[0]
        noise_pred = torch.cat([noise_pred_uncond, noise_pred])

    # pred_original_sample (x0)
    noise_pred = self.scheduler.step(noise_pred, t, latents, return_dict=False)[1]
    self.scheduler._step_index -= 1

    if self.do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2, dim=0)
        noise_pred_uncond = (
            current_uncond_indicator * conditioning_latents
            + (1 - current_uncond_indicator) * noise_pred_uncond
        )
        noise_pred_cond = (
            current_cond_indicator * conditioning_latents + (1 - current_cond_indicator) * noise_pred_cond
        )
        noise_pred = noise_pred_cond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)
    else:
        noise_pred = (
            current_cond_indicator * conditioning_latents + (1 - current_cond_indicator) * noise_pred
        )

    # pred_sample (eps)
    latents = self.scheduler.step(
        noise_pred, t, latents, return_dict=False, pred_original_sample=noise_pred
    )[0]
```
Not too happy about this hack around our scheduler, but it seems like the only way to make it work (the outputs are at least no longer random garbage; following the conditioning is still not fixed).
The original code seems to apply CFG on the x0-prediction, followed by obtaining the eps-prediction. I've made the same change related to CFG on the x0-pred in the Text-to-World pipeline as well. Would be great if a second set of eyes could give it a look, but I do think the implementation is right. The hack around `scheduler._step_index` is necessary to make sure we can compute the eps-pred using the augmented x0-pred.
Here's the relevant code:
- Text-To-World x0 function that performs x0-pred + CFG: https://github.com/NVIDIA/Cosmos/blob/f0679c80bb0f8af01dace7f81bb685a0f8890ba9/cosmos1/models/diffusion/model/model_t2w.py#L179-L190
- Text-To-World denoise function that performs the x0-pred: https://github.com/NVIDIA/Cosmos/blob/f0679c80bb0f8af01dace7f81bb685a0f8890ba9/cosmos1/models/diffusion/model/model_t2w.py#L208-L224
- Video-To-World x0 function that performs x0-pred + CFG (also calls into Text-To-World parent class): https://github.com/NVIDIA/Cosmos/blob/f0679c80bb0f8af01dace7f81bb685a0f8890ba9/cosmos1/models/diffusion/model/model_v2w.py#L270-L286
- Video-To-World denoise function that performs augmented x0-pred: https://github.com/NVIDIA/Cosmos/blob/f0679c80bb0f8af01dace7f81bb685a0f8890ba9/cosmos1/models/diffusion/model/model_v2w.py#L145-L149
- eps-pred from x0-preds: https://github.com/NVIDIA/Cosmos/blob/f0679c80bb0f8af01dace7f81bb685a0f8890ba9/cosmos1/models/diffusion/diffusion/functional/runge_kutta.py#L114
I'm still trying to figure out the bug related to conditioning:
output2.mp4
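A minimal sketch of the order of operations described above, assuming the EDM-style parameterization where `x_t = x0 + sigma * eps`; the function and variable names are illustrative, not the pipeline's actual ones. CFG is applied to the x0-predictions first, and the eps-prediction is recovered from the guided x0-prediction afterwards:

```python
import torch

def guided_eps_from_x0(x_t, x0_cond, x0_uncond, sigma, guidance_scale):
    # CFG is applied in x0-space first, mirroring the form used in the loop
    # above: cond + scale * (cond - uncond).
    x0_guided = x0_cond + guidance_scale * (x0_cond - x0_uncond)
    # The eps-prediction is then recovered from the guided x0-prediction
    # via the EDM relation x_t = x0 + sigma * eps.
    eps = (x_t - x0_guided) / sigma
    return x0_guided, eps

# Toy shapes only; the real latents are 5D video tensors.
x_t = torch.randn(1, 4)
x0_c, x0_u = torch.randn(1, 4), torch.randn(1, 4)
x0, eps = guided_eps_from_x0(x_t, x0_c, x0_u, sigma=7.0, guidance_scale=7.0)
```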
After even more hacking around our scheduler design, I finally seem to get something decent. It's still not following the conditioning fully, but I don't really see any glaring differences any more 😅
Edit: this video looks buggy because I'm using a completely different prompt compared to the input image. But the latest version handles the conditioning correctly if provided a related image and prompt.
output2.mp4
Well done! This will go a long way in supporting the wider adoption of Cosmos.
Left a comment @a-r-r-o-w, changes are good either way though!
```python
noise_pred = self.scheduler.step(noise_pred, t, sample, return_dict=False)[1]
self.scheduler._step_index -= 1
```
Can we do

```python
noise_pred = self.scheduler.precondition_outputs(sample, noise_pred, current_sigma)
```

It's safe to use `sigma` as `sigma_hat`. `s_tmin` and `s_tmax` are rarely used (I've never seen them used myself) and are not supported in some other schedulers for that reason; in turn `gamma` is `0` and `sigma_hat` is the same as `sigma`.
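For context, a rough sketch of how `sigma_hat` relates to `sigma` in Karras/EDM-style schedulers such as diffusers' `EulerDiscreteScheduler`; the parameter names follow that scheduler, and defaults may differ elsewhere:

```python
import math

def compute_sigma_hat(sigma, num_sigmas, s_churn=0.0, s_tmin=0.0, s_tmax=float("inf")):
    # Stochastic churn only applies when sigma lies inside [s_tmin, s_tmax].
    if s_tmin <= sigma <= s_tmax:
        gamma = min(s_churn / (num_sigmas - 1), math.sqrt(2) - 1)
    else:
        gamma = 0.0
    # With the default s_churn = 0, gamma is 0, so sigma_hat == sigma and
    # passing the current sigma directly is safe.
    return sigma * (gamma + 1)
```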
This PR doesn't seem to include the guardrail model: https://huggingface.co/nvidia/Cosmos-1.0-Guardrail
@asfiyab-nvidia I didn't think to add the guardrail models because they essentially work as preprocessors/postprocessors outside the core diffusion-related aspects. Can definitely do a follow-up adding support for it. Additionally, the prompt upsampler isn't added for similar reasons. The upsampling can be run via any language model (independent of diffusers), but I'll update the docs to point to Pixtral-12B, as used in the original codebase, as an example. This PR contains only the parts relevant for running the diffusion sampling and generating videos.
@a-r-r-o-w Not including the guardrail model violates the license in https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World. cc @pjannaty for comment on this
@asfiyab-nvidia Thanks for the notice! I didn't check the license until now. In that case, I'll implement the guardrails tomorrow.
@asfiyab-nvidia @pjannaty The CosmosGuardrail has been integrated as well. The relevant class to review is
Thanks for patiently reviewing this! If everything looks good to merge, please let us know. We plan to do a diffusers release over the weekend or on Monday. It would be great to ship the Cosmos integration as well for this release cycle. In order to proceed with that, we'll have to host diffusers-format weights for the following repositories:
To host the weights, none of the existing files will be modified apart from README.md (which we can use to showcase how to run inference with diffusers). The diffusers-format folder structure would look something like:

I've opened an example PR for the 7B Text-to-World weights here: https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World/discussions/9. Once I have the go-ahead from your end that these changes are good, I can open PRs to all the other repositories.
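For illustration only (the exact layout lives in the linked PR, and the component names below are assumptions rather than the final ones), a diffusers-format pipeline repo typically adds something like the following alongside the untouched original checkpoint files:

```
Cosmos-1.0-Diffusion-7B-Text2World/
├── model_index.json   # maps each pipeline component to its class
├── scheduler/
├── text_encoder/
├── tokenizer/
├── transformer/
├── vae/
└── README.md          # updated with a diffusers usage example
```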
@a-r-r-o-w I'm running into the below issue during pipeline load, FYI. Is this expected?
Another note re the attention definition here. Enabling GQA breaks ONNX export due to https://github.com/pytorch/pytorch/blob/main/torch/onnx/symbolic_opset14.py#L152. Can this be addressed?
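One possible workaround, sketched under the assumption that the attention uses `torch.nn.functional.scaled_dot_product_attention` with `[batch, heads, seq_len, head_dim]` tensors, is to expand the key/value heads manually instead of passing `enable_gqa=True`; this is numerically equivalent and avoids the unsupported export path:

```python
import torch
import torch.nn.functional as F

def sdpa_gqa_compatible(query, key, value):
    # query:     [batch, num_q_heads,  seq_len, head_dim]
    # key/value: [batch, num_kv_heads, seq_len, head_dim], num_kv_heads < num_q_heads
    n_rep = query.shape[1] // key.shape[1]
    # Repeat the KV heads so every query head has a matching KV head. This is
    # numerically equivalent to enable_gqa=True but only uses ops the ONNX
    # exporter already handles.
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(query, key, value)
```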
@asfiyab-nvidia I'm testing a non- [...]. I'm not sure why you get the error about [...].

```python
import torch
from diffusers import CosmosPipeline
from diffusers.utils import export_to_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
pipe = CosmosPipeline.from_pretrained(model_id, revision="refs/pr/9", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

output.mp4

I'll try to dig in more soon to see if it errors out with a different environment.
The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself.
WIP.

Transformer
- test attention
- test ff
- test timesteps
- test patch embed
- test positional embed
- test transformer block
- test transformer
- test transformer video

VAE
- test vae attention
- test vae
Text-to-World:
Video-to-World (image-conditioning):
Video-to-World (video-conditioning):
Note that the model repos are not yet compatible with diffusers-loading. I'll open PRs for the weights once the NVIDIA team gives the thumbs up.
Inference code (old)