Add PhysicsIQ benchmark reproduction cookbook for Cosmos3#194
akashgokul wants to merge 1 commit into
90346ea to 79ddcae
79ddcae to 83f1ac9
" -i \"$V2V_FULL_INPUT\" \\\n",
" -o \"$V2V_FULL_OUTPUT_DIR\" \\\n",
" --checkpoint-path \"$CHECKPOINT\" \\\n",
" --no-guardrails"
Is `--no-guardrails` needed here? If it really is, could it be flagged so that it gets some security attention?
@lfengad The reason for `--no-guardrails` is that this notebook is used to reproduce our evaluation scores on the Physics-IQ benchmark. For the scores reported in our paper, we did not use guardrails. I believe this is fine: the Physics-IQ prompts are appropriate, and turning on guardrails may cause blurring that could lower the scores.
Please let me know if there is something I should do to handle this (e.g., add a warning notice about `--no-guardrails`).
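One way the warning notice could look is sketched below. It is a minimal illustration only: `build_generation_args` is a hypothetical helper, not part of the notebook or of cosmos-framework, and the entrypoint command itself is deliberately left out. The only flags used are the ones visible in the diff above (`-i`, `-o`, `--checkpoint-path`, `--no-guardrails`).

```python
import warnings


def build_generation_args(input_path, output_dir, checkpoint, guardrails=False):
    """Assemble the CLI flags shown in the diff (hypothetical helper).

    The entrypoint command is intentionally not included; prepend it
    in the notebook cell that actually launches generation.
    """
    args = [
        "-i", str(input_path),
        "-o", str(output_dir),
        "--checkpoint-path", str(checkpoint),
    ]
    if not guardrails:
        # Make the choice visible to anyone re-running the notebook.
        warnings.warn(
            "Guardrails are disabled to match the benchmark setup. "
            "Re-enable them before using prompts other than the Physics-IQ ones.",
            UserWarning,
            stacklevel=2,
        )
        args.append("--no-guardrails")
    return args
```

Defaulting to `guardrails=False` keeps the benchmark reproduction unchanged while making the trade-off explicit; flipping the default would be the more conservative choice if the notebook is expected to be reused with other prompts.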
Also, regarding "Need we move these into cookbooks/cosmos3/generator/physicsiq/ for consistency of the structure?": @mingyuliutw asked me to put this in the evaluation folder, as seen in this PR.
Yeah, I think keeping the benchmark part in a separate evaluation folder is more appropriate.
Can you upload the prompt file to Hugging Face and download it from there? It is quite large for a GitHub asset.
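If the prompt files were moved to the Hugging Face Hub, the notebook could fetch them roughly as follows. This is a sketch under stated assumptions: `fetch_prompt_file` is a hypothetical helper, the repository id is a placeholder that the authors would have to choose, and hosting the file as a `dataset` repository is an assumption. The only library call used is `hf_hub_download` from `huggingface_hub`.

```python
from pathlib import Path


def fetch_prompt_file(repo_id, filename, local_dir="assets", repo_type="dataset"):
    """Return a local path to a prompt file, downloading it only if missing.

    repo_id is a placeholder: no Hub repository is named in this PR.
    """
    target = Path(local_dir) / filename
    if target.exists():
        # Already staged locally (e.g. from a previous run); skip the network call.
        return target

    # Imported lazily so the helper can be defined without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type=repo_type,
        local_dir=local_dir,
    )
    return Path(downloaded)
```

A notebook cell would then call, for example, `fetch_prompt_file("<org>/<prompt-repo>", "i2v_prompts.json")`, with the repository id filled in once it exists.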
Adds an end-to-end notebook for reproducing the PhysicsIQ benchmark with Cosmos3-Super using the native cosmos-framework PyTorch entrypoint.

Location: evaluation/cosmos3/Physics_IQ/

Contents:
- run_with_cosmos_framework.ipynb: walks through the I2V and V2V task formats end-to-end (download the PhysicsIQ dataset, generate, stage, and optionally score with the official PhysicsIQ scorer).
- assets/i2v_prompts.json: 198 per-case I2V prompts + negative prompts
- assets/v2v_prompts.json: 198 per-case V2V prompts + negative prompts

Reference scores (Cosmos3-Super): I2V 43.8, V2V 59.7.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
83f1ac9 to 2b70e51
Adds an end-to-end notebook reproducing the PhysicsIQ benchmark with Cosmos3-Super (and Cosmos3-Nano) via the native cosmos-framework PyTorch entrypoint. Covers both I2V and V2V task formats with verified reference scores (I2V: 43.8, V2V: 59.7). Also adds the prompts we used for I2V and V2V in assets.
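Since the commit message states that each prompt file holds 198 per-case prompts, a quick sanity check before generation could look like the sketch below. It assumes only that the file is JSON whose top level is a list or an object with one entry per case; the actual schema is not shown in this PR, and `check_prompt_file` is a hypothetical helper.

```python
import json
from pathlib import Path

# Case count taken from the commit message; adjust if the assets change.
EXPECTED_CASES = 198


def count_prompt_entries(path):
    """Count top-level entries in a prompt file (list items or object keys)."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, (list, dict)):
        raise TypeError(f"{path}: expected a JSON list or object at the top level")
    return len(data)


def check_prompt_file(path, expected=EXPECTED_CASES):
    """Raise ValueError if the file does not contain the expected number of cases."""
    found = count_prompt_entries(path)
    if found != expected:
        raise ValueError(f"{path}: expected {expected} entries, found {found}")
    return found
```

Running this on both `assets/i2v_prompts.json` and `assets/v2v_prompts.json` before generation would catch a truncated or partially downloaded file early.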