
Commit f849dd4

add code
0 parents  commit f849dd4


52 files changed, +16002 -0 lines changed

.gitignore

+5
@@ -0,0 +1,5 @@
data/
*.egg-info
*.pyc
*.pyo
__pycache__

README.md

+105
@@ -0,0 +1,105 @@
SPAD : Spatially Aware Multiview Diffusers
===================================================
<h4>
Yash Kant, Ziyi Wu, Michael Vasilkovsky, Gordon Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov*, Igor Gilitschenski*, Aliaksandr Siarohin*
<br>
<span style="font-size: 14pt; color: #555555">
Published at CVPR, 2024
</span>
</h4>
<hr>

**Paper:** [https://arxiv.org/abs/2402.05235](https://arxiv.org/abs/2402.05235)

**Project Page:** [https://yashkant.github.io/spad/](https://yashkant.github.io/spad/)

<p align="center">
<img src="data/visuals/readme/spad_pipeline.png">
</p>

Model pipeline. (a) We fine-tune a pre-trained text-to-image diffusion model on multi-view renderings of 3D objects.
(b) Our model jointly denoises noisy multi-view images conditioned on text and relative camera poses. To enable cross-view interaction, we apply 3D self-attention by concatenating all views, and enforce epipolar constraints on the attention map.
(c) We further add Plücker embeddings to the attention layers as positional encodings to enhance camera control.
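
The epipolar constraint in (b) can be viewed as a boolean mask applied to the 3D self-attention scores. Below is a minimal, illustrative sketch of that idea in PyTorch; the function name, shapes, and mask construction are assumptions for exposition, not the repository's actual API (see `scripts/visualize_epipolar_mask.py` for the real masks).

```python
# Illustrative sketch of epipolar-masked 3D self-attention (not the repo's exact API).
import torch

def epipolar_masked_attention(q, k, v, epipolar_mask):
    """
    q, k, v: (batch, views * tokens, dim) -- all views concatenated along the token axis.
    epipolar_mask: (views * tokens, views * tokens) boolean, True where a key token lies
                   near the epipolar line of the query token (or within the same view).
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.einsum("bqd,bkd->bqk", q, k) * scale
    attn = attn.masked_fill(~epipolar_mask, float("-inf"))  # block non-epipolar pairs
    attn = attn.softmax(dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)
```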

## Filtered High-Quality Objaverse
If you are looking for the Objaverse assets we used to train SPAD models, you can find the list here: [filtered_objaverse.txt](https://github.com/yashkant/spad/data/filtered_objaverse.txt).

To see how this list was generated, or to tweak its parameters, try this Colab notebook: [filter_objaverse.ipynb](https://colab.research.google.com/drive/1UJM4caaBJsYOkP7EmjPjBvoJ7U0qY4kq#scrollTo=sR28TydbQUuT)

## Visualizing and Creating Epipolar Masks
If you would like to visualize the epipolar masks and Plücker embeddings, or use them as a separate module, read and run the following script:

```
python scripts/visualize_epipolar_mask.py
```
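
For reference, a Plücker ray embedding encodes each pixel ray by its direction and moment. The sketch below shows the generic construction, assuming per-pixel ray origins and directions are already available; the exact normalization and how the embedding enters the attention layers may differ in this repository.

```python
# Generic Plücker ray embedding: a ray (origin o, unit direction d) is encoded as (d, o x d).
import torch
import torch.nn.functional as F

def plucker_embedding(ray_origins, ray_dirs):
    """ray_origins, ray_dirs: (..., 3) tensors in world coordinates; returns (..., 6)."""
    d = F.normalize(ray_dirs, dim=-1)        # unit ray direction
    m = torch.cross(ray_origins, d, dim=-1)  # moment vector o x d
    return torch.cat([d, m], dim=-1)
```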

## Repository Setup

Create a fresh conda environment and activate it:

```text
conda create -n spad python=3.8 -y
conda activate spad
```

Clone the repository with submodules using the following command:

```text
git clone --recursive https://github.com/yashkant/spad
cd spad
```

If you already have the repository cloned, you can update the submodules using the following command:
```text
git submodule update --init --recursive
```

Install the dependencies and PyTorch (tested with CUDA 11.8):
```
pip install -r requirements.txt
pip install --ignore-installed torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

<!-- (is this needed??) Install the `spad` package:
```text
pip install -e .
``` -->

## Download Files
Download the files from the [dropbox link](https://www.dropbox.com/sh/dk6oubjlt2x7w0h/AAAKExm33IKnVe8mkC4tOzUKa) and place them in the ``data/`` folder.
Ensure that the data paths match the directory structure provided in ``data/README.md``.
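
For orientation only, below is a partial, illustrative sketch of the expected layout, assembled from paths referenced elsewhere in this commit (the checkpoint path in the configs and the assets used above); ``data/README.md`` remains the authoritative reference.

```text
data/
├── v1-5-pruned.ckpt        # Stable Diffusion v1.5 checkpoint (resume_path in configs/)
├── filtered_objaverse.txt  # filtered Objaverse asset list
└── visuals/
    └── readme/
        └── spad_pipeline.png
```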

## Pretrained Models

We provide two pretrained models with the following specifications:
- `spad_two_views`: trained with learning rate 1e-4, relative cameras (between views), no intrinsics, two views, random viewpoints.
- `spad_four_views`: trained with learning rate 2e-5, absolute cameras (between views) with intrinsics, four views, random + orthogonal viewpoints.

You can test these models using:
```
python scripts/inference.py --model <model_name>
```

You can adjust the following hyperparameters for best results:
```
--cfg_scale: 3.0 to 9.0 (default 7.5)
--blob_sigma: 0.2 to 0.7 (default 0.5)
--ddim_steps: 50 to 1000 (default 100)
```
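
For example, assuming the hyperparameters above are accepted as flags of `scripts/inference.py`, a run of the four-view model at the default settings would look like:
```
python scripts/inference.py --model spad_four_views --cfg_scale 7.5 --blob_sigma 0.5 --ddim_steps 100
```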

## Citation
Consider citing our work:
```
@misc{kant2024spad,
      title={SPAD : Spatially Aware Multiview Diffusers},
      author={Yash Kant and Ziyi Wu and Michael Vasilkovsky and Guocheng Qian and Jian Ren and Riza Alp Guler and Bernard Ghanem and Sergey Tulyakov and Igor Gilitschenski and Aliaksandr Siarohin},
      year={2024},
      eprint={2402.05235},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

configs/spad3.yaml

+126
@@ -0,0 +1,126 @@
model:
  base_learning_rate: 0.0001
  resume_path: data/v1-5-pruned.ckpt
  fast_attention: true
  target: nvs.nvs_ldm.TextViews
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: none
    cond_stage_key: none
    conditioning_key: hybrid-mv
    cc_type: timesteps_only_emb
    mv_timesteps_style: same
    oom_fix: true
    image_size: 32
    channels: 4
    cond_stage_trainable: false
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    cfg_conds:
    - txt
    cfg_scales:
    - 7.5
    skip_plucker: false
    skip_epi: false
    scheduler_config:
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps:
        - 100
        cycle_lengths:
        - 10000000000000
        f_start:
        - 1.0e-06
        f_max:
        - 1.0
        f_min:
        - 1.0
    unet_config:
      target: nvs.branch_unet.ManyStreamUnetModel
      params:
        image_size: 32
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_heads: 8
        use_spatial_transformer: true
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: true
        legacy: false
        denoise_channels: 4
        in_feat_channels: 6
        decode_cross: true
        post_init_type: manystream-plucker
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      ckpt_path: null
data:
  target: nvs.z123_dataset.ManyViewsDataModuleFromConfig
  params:
    root_dir: /nfs/data/objaverse/rendering
    batch_size: 40
    views_per_sample_range:
    - 2
    - 2
    batch_size_dict:
      laion: 60
      mv:
        2: 30
        4: 12
    num_workers: 20
    total_view: 4
    laion_batch_prob: 0.1
    setup: polyview-random
    additional_setups: []
    add_text: true
    add_text_tok: true
    only_text_samples: true
    text_type: cap3d_no_3d
    use_internal_filter: meta_filtered
    laion_type: 625K
    mask_init: ones
    mv_datasets:
    - objaverse
    debug: false
    train:
      validation: false
      image_transforms:
        size: 256

configs/spad_four_views.yaml

+89
@@ -0,0 +1,89 @@
model:
  base_learning_rate: 2.0e-05
  resume_path: data/v1-5-pruned.ckpt
  fast_attention: true
  target: spad.spad.SPAD
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: none
    cond_stage_key: none
    conditioning_key: hybrid-mv
    use_intrinsic: true
    use_abs_extrinsics: true
    image_size: 32
    channels: 4
    cond_stage_trainable: false
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    cfg_conds:
    - txt
    cfg_scales:
    - 7.5
    scheduler_config:
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps:
        - 100
        cycle_lengths:
        - 10000000000000
        f_start:
        - 1.0e-06
        f_max:
        - 1.0
        f_min:
        - 1.0

    unet_config:
      target: spad.mv_unet.SPADUnetModel
      params:
        image_size: 32
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_heads: 8
        use_spatial_transformer: true
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: true
        legacy: false

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      ckpt_path: null
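
Both configs follow the latent-diffusion convention of `target`/`params` blocks. A minimal sketch of how such a file is typically turned into a model is shown below; the helper mirrors `ldm.util.instantiate_from_config`, and the repository's own training and inference scripts may wrap this differently.

```python
# Sketch: build the model described by a SPAD config from its `target`/`params` blocks,
# in the style of the latent-diffusion codebase. Entry points here are illustrative.
import importlib
from omegaconf import OmegaConf

def instantiate_from_config(config):
    module_name, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**config.get("params", dict()))

cfg = OmegaConf.load("configs/spad_four_views.yaml")
model = instantiate_from_config(cfg.model)           # e.g. spad.spad.SPAD
model.learning_rate = cfg.model.base_learning_rate   # ldm-style trainers set this separately
```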
