
AutoEncoderKL output tensor dimension mismatch with Input #498

shankartmv opened this issue Jul 11, 2024 · 3 comments

@shankartmv

I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code I use (similar to what is in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb).
from generative.networks.nets import AutoencoderKL

autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)

The error is raised at line 27 of the training loop (the "Train Model" cell, as in the tutorial notebook):

recons_loss = F.l1_loss(reconstruction.float(), images.float())
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
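
For reference, a minimal sketch (not from the original report) that reproduces the mismatch with a random tensor on the model defined above; it assumes the AutoencoderKL forward pass returns (reconstruction, z_mu, z_sigma) as in the GenerativeModels tutorials:

import torch

# Feed a dummy batch with the odd spatial sizes from the report (1225 x 966).
x = torch.randn(1, 3, 1225, 966).to(device)
with torch.no_grad():
    reconstruction, z_mu, z_sigma = autoencoderkl(x)

# The reconstruction comes back smaller than the input (1224 x 964 instead of
# 1225 x 966), which is what makes F.l1_loss(reconstruction, x) fail.
print(x.shape)               # torch.Size([1, 3, 1225, 966])
print(reconstruction.shape)  # torch.Size([1, 3, 1224, 964])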

Using the torchinfo package, I printed the model summary and can see the discrepancy in the upsampling layers.

===================================================================================================================
Layer (type:depth-idx) Input Shape Output Shape Param #
===================================================================================================================
AutoencoderKL [1, 3, 1225, 966] [1, 3, 1224, 964] --
├─Encoder: 1-1 [1, 3, 1225, 966] [1, 8, 306, 241] --
│ └─ModuleList: 2-1 -- -- --
│ │ └─Convolution: 3-1 [1, 3, 1225, 966] [1, 128, 1225, 966] 3,584
│ │ └─ResBlock: 3-2 [1, 128, 1225, 966] [1, 128, 1225, 966] 295,680
│ │ └─Downsample: 3-3 [1, 128, 1225, 966] [1, 128, 612, 483] 147,584
│ │ └─ResBlock: 3-4 [1, 128, 612, 483] [1, 256, 612, 483] 919,040
│ │ └─Downsample: 3-5 [1, 256, 612, 483] [1, 256, 306, 241] 590,080
│ │ └─ResBlock: 3-6 [1, 256, 306, 241] [1, 384, 306, 241] 2,312,576
│ │ └─GroupNorm: 3-7 [1, 384, 306, 241] [1, 384, 306, 241] 768
│ │ └─Convolution: 3-8 [1, 384, 306, 241] [1, 8, 306, 241] 27,656
├─Convolution: 1-2 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-2 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Convolution: 1-3 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-3 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Convolution: 1-4 [1, 8, 306, 241] [1, 8, 306, 241] --
│ └─Conv2d: 2-4 [1, 8, 306, 241] [1, 8, 306, 241] 72
├─Decoder: 1-5 [1, 8, 306, 241] [1, 3, 1224, 964] --
│ └─ModuleList: 2-5 -- -- --
│ │ └─Convolution: 3-9 [1, 8, 306, 241] [1, 384, 306, 241] 28,032
│ │ └─ResBlock: 3-10 [1, 384, 306, 241] [1, 384, 306, 241] 2,656,512
│ │ └─Upsample: 3-11 [1, 384, 306, 241] [1, 384, 612, 482] 1,327,488
│ │ └─ResBlock: 3-12 [1, 384, 612, 482] [1, 256, 612, 482] 1,574,912
│ │ └─Upsample: 3-13 [1, 256, 612, 482] [1, 256, 1224, 964] 590,080
│ │ └─ResBlock: 3-14 [1, 256, 1224, 964] [1, 128, 1224, 964] 476,288
│ │ └─GroupNorm: 3-15 [1, 128, 1224, 964] [1, 128, 1224, 964] 256
│ │ └─Convolution: 3-16 [1, 128, 1224, 964] [1, 3, 1224, 964] 3,459
===================================================================================================================
Total params: 10,954,211
Trainable params: 10,954,211
Non-trainable params: 0
Total mult-adds (Units.TERABYTES): 3.20
===================================================================================================================
Input size (MB): 14.20
Forward/backward pass size (MB): 26803.57
Params size (MB): 43.82
Estimated Total Size (MB): 26861.59
===================================================================================================================

@shankartmv (Author)

After some debugging I found a way to get around this problem. By resizing my images to a standard 3:2 aspect ratio (1024 x 720), the input and output shapes of my AutoencoderKL (as reported by the torchinfo summary) are consistent. I would still like to know the reason behind this error.
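
For illustration only (not part of the original comment), a hedged sketch of that workaround using MONAI's Resize transform; the exact pipeline and sizes are assumptions:

import torch
from monai.transforms import Resize

# Resize each channel-first (C, H, W) image to 1024 x 720; both sizes are
# divisible by 4, so two stride-2 downsamplings followed by two 2x upsamplings
# return to the same spatial size.
resize = Resize(spatial_size=(1024, 720))

image = torch.randn(3, 1225, 966)
resized = resize(image)
print(resized.shape)  # (3, 1024, 720)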

@xmhGit commented Aug 6, 2024

I believe this is caused by downsampling and upsampling on data with a non-power-of-2 dimension.
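
To make the arithmetic concrete (an added illustration, not part of the comment): once an odd size hits a stride-2 downsampling, floor division drops a pixel that the 2x upsamplings cannot recover.

# Trace the width 966 through the two downsampling / upsampling levels:
w = 966
for _ in range(2):
    w = w // 2   # 966 -> 483 -> 241 (483 is odd, so a pixel is lost here)
for _ in range(2):
    w = w * 2    # 241 -> 482 -> 964, no longer the original 966
print(w)         # 964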

@virginiafdez (Contributor)

I think this happens because you have downsamplings that divide the spatial dimensions by 2 followed by upsamplings, so unless you play around with the paddings and strides to make sure things end up with the same size, you might run into errors. I would recommend simply padding your inputs to a size that is consistently divisible by 2 at every downsampling level (see the sketch below).
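
A minimal sketch of that padding approach (my addition, assuming MONAI's DivisiblePad transform): with two stride-2 downsamplings, padding every spatial size to a multiple of 4 keeps the reconstruction the same shape as the input.

import torch
from monai.transforms import DivisiblePad

# Pad each channel-first (C, H, W) image so H and W are divisible by 4
# (two downsampling levels x factor 2 each).
pad = DivisiblePad(k=4)

image = torch.randn(3, 1225, 966)
padded = pad(image)
print(padded.shape)  # (3, 1228, 968)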

virginiafdez self-assigned this on Oct 25, 2024