Replies: 1 comment
-
So, you want to fine-tune from the UNet of InstructPix2Pix and NOT from the base SD UNet. The training script assumes the latter. Your modification won't be needed if you fine-tune from the InstructPix2Pix UNet, as it already has 8 channels in the input stem.
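For example, a quick sanity check (just a sketch; I'm assuming the `timbrooks/instruct-pix2pix` checkpoint here, substitute whichever checkpoint you actually pass to the script):

```python
from diffusers import UNet2DConditionModel

# Assumed checkpoint id for illustration; use the one you pass to the training script.
unet = UNet2DConditionModel.from_pretrained("timbrooks/instruct-pix2pix", subfolder="unet")

print(unet.config.in_channels)    # 8 -> the InstructPix2Pix UNet already takes 8 latent channels
print(unet.conv_in.weight.shape)  # expected torch.Size([320, 8, 3, 3])

# If this prints 8, the conv_in-widening block in train_instruct_pix2pix.py can be
# skipped entirely (e.g. guarded with `if unet.config.in_channels == 4:`) rather than
# patching the slice to [:, :8, :, :].
```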
-
I am currently working with InstructPix2Pix and encountered a RuntimeError related to tensor sizes when trying to copy weights between convolutional layers. The relevant part of my code is as follows:
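(The block below is a sketch of the conv_in setup from diffusers' train_instruct_pix2pix.py, which is where the failing copy happens; minor details may differ from the exact version I'm running.)

```python
import torch
import torch.nn as nn

# unet is loaded earlier in the script, roughly as:
# unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="unet")

# InstructPix2Pix conditions on an extra image, so the script widens the UNet's
# first conv from 4 to 8 input channels and zero-initializes the new channels.
in_channels = 8
out_channels = unet.conv_in.out_channels
unet.register_to_config(in_channels=in_channels)

with torch.no_grad():
    new_conv_in = nn.Conv2d(
        in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
    )
    new_conv_in.weight.zero_()
    # This is the line that raises the RuntimeError for me:
    new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
    unet.conv_in = new_conv_in
```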
In this code, I get the following error:
RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 1
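A standalone reproduction of just the mismatch, using toy tensors with the shapes the error implies (my own illustration, not the training code):

```python
import torch

new_weight = torch.zeros(320, 8, 3, 3)  # weight of the freshly created 8-channel conv_in
old_weight = torch.randn(320, 8, 3, 3)  # the loaded UNet's conv_in.weight, which evidently has 8 input channels

# The 4-channel destination slice cannot receive an 8-channel source:
new_weight[:, :4, :, :].copy_(old_weight)
# RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 1
```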
However, when I modified the weight-copying line to

new_conv_in.weight[:, :8, :, :].copy_(unet.conv_in.weight)

the error was resolved, and the code ran without issues. I'm running training through the train_instruct_pix2pix.py script.

My question is whether this modification is appropriate and will not adversely affect the functioning of the neural network. The original slicing was [:, :4, :, :], which seemed to be intended for a specific reason. By changing it to [:, :8, :, :], am I potentially causing any unintended side effects, particularly concerning how the weights are initialized and used in the network?
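To illustrate what I understand the original [:, :4, :, :] slice to be doing, here is a toy example of my own (not from the script): with a 4-channel source conv, only the first 4 input channels of the widened conv receive pretrained weights, and the extra 4 stay zero, so the added conditioning channels initially contribute nothing.

```python
import torch
import torch.nn as nn

base_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)  # stand-in for a base SD conv_in (4 latent channels)
new_conv = nn.Conv2d(8, 320, kernel_size=3, padding=1)   # widened conv_in for InstructPix2Pix (8 channels)

with torch.no_grad():
    new_conv.weight.zero_()
    new_conv.weight[:, :4, :, :].copy_(base_conv.weight)

print(torch.equal(new_conv.weight[:, :4], base_conv.weight))  # True: pretrained weights preserved
print(new_conv.weight[:, 4:].abs().sum().item())              # 0.0: extra conditioning channels start at zero
```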
Any insights or recommendations regarding this issue would be greatly appreciated.