karolzak commented Feb 1, 2024

Small change introducing the option to provide a path (through the location config) to a model checkpoint whose weights are loaded before starting a new training run. I used this successfully for fine-tuning the LaMa model on my custom dataset.

CC: @senya-ashukha @cohimame

Abbsalehi commented Mar 1, 2024

@karolzak thanks for your good work. I want to fine-tune the model, but I could not find how to do it. Could you please let me know how to use your work? Thanks!

karolzak (Author) commented Mar 4, 2024

> @karolzak thanks for your good work. I want to fine-tune the model, but I could not find how to do it. Could you please let me know how to use your work? Thanks!

Thanks @Abbsalehi!
To prepare for training, just follow the standard steps in the root README. To fine-tune rather than train from scratch, you need to either create a new config or modify one of the existing configs under configs/training/location (depending on which one you are using).
[screenshot: contents of the configs/training/location folder]

More specifically, you need to add a variable like the one below:
load_checkpoint_path: /home/user/lama/big-lama/models/best.ckpt

In my trials, I created a new config called article_dataset.yaml, placed it under configs/training/location, and its content looked like this:

data_root_dir: /home/azureuser/localfiles/image-inpainting/datasets/article-dataset/processed/
out_root_dir: /home/azureuser/localfiles/lama/experiments/
tb_dir: /home/azureuser/localfiles/lama/tb_logs/
load_checkpoint_path: /home/azureuser/localfiles/lama/experiments/azureuser_2024-02-01_12-17-01_train_big-lama_/models/epoch=7-step=2559.ckpt

After you create your new config, you can run something like this to kick off the training:

python3 bin/train.py -cn big-lama location=article_dataset.yaml data.batch_size=10

When this new variable is present in the config, the training script will try to instantiate the model from a previously trained checkpoint. In my trials I used the pretrained big-lama model, which can be downloaded from the LaMa authors' Google Drive.
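Under the hood, the warm start boils down to loading the checkpoint's weights into the freshly built model before trainer.fit is called. Here is a minimal sketch of that idea (illustrative only, not the PR's exact code; it assumes load_checkpoint_path is merged into the root config like the other location keys, and that the .ckpt is a standard PyTorch Lightning checkpoint):

import torch

# `config` is the composed Hydra/OmegaConf config; `training_model` is the
# LightningModule built by make_training_model(config) in bin/train.py
ckpt_path = config.get('load_checkpoint_path', None)
if ckpt_path is not None:
    # Lightning .ckpt files store the model weights under 'state_dict'
    state = torch.load(ckpt_path, map_location='cpu')
    training_model.load_state_dict(state['state_dict'], strict=False)
# ...then continue with trainer.fit(training_model) as usual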
Let me know if something is unclear.

Abbsalehi commented Mar 7, 2024

@karolzak thanks a lot for your helpful response. The README says to provide the directories below; how many images did you put in each folder? I do not have many images.

README:

You need to prepare the following image folders:

$ ls my_dataset
train
val_source # 2000 or more images
visual_test_source # 100 or more images
eval_source # 2000 or more images

karolzak (Author) commented Mar 8, 2024

> @karolzak thanks a lot for your helpful response. The README says to provide the directories below; how many images did you put in each folder? I do not have many images.
>
> README:
>
> You need to prepare the following image folders:
>
> $ ls my_dataset
> train
> val_source # 2000 or more images
> visual_test_source # 100 or more images
> eval_source # 2000 or more images

I followed the recommendation from the docs, but I'm not sure it's strictly required. I don't know whether those numbers come from hardcoded checks or are more of a "for best performance" suggestion. I would suggest trying with however many images you have and seeing what happens.
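If it helps to get started, below is a small hypothetical helper (not part of the LaMa repo; the folder names follow the README excerpt above and the split sizes are placeholders) that shuffles a flat folder of images into that layout:

import random
import shutil
from pathlib import Path

def split_dataset(src_dir, dst_dir, val_n=2000, visual_test_n=100, seed=42):
    # Collect images from a flat source folder and shuffle them deterministically
    src, dst = Path(src_dir), Path(dst_dir)
    images = sorted(src.glob('*.jpg')) + sorted(src.glob('*.png'))
    random.Random(seed).shuffle(images)

    # Folder names follow the root README; eval_source can be prepared the
    # same way (or reuse the val images)
    splits = {
        'val_source': images[:val_n],
        'visual_test_source': images[val_n:val_n + visual_test_n],
        'train': images[val_n + visual_test_n:],
    }
    for name, files in splits.items():
        out = dst / name
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, out / f.name)

# With a small dataset, shrink the split sizes proportionally, e.g.:
split_dataset('raw_images', 'my_dataset', val_n=200, visual_test_n=20)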

Abbsalehi commented Mar 13, 2024

Thanks @karolzak, I was able to start training the model. However, I am wondering whether it is possible to use multiple GPUs to accelerate the process.

Abbsalehi commented Mar 19, 2024

@karolzak could you please help me understand the table below, from one epoch of validation? I do not understand which metric "std" is calculated from. Why are some values NaN? And what are the percentage ranges in the first column?

              fid     lpips                ssim           ssim_fid100_f1
             mean      mean       std      mean       std           mean
0-10%    7.132144  0.025758  0.015533  0.975447  0.019605            NaN
10-20%  22.423028  0.081735  0.020867  0.920162  0.035067            NaN
20-30%  38.135151  0.138476  0.024617  0.863151  0.047236            NaN
30-40%  56.557434  0.196688  0.030477  0.810011  0.065147            NaN
40-50%  76.543753  0.260003  0.037845  0.748839  0.081490            NaN
total   14.605970  0.141385  0.084988  0.862623  0.094825        0.85776

bekhzod-olimov commented

Hey guys, @karolzak, @Abbsalehi! Could you please provide a link for the "e-commerce" dataset described in the blog? The Kaggle link provided does not seem to exist anymore :(

ShiChengxin-0810 commented

Hello, I used this command: python3 bin/train.py -cn big-lama location=123.yaml data.batch_size=4, but it produced some errors. Can you help me?

Log:
[2025-10-24 21:19:15,180][saicinpainting.training.data.datasets][INFO] - Make val dataloader default from /root/lama/data//val
[2025-10-24 21:19:15,184][saicinpainting.training.data.datasets][INFO] - Make val dataloader default from /root/lama/data//visual_test
[2025-10-24 21:19:15,187][__main__][CRITICAL] - Training failed due to Dataloader returned 0 length. Please make sure that it returns at least 1 batch:
Traceback (most recent call last):
  File "bin/train.py", line 64, in main
    trainer.fit(training_model)
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 854, in run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 325, in _reset_eval_dataloader
    num_batches = len(dataloader) if has_len(dataloader) else float('inf')
  File "/root/miniconda3/envs/lama/lib/python3.6/site-packages/pytorch_lightning/utilities/data.py", line 33, in has_len
    raise ValueError('Dataloader returned 0 length. Please make sure that it returns at least 1 batch')
ValueError: Dataloader returned 0 length. Please make sure that it returns at least 1 batch
I want to know where your dataset is located and how to arrange it!
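Judging from the log, the loader looks for {data_root_dir}/val and {data_root_dir}/visual_test, and "Dataloader returned 0 length" means it found no usable samples in the val folder. A sketch of the layout the default dataloaders appear to expect, inferred from the log paths and the README excerpt earlier in this thread (mask naming comes from the repo's data-preparation scripts such as bin/gen_mask_dataset.py):

$ ls /root/lama/data
train          # training images
val            # validation images plus precomputed mask files
visual_test    # small visual-test set plus precomputed mask files

The double slash in /root/lama/data//val only means data_root_dir ends with a trailing slash, which is harmless; the error itself points to an empty or missing val folder.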

