
Can not resume training #6689

Open
1 task done
q4wey opened this issue Jan 23, 2025 · 0 comments
Labels
bug Something isn't working

Comments

q4wey commented Jan 23, 2025

Describe the bug

Whether I use the "Copy parameters" option or load the latest checkpoint, training appears to restart from scratch instead of resuming.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Steps to reproduce:

  1. Load the webui using the start script.
  2. Load the model.
  3. Go into the training tab, name the LoRA "plants", then select the text file for training.
  4. Start training.
  5. Monitor the loss (keep a record; it will be needed later).
  6. Once training has completed and the webui says the LoRA is saved, close the start script terminal.
  7. Run the start script again and load up the webui.
  8. Load the same model.
  9. In the training tab, select "Copy parameters" from plants.
  10. Start training, monitor the loss of the second round, and compare it with the first.

I have found the loss to be nearly identical during both rounds.
If you copy all the contents of the latest checkpoint folder into the LoRA folder, reload the model, and try to resume training, the loss is back to where it was at the beginning. I have also read #1459.
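For context, here is a minimal sketch (not text-generation-webui's actual training code) of how resuming LoRA training is normally wired up with the Hugging Face Trainer and PEFT. The paths, hyperparameters, and placeholder dataset below are assumptions for illustration only:

```python
# Minimal sketch, assuming a previously trained adapter in loras/plants and a
# checkpoint directory inside it; these paths and settings are placeholders.
from datasets import Dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "chuanli11/Llama-3.2-3B-Instruct-uncensored"
base = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Re-attach the previously trained adapter with is_trainable=True so its weights,
# not a freshly initialized adapter, are the starting point for the second round.
model = PeftModel.from_pretrained(base, "loras/plants", is_trainable=True)

# Placeholder standing in for the tokenized "plants" raw-text dataset.
enc = tokenizer(["example training text"])
enc["labels"] = enc["input_ids"]
train_dataset = Dataset.from_dict(enc)

args = TrainingArguments(output_dir="loras/plants", per_device_train_batch_size=1,
                         num_train_epochs=1, report_to=[])
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# resume_from_checkpoint restores the optimizer/scheduler state and the global step;
# if neither the adapter nor the checkpoint is reloaded, training starts over from
# step 0, which matches the loss curves in the logs below.
trainer.train(resume_from_checkpoint="loras/plants/checkpoint-575")
```

The point is that both the saved adapter weights and the checkpoint's optimizer/scheduler state have to be reloaded for a true resume; judging by the nearly identical loss curves in the logs below, that does not seem to happen.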

Screenshot

No response

Logs

17:10:24-032063 INFO     Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"                                                               
17:10:24-130281 INFO     TRANSFORMERS_PARAMS=                                                                                               
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}

/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.88it/s]
17:10:30-080024 INFO     Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored" in 6.05 seconds.                                               
17:10:30-080667 INFO     LOADER: "Transformers"                                                                                             
17:10:30-080986 INFO     TRUNCATION LENGTH: 131072                                                                                          
17:10:30-081290 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                      
17:11:50-316161 INFO     Loading raw text file dataset                                                                                      
17:11:50-997368 INFO     Getting model ready                                                                                                
17:11:50-998010 INFO     Preparing for training                                                                                             
17:11:50-998716 INFO     Creating LoRA model                                                                                                
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
17:11:52-163364 INFO     Starting training                                                                                                  
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:11:52-180231 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                              
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.4
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Step: 159 {'loss': 2.3269, 'grad_norm': 0.33446332812309265, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0527, 'grad_norm': 0.3678511083126068, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9253, 'grad_norm': 0.3888787031173706, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.1225, 'train_samples_per_second': 13.262, 'train_steps_per_second': 0.102, 'train_loss': 2.05030984348721, 'epoch': 2.6530612244897958}
17:14:49-567885 INFO     LoRA training run is completed and saved.                                                                          
17:14:49-940228 INFO     Training complete, saving                                                                                          
17:14:50-146733 INFO     Training complete!                                                                                                 
17:19:15-632075 INFO     Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"                                                               
17:19:15-755350 INFO     TRANSFORMERS_PARAMS=                                                                                               
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}

/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.39it/s]
17:19:18-007344 INFO     Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored" in 2.37 seconds.                                               
17:19:18-007972 INFO     LOADER: "Transformers"                                                                                             
17:19:18-008310 INFO     TRUNCATION LENGTH: 131072                                                                                          
17:19:18-008617 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                      
17:19:29-696583 INFO     Loading raw text file dataset                                                                                      
17:19:30-382638 INFO     Getting model ready                                                                                                
17:19:30-383396 INFO     Preparing for training                                                                                             
17:19:30-384255 INFO     Creating LoRA model                                                                                                
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
17:19:30-627227 INFO     Starting training                                                                                                  
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:19:30-637081 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                              
Step: 159 {'loss': 2.3268, 'grad_norm': 0.3329761028289795, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0525, 'grad_norm': 0.36472970247268677, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9251, 'grad_norm': 0.38644668459892273, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.3279, 'train_samples_per_second': 13.247, 'train_steps_per_second': 0.102, 'train_loss': 2.050247483783298, 'epoch': 2.6530612244897958}
17:22:28-290855 INFO     LoRA training run is completed and saved.                                                                          
17:22:28-601034 INFO     Training complete, saving                                                                                          
17:22:28-788072 INFO     Training complete!                                                                                                 
17:32:19-924805 INFO     Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"                                                               
17:32:20-037364 INFO     TRANSFORMERS_PARAMS=                                                                                               
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}

/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.54it/s]
17:32:22-286983 INFO     Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored" in 2.36 seconds.                                               
17:32:22-287675 INFO     LOADER: "Transformers"                                                                                             
17:32:22-288040 INFO     TRUNCATION LENGTH: 131072                                                                                          
17:32:22-288376 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                      
17:32:31-917983 INFO     Loading raw text file dataset                                                                                      
17:32:32-609262 INFO     Getting model ready                                                                                                
17:32:32-610008 INFO     Preparing for training                                                                                             
17:32:32-610816 INFO     Creating LoRA model                                                                                                
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
17:32:32-854659 INFO     Starting training                                                                                                  
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:32:32-865429 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                              
Step: 159 {'loss': 2.3268, 'grad_norm': 0.3329761028289795, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0525, 'grad_norm': 0.36472970247268677, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9251, 'grad_norm': 0.38644668459892273, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.9808, 'train_samples_per_second': 13.198, 'train_steps_per_second': 0.101, 'train_loss': 2.050247483783298, 'epoch': 2.6530612244897958}
17:35:31-159644 INFO     LoRA training run is completed and saved.                                                                          
17:35:31-313371 INFO     Training complete, saving                                                                                          
17:35:31-496396 INFO     Training complete!                                                                                                 
^C

System Info

GPU: 7900 XT
OS: Fedora 41
q4wey added the bug (Something isn't working) label on Jan 23, 2025