
Can not resume training #6689

Open
1 task done
q4wey opened this issue Jan 23, 2025 · 0 comments
Labels
bug Something isn't working

Comments

q4wey commented Jan 23, 2025

Describe the bug

Whether I use the "Copy parameters" option or load the latest checkpoint, training appears to restart from scratch instead of resuming.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Steps to reproduce:

  1. Load the webui using the start script.
  2. Load the model.
  3. Go into the training tab, name the LoRA "plants", then select the text file for training.
  4. Start training.
  5. Monitor the loss (keep a record; it will be needed later).
  6. Once training has completed and the webui says the LoRA is saved, close the start script terminal.
  7. Run the start script again and load up the webui.
  8. Load the same model.
  9. In the training tab, select "Copy parameters" from plants.
  10. Start training, monitor the loss of the second round, and compare it with the first.

I have found the loss to be nearly identical during both rounds.
If you copy all the contents of the latest checkpoint folder into the LoRA folder, reload the model, and try to resume training, the loss is back to where it was at the beginning. I have also read #1459.
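For context, here is a minimal sketch (not text-generation-webui's actual training code) of how resuming LoRA training is normally wired up with the Hugging Face Trainer and PEFT. The paths, hyperparameters, and placeholder dataset below are assumptions for illustration only:

```python
# Minimal sketch, assuming a previously trained adapter in loras/plants and a
# checkpoint directory inside it; these paths and settings are placeholders.
from datasets import Dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "chuanli11/Llama-3.2-3B-Instruct-uncensored"
base = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Re-attach the previously trained adapter with is_trainable=True so its weights,
# not a freshly initialized adapter, are the starting point for the second round.
model = PeftModel.from_pretrained(base, "loras/plants", is_trainable=True)

# Placeholder standing in for the tokenized "plants" raw-text dataset.
enc = tokenizer(["example training text"])
enc["labels"] = enc["input_ids"]
train_dataset = Dataset.from_dict(enc)

args = TrainingArguments(output_dir="loras/plants", per_device_train_batch_size=1,
                         num_train_epochs=1, report_to=[])
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# resume_from_checkpoint restores the optimizer/scheduler state and the global step;
# if neither the adapter nor the checkpoint is reloaded, training starts over from
# step 0, which matches the loss curves in the logs below.
trainer.train(resume_from_checkpoint="loras/plants/checkpoint-575")
```

The point is that both the saved adapter weights and the checkpoint's optimizer/scheduler state have to be reloaded for a true resume; judging by the nearly identical loss curves in the logs below, that does not seem to happen.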

Screenshot

No response

Logs

17:10:24-032063 INFO     Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"                                                               
17:10:24-130281 INFO     TRANSFORMERS_PARAMS=                                                                                               
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}

/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.88it/s]
17:10:30-080024 INFO     Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored" in 6.05 seconds.                                               
17:10:30-080667 INFO     LOADER: "Transformers"                                                                                             
17:10:30-080986 INFO     TRUNCATION LENGTH: 131072                                                                                          
17:10:30-081290 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                      
17:11:50-316161 INFO     Loading raw text file dataset                                                                                      
17:11:50-997368 INFO     Getting model ready                                                                                                
17:11:50-998010 INFO     Preparing for training                                                                                             
17:11:50-998716 INFO     Creating LoRA model                                                                                                
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
17:11:52-163364 INFO     Starting training                                                                                                  
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:11:52-180231 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                              
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.4
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Step: 159 {'loss': 2.3269, 'grad_norm': 0.33446332812309265, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0527, 'grad_norm': 0.3678511083126068, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9253, 'grad_norm': 0.3888787031173706, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.1225, 'train_samples_per_second': 13.262, 'train_steps_per_second': 0.102, 'train_loss': 2.05030984348721, 'epoch': 2.6530612244897958}
17:14:49-567885 INFO     LoRA training run is completed and saved.                                                                          
17:14:49-940228 INFO     Training complete, saving                                                                                          
17:14:50-146733 INFO     Training complete!                                                                                                 
17:19:15-632075 INFO     Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"                                                               
17:19:15-755350 INFO     TRANSFORMERS_PARAMS=                                                                                               
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}

/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.39it/s]
17:19:18-007344 INFO     Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored" in 2.37 seconds.                                               
17:19:18-007972 INFO     LOADER: "Transformers"                                                                                             
17:19:18-008310 INFO     TRUNCATION LENGTH: 131072                                                                                          
17:19:18-008617 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                      
17:19:29-696583 INFO     Loading raw text file dataset                                                                                      
17:19:30-382638 INFO     Getting model ready                                                                                                
17:19:30-383396 INFO     Preparing for training                                                                                             
17:19:30-384255 INFO     Creating LoRA model                                                                                                
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
17:19:30-627227 INFO     Starting training                                                                                                  
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:19:30-637081 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                              
Step: 159 {'loss': 2.3268, 'grad_norm': 0.3329761028289795, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0525, 'grad_norm': 0.36472970247268677, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9251, 'grad_norm': 0.38644668459892273, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.3279, 'train_samples_per_second': 13.247, 'train_steps_per_second': 0.102, 'train_loss': 2.050247483783298, 'epoch': 2.6530612244897958}
17:22:28-290855 INFO     LoRA training run is completed and saved.                                                                          
17:22:28-601034 INFO     Training complete, saving                                                                                          
17:22:28-788072 INFO     Training complete!                                                                                                 
17:32:19-924805 INFO     Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"                                                               
17:32:20-037364 INFO     TRANSFORMERS_PARAMS=                                                                                               
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}

/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.54it/s]
17:32:22-286983 INFO     Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored" in 2.36 seconds.                                               
17:32:22-287675 INFO     LOADER: "Transformers"                                                                                             
17:32:22-288040 INFO     TRUNCATION LENGTH: 131072                                                                                          
17:32:22-288376 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                      
17:32:31-917983 INFO     Loading raw text file dataset                                                                                      
17:32:32-609262 INFO     Getting model ready                                                                                                
17:32:32-610008 INFO     Preparing for training                                                                                             
17:32:32-610816 INFO     Creating LoRA model                                                                                                
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
17:32:32-854659 INFO     Starting training                                                                                                  
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:32:32-865429 INFO     Log file 'train_dataset_sample.json' created in the 'logs' directory.                                              
Step: 159 {'loss': 2.3268, 'grad_norm': 0.3329761028289795, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0525, 'grad_norm': 0.36472970247268677, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9251, 'grad_norm': 0.38644668459892273, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.9808, 'train_samples_per_second': 13.198, 'train_steps_per_second': 0.101, 'train_loss': 2.050247483783298, 'epoch': 2.6530612244897958}
17:35:31-159644 INFO     LoRA training run is completed and saved.                                                                          
17:35:31-313371 INFO     Training complete, saving                                                                                          
17:35:31-496396 INFO     Training complete!                                                                                                 
^C

System Info

GPU: 7900 XT
OS: Fedora 41
q4wey added the bug (Something isn't working) label on Jan 23, 2025