Describe the bug
Whether I use the "copy parameters" option or copy the checkpoint, training seems to restart from scratch when I try to resume.
Is there an existing issue for this?
I have searched the existing issues
Reproduction
Steps to reproduce:
1. Load the webui using the start script.
2. Load the model.
3. Go into the training tab, name the LoRA (plants), and select the text file for training.
4. Start training.
5. Monitor the loss (keep a record; it will be needed later).
6. Once training has completed and the webui says the LoRA is saved, close the start-script terminal.
7. Run the start script again and load the webui.
8. Load the same model.
9. In the training tab, select "copy parameters" from plants.
10. Start training, monitor the loss of the second round, and compare it with the first.
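For reference, this is roughly what I understand a genuine resume to look like at the 🤗 Trainer level. It is only a minimal sketch, not the webui's actual code: the checkpoint path is hypothetical, and `model` / `train_dataset` stand in for whatever the training tab builds internally.

```python
# Minimal sketch of resuming a Trainer run from a saved checkpoint.
# Illustrative only: not text-generation-webui's code; the paths are
# hypothetical and `model` / `train_dataset` are placeholders.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="loras/plants",   # hypothetical LoRA output folder
    num_train_epochs=3,
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    save_strategy="steps",
    save_steps=160,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# resume_from_checkpoint restores the adapter weights, optimizer/scheduler
# state, and the global step, so the loss curve should continue from where
# it stopped instead of starting over.
trainer.train(resume_from_checkpoint="loras/plants/checkpoint-479")
```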
I have found the loss to be mostly identical during both rounds.
If you copy all the contents of the latest checkpoint folder into the LoRA folder, reload the model, and try to resume training, the loss is back to where it was at the beginning. I have also read #1459.
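For comparison, here is a minimal sketch of the difference between initializing a fresh LoRA and loading a previously saved adapter with the PEFT API directly; the second path is what I would expect "resuming" to do. Folder names are hypothetical and the rank/alpha values are placeholders, not the values from my run, and this is not the webui's actual training code.

```python
# Sketch: fresh LoRA vs. continuing from a saved adapter, using PEFT directly.
# Illustrative only: folder names are hypothetical and rank/alpha are
# placeholders, not the values from the run logged below.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

MODEL_ID = "chuanli11/Llama-3.2-3B-Instruct-uncensored"

# (a) Fresh run: get_peft_model() attaches newly initialized LoRA weights,
#     so training (and the loss) starts from scratch.
fresh = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL_ID),
    LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"]),
)

# (b) Resume: load the adapter weights saved in the LoRA folder and keep them
#     trainable, so further training continues from the previous loss.
resumed = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(MODEL_ID),
    "loras/plants",
    is_trainable=True,
)
```

If only path (a) is ever taken, copying the checkpoint contents into the LoRA folder would not change anything, which would match what I am seeing.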
Screenshot
No response
Logs
17:10:24-032063 INFO Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"
17:10:24-130281 INFO TRANSFORMERS_PARAMS=
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set`do_sample=True` or unset`min_p`.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.88it/s]
17:10:30-080024 INFO Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored"in 6.05 seconds.
17:10:30-080667 INFO LOADER: "Transformers"
17:10:30-080986 INFO TRUNCATION LENGTH: 131072
17:10:30-081290 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
17:11:50-316161 INFO Loading raw text file dataset
17:11:50-997368 INFO Getting model ready
17:11:50-998010 INFO Preparing for training
17:11:50-998716 INFO Creating LoRA model
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
17:11:52-163364 INFO Starting training
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:11:52-180231 INFO Log file 'train_dataset_sample.json' created in the 'logs' directory.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.4
wandb: W&B syncing is set to `offline`in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Step: 159 {'loss': 2.3269, 'grad_norm': 0.33446332812309265, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0527, 'grad_norm': 0.3678511083126068, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9253, 'grad_norm': 0.3888787031173706, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.1225, 'train_samples_per_second': 13.262, 'train_steps_per_second': 0.102, 'train_loss': 2.05030984348721, 'epoch': 2.6530612244897958}
17:14:49-567885 INFO LoRA training run is completed and saved.
17:14:49-940228 INFO Training complete, saving
17:14:50-146733 INFO Training complete!
17:19:15-632075 INFO Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"
17:19:15-755350 INFO TRANSFORMERS_PARAMS=
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set`do_sample=True` or unset`min_p`.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8.39it/s]
17:19:18-007344 INFO Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored"in 2.37 seconds.
17:19:18-007972 INFO LOADER: "Transformers"
17:19:18-008310 INFO TRUNCATION LENGTH: 131072
17:19:18-008617 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
17:19:29-696583 INFO Loading raw text file dataset
17:19:30-382638 INFO Getting model ready
17:19:30-383396 INFO Preparing for training
17:19:30-384255 INFO Creating LoRA model
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
17:19:30-627227 INFO Starting training
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:19:30-637081 INFO Log file 'train_dataset_sample.json' created in the 'logs' directory.
Step: 159 {'loss': 2.3268, 'grad_norm': 0.3329761028289795, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0525, 'grad_norm': 0.36472970247268677, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9251, 'grad_norm': 0.38644668459892273, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.3279, 'train_samples_per_second': 13.247, 'train_steps_per_second': 0.102, 'train_loss': 2.050247483783298, 'epoch': 2.6530612244897958}
17:22:28-290855 INFO LoRA training run is completed and saved.
17:22:28-601034 INFO Training complete, saving
17:22:28-788072 INFO Training complete!
17:32:19-924805 INFO Loading "chuanli11_Llama-3.2-3B-Instruct-uncensored"
17:32:20-037364 INFO TRANSFORMERS_PARAMS=
{'low_cpu_mem_usage': True, 'torch_dtype': torch.bfloat16}
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set`do_sample=True` or unset`min_p`.
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8.54it/s]
17:32:22-286983 INFO Loaded "chuanli11_Llama-3.2-3B-Instruct-uncensored"in 2.36 seconds.
17:32:22-287675 INFO LOADER: "Transformers"
17:32:22-288040 INFO TRUNCATION LENGTH: 131072
17:32:22-288376 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
17:32:31-917983 INFO Loading raw text file dataset
17:32:32-609262 INFO Getting model ready
17:32:32-610008 INFO Preparing for training
17:32:32-610816 INFO Creating LoRA model
/home/fed1/text-generation-webui-2.3/installer_files/env/lib/python3.11/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
17:32:32-854659 INFO Starting training
Training 'llama' model using (q, v) projections
Trainable params: 55,050,240 (1.6846 %), All params: 3,267,800,064 (Model: 3,212,749,824)
17:32:32-865429 INFO Log file 'train_dataset_sample.json' created in the 'logs' directory.
Step: 159 {'loss': 2.3268, 'grad_norm': 0.3329761028289795, 'learning_rate': 0.0003, 'epoch': 0.8163265306122449}
Step: 319 {'loss': 2.0525, 'grad_norm': 0.36472970247268677, 'learning_rate': 0.0003, 'epoch': 1.489795918367347}
Step: 479 {'loss': 1.9251, 'grad_norm': 0.38644668459892273, 'learning_rate': 0.0003, 'epoch': 2.163265306122449}
Step: 575 {'train_runtime': 177.9808, 'train_samples_per_second': 13.198, 'train_steps_per_second': 0.101, 'train_loss': 2.050247483783298, 'epoch': 2.6530612244897958}
17:35:31-159644 INFO LoRA training run is completed and saved.
17:35:31-313371 INFO Training complete, saving
17:35:31-496396 INFO Training complete!
^C
System Info
GPU: AMD Radeon RX 7900 XT
OS: Fedora 41