
Training reached the first checkpoint, then running evaluation on the validation dataset raised: TypeError: iteration over a 0-d tensor #15

Open
Shiquan0304 opened this issue Feb 25, 2025 · 1 comment

Comments


Shiquan0304 commented Feb 25, 2025

Error message:

{'loss': 0.7334, 'grad_norm': 1.2644673585891724, 'learning_rate': 9.999978915433865e-06, 'epoch': 0.02}
{'loss': 0.7381, 'grad_norm': 1.2470377683639526, 'learning_rate': 9.999869007504867e-06, 'epoch': 0.02}
{'loss': 0.7243, 'grad_norm': 1.0461479425430298, 'learning_rate': 9.999665163306944e-06, 'epoch': 0.03}
1%|█▊ | 1000/73212 [2:46:04<199:21:07, 9.94s/it]
[INFO|trainer.py:4021] 2025-02-24 10:27:40,778 >>
***** Running Evaluation *****
[INFO|trainer.py:4023] 2025-02-24 10:27:40,779 >> Num examples = 4000
[INFO|trainer.py:4026] 2025-02-24 10:27:40,779 >> Batch size = 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/data5/360-LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank1]: launch()
[rank1]: File "/data5/360-LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/data5/360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/data5/360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2467, in _inner_training_loop
[rank1]: self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2915, in _maybe_log_save_evaluate
[rank1]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2872, in _evaluate
[rank1]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 3868, in evaluate
[rank1]: output = eval_loop(
[rank1]: ^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 4061, in evaluation_loop
[rank1]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data5/360-LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 174, in prediction_step
[rank1]: loss, generated_tokens, _ = super().prediction_step( # ignore the returned labels (may be truncated)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 278, in prediction_step
[rank1]: return super().prediction_step(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 4279, in prediction_step
[rank1]: loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/data0/miniconda3/envs/360-llama-factory/lib/python3.11/site-packages/torch/_tensor.py", line 1109, in iter
[rank1]: raise TypeError("iteration over a 0-d tensor")
[rank1]: TypeError: iteration over a 0-d tensor
[rank3]: Traceback (most recent call last):
[rank3]: File "/data5/360-LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank3]: launch()
[rank3]: File "/data5/360-LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/data5/360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/data5/360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank3]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
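
The last frames of the traceback show where this comes from: prediction_step unpacks the return value of compute_loss as a tuple (`loss, outputs = self.compute_loss(model, inputs, return_outputs=True)`), and tuple unpacking iterates over the returned value. If compute_loss hands back a bare scalar loss tensor, as can apparently happen here during evaluation under sequence parallelism, the unpacking fails. A minimal sketch reproducing just that failure mode (an illustration of the mechanism, not code from this repo):

```python
import torch

# prediction_step effectively does:
#     loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
# Tuple unpacking calls __iter__ on the returned value. A 0-d (scalar)
# tensor cannot be iterated, which raises exactly the error above.
scalar_loss = torch.tensor(0.7334)  # a reduced, 0-d loss tensor

try:
    loss, outputs = scalar_loss  # fails: Tensor.__iter__ rejects 0-d tensors
except TypeError as e:
    print(e)  # -> iteration over a 0-d tensor
```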

HaoshengZou (Collaborator) commented Feb 26, 2025

Validation with SP (sequence parallelism) is not supported yet; see #2.
With long-context models, we usually validate with vLLM inference plus an LLM judge on the validation set. That said, we do plan to support validation with SP later, as tracked in #2.
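
For anyone wanting a concrete starting point, here is a minimal sketch of that workaround, assuming vLLM offline inference; the model path, prompts, and the judge helper are placeholders, not part of this repo:

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint path and validation prompts, for illustration only.
llm = LLM(model="/path/to/your/sft-checkpoint")
sampling = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["<validation prompt 1>", "<validation prompt 2>"]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    answer = out.outputs[0].text
    # score_with_llm_judge is a hypothetical helper that would ask a judge
    # model (e.g. via an API call) to rate the answer for this prompt.
    # score = score_with_llm_judge(prompt=out.prompt, answer=answer)
    print(answer[:200])
```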
