Describe the bug
GKD training (`swift rlhf --rlhf_type gkd`) of Qwen2.5-VL-3B-Instruct with `--seq_kd true` fails at the first training step with `IndexError: max(): Expected reduction dim 1 to have non-zero size`, raised from `get_rope_index` in the Qwen2.5-VL modeling code. Reproduction script and full log below.
- script:
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
MASTER_PORT=29501 \
NPROC_PER_NODE=4 \
swift rlhf \
--rlhf_type gkd \
--model /root/cache/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3 \
--teacher_model /root/cache/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3 \
--dataset 'modelscope/coco_2014_caption:validation#2000' \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
--train_type full \
--seq_kd true \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 1e-5 \
--freeze_vit true \
--gradient_accumulation_steps 1 \
--eval_steps 50 \
--save_steps 50 \
--save_total_limit 2 \
--deepspeed zero2 \
--attn_impl flash_attn \
--logging_steps 5 \
--max_length 4096 \
--max_completion_length 512 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4 \
--dataset_num_proc 4 \
--save_only_model true
- bug:
Train: 0%| | 0/124 [00:00<?, ?it/s]
[rank2]: Traceback (most recent call last):
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py", line 7, in <module>
[rank2]: rlhf_main()
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank2]: return SwiftRLHF(args).main()
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/base.py", line 49, in main
[rank2]: result = self.run()
[rank2]: ^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/ray/base.py", line 170, in wrapper
[rank2]: return func(self, *args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 209, in run
[rank2]: return self.train(trainer)
[rank2]: ^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 257, in train
[rank2]: trainer.train(trainer.args.resume_from_checkpoint)
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/mixin.py", line 840, in train
[rank2]: res = super().train(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2325, in train
[rank2]: return inner_training_loop(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2674, in _inner_training_loop
[rank2]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/utils.py", line 428, in wrapper
[rank2]: return func(self, *args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 337, in training_step
[rank2]: inputs = self._prepare_batch_inputs(inputs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 279, in _prepare_batch_inputs
[rank2]: batch_encoded = to_device(template.data_collator(batch_encoded_inputs), self.model.device)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1453, in data_collator
[rank2]: res = self._gkd_data_collator(batch, padding_to=padding_to)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1552, in _gkd_data_collator
[rank2]: prompts_res = self._data_collator(prompts_batch, padding_to=padding_to)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 452, in _data_collator
[rank2]: res['position_ids'] = self._get_position_ids(res)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 441, in _get_position_ids
[rank2]: position_ids, _ = get_rope_index(
[rank2]: ^^^^^^^^^^^^^^^
[rank2]: File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1125, in get_rope_index
[rank2]: max_position_ids = position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: IndexError: max(): Expected reduction dim 1 to have non-zero size.
[rank0], [rank1], [rank3]: (identical traceback on every rank, ending in the same IndexError)
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/output/v7-20251212-162037/images
Train: 0%| | 0/124 [00:10<?, ?it/s]
[rank0]:[W1212 16:21:14.892425929 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1212 16:21:15.710000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 158339 closing signal SIGTERM
W1212 16:21:15.711000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 158341 closing signal SIGTERM
W1212 16:21:15.713000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 158342 closing signal SIGTERM
E1212 16:21:16.697000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 158340) of binary: /jizhicfs/leoyizhang/anaconda3/envs/beelinear/bin/python3.12
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-12-12_16:21:15
host : TENCENT64.site
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 158340)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
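For what it's worth, the failing reduction can be reproduced in isolation. This is only a sketch of my reading of the traceback, assuming the collated prompt batch ends up with a zero-length sequence dimension (the tensor shape below is hypothetical, not taken from the actual run):

```python
import torch

# Hypothetical position_ids of shape (3, batch, seq_len) as built inside
# get_rope_index in modeling_qwen2_5_vl.py; seq_len == 0 here, i.e. the
# collated prompt batch is assumed to be empty.
position_ids = torch.zeros(3, 2, 0, dtype=torch.long)

try:
    # Same reduction as in the traceback (modeling_qwen2_5_vl.py:1125).
    position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
except IndexError as e:
    print(e)  # max(): Expected reduction dim 1 to have non-zero size.
```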
Your hardware and system info
CUDA: 12.9
GPU: H20
Key versions (full pip list below): torch 2.8.0, transformers 4.57.1, trl 0.24.0, ms-swift 3.11.0.dev0, deepspeed 0.18.2, flash_attn 2.8.0.post2
Package Version Editable project location
----------------------------- ------------ ------------------------------------------------------------------
abnf 2.2.0
absl-py 2.3.1
accelerate 1.12.0
addict 2.4.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.2
aiosignal 1.4.0
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-doc 0.0.4
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.11.0
attrdict 2.0.1
attrs 25.4.0
av 16.0.1
backoff 2.2.1
binpacking 1.5.2
bitsandbytes 0.48.2
brotli 1.2.0
causal_conv1d 1.5.4
certifi 2025.11.12
cffi 2.0.0
chardet 5.2.0
charset-normalizer 3.4.4
cint 1.0.0
click 8.3.1
contourpy 1.3.3
cpm-kernels 1.0.11
crcmod 1.7
cryptography 46.0.3
cut-cross-entropy 25.1.1
cycler 0.12.1
dacite 1.9.2
datasets 3.6.0
deepspeed 0.18.2
diffusers 0.35.2
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
docstring_parser 0.17.0
einops 0.8.1
fastapi 0.122.0
ffmpy 1.0.0
fickling 0.1.5
filelock 3.20.0
fla-core 0.4.0
flash_attn 2.8.0.post2
flash-linear-attention 0.4.0
fonttools 4.60.1
frozenlist 1.8.0
fsspec 2025.3.0
future 1.0.0
gitdb 4.0.12
GitPython 3.1.45
gql 4.0.0
gradio 6.0.1
gradio_client 2.0.0
graphql-core 3.2.7
graphviz 0.21
groovy 0.1.2
grpcio 1.76.0
h11 0.16.0
hf_transfer 0.1.9
hf-xet 1.2.0
hjson 3.1.0
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.36.0
idna 3.11
importlib_metadata 8.7.0
intervaltree 3.1.0
jieba 0.42.1
Jinja2 3.1.6
jiter 0.12.0
jmespath 0.10.0
joblib 1.5.2
json_repair 0.54.2
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
kaitaistruct 0.11
kiwisolver 1.4.9
liger_kernel 0.6.4
Markdown 3.10
markdown-it-py 4.0.0
MarkupSafe 3.0.3
matplotlib 3.10.7
mdurl 0.1.2
modelscope 1.32.0
mpi4py 4.1.1
mpmath 1.3.0
ms_swift 3.11.0.dev0
msgpack 1.1.2
msgspec 0.20.0
multidict 6.7.0
multiprocess 0.70.16
networkx 3.6
ninja 1.13.0
nltk 3.9.2
numpy 2.3.5
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 13.580.82
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12 12.8.90
omegaconf 2.3.0
openai 2.8.1
orjson 3.11.4
oss2 2.19.1
packaging 25.0
pandas 2.3.3
pdfminer.six 20250506
peft 0.18.0
pillow 12.0.0
pip 25.3
platformdirs 4.5.0
polyfile-weave 0.5.7
propcache 0.4.1
protobuf 6.33.1
psutil 7.1.3
py-cpuinfo 9.0.0
pyarrow 22.0.0
pycparser 2.23
pycryptodome 3.23.0
pydantic 2.12.4
pydantic_core 2.41.5
pydub 0.25.1
Pygments 2.19.2
pyparsing 3.2.5
python-dateutil 2.9.0.post0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.3
qwen-vl-utils 0.0.14
referencing 0.37.0
regex 2025.11.3
requests 2.32.5
rich 14.2.0
rouge 1.0.1
rpds-py 0.29.0
safehttpx 0.1.7
safetensors 0.7.0
scipy 1.16.3
semantic-version 2.10.0
sentencepiece 0.2.1
sentry-sdk 2.46.0
setuptools 80.9.0
shellingham 1.5.4
shtab 1.8.0
simplejson 3.20.2
six 1.17.0
smmap 5.0.2
sniffio 1.3.1
sortedcontainers 2.4.0
starlette 0.50.0
stdlib-list 0.11.1
sympy 1.14.0
tenacity 9.1.2
tensorboard 2.20.0
tensorboard-data-server 0.7.2
tiktoken 0.12.0
tokenizers 0.22.1
tomlkit 0.13.3
torch 2.8.0
torchao 0.13.0
torchaudio 2.8.0+cu128
torchvision 0.23.0+cu128
tqdm 4.67.1
transformers 4.57.1
transformers-stream-generator 0.0.5
triton 3.4.0
trl 0.24.0
typeguard 4.4.4
typer 0.20.0
typing_extensions 4.15.0
typing-inspection 0.4.2
tyro 0.9.35
tzdata 2025.2
unsloth 2025.11.4
unsloth_zoo 2025.11.5
urllib3 2.5.0
uvicorn 0.38.0
wandb 0.23.0
weave 0.52.20
Werkzeug 3.1.3
wheel 0.45.1
xformers 0.0.32.post2
xxhash 3.6.0
yarl 1.22.0
zipp 3.23.0
zstandard 0.25.0
Additional context
In the same environment, SFT training runs without error; only this GKD run fails.
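If it helps triage: since SFT with the same template works, the empty batch presumably comes from the GKD-specific prompt/completion split in `_gkd_data_collator` (the traceback shows the failure on the `prompts_batch` path). A purely hypothetical guard along the lines below (the helper name and its signature are mine, not ms-swift code) would avoid the zero-size reduction; the proper fix likely belongs in the collator that produces the empty prompt batch:

```python
import torch

def safe_position_ids(input_ids: torch.Tensor, get_rope_index_fn):
    """Hypothetical guard, NOT the actual ms-swift implementation.

    input_ids: (batch, seq_len) tensor from the collator.
    get_rope_index_fn: callable wrapping the Qwen2.5-VL get_rope_index.
    """
    if input_ids.shape[-1] == 0:
        # Empty prompt batch: return an empty (3, batch, 0) position_ids
        # tensor instead of hitting the zero-size reduction in get_rope_index.
        return torch.zeros(3, input_ids.shape[0], 0, dtype=torch.long)
    position_ids, _ = get_rope_index_fn(input_ids=input_ids)
    return position_ids
```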