[BUG] Error when training Qwen2.5-VL with GKD #7022

@uyzhang

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

  • script:
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
MASTER_PORT=29501 \
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type gkd \
    --model /root/cache/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3 \
    --teacher_model /root/cache/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/66285546d2b821cf421d4f5eb2576359d3770cd3 \
    --dataset 'modelscope/coco_2014_caption:validation#2000' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --train_type full \
    --seq_kd true \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-5 \
    --freeze_vit true \
    --gradient_accumulation_steps 1 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --deepspeed zero2 \
    --attn_impl flash_attn \
    --logging_steps 5 \
    --max_length 4096 \
    --max_completion_length 512 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --save_only_model true
  • bug:
Train:   0%|          | 0/124 [00:00<?, ?it/s][rank2]: Traceback (most recent call last):
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py", line 7, in <module>
[rank2]:     rlhf_main()
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank2]:     return SwiftRLHF(args).main()
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/base.py", line 49, in main
[rank2]:     result = self.run()
[rank2]:              ^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/ray/base.py", line 170, in wrapper
[rank2]:     return func(self, *args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 209, in run
[rank2]:     return self.train(trainer)
[rank2]:            ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 257, in train
[rank2]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/mixin.py", line 840, in train
[rank2]:     res = super().train(*args, **kwargs)
[rank2]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2325, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2674, in _inner_training_loop
[rank2]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/utils.py", line 428, in wrapper
[rank2]:     return func(self, *args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 337, in training_step
[rank2]:     inputs = self._prepare_batch_inputs(inputs)
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 279, in _prepare_batch_inputs
[rank2]:     batch_encoded = to_device(template.data_collator(batch_encoded_inputs), self.model.device)
[rank2]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1453, in data_collator
[rank2]:     res = self._gkd_data_collator(batch, padding_to=padding_to)
[rank2]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1552, in _gkd_data_collator
[rank2]:     prompts_res = self._data_collator(prompts_batch, padding_to=padding_to)
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 452, in _data_collator
[rank2]:     res['position_ids'] = self._get_position_ids(res)
[rank2]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 441, in _get_position_ids
[rank2]:     position_ids, _ = get_rope_index(
[rank2]:                       ^^^^^^^^^^^^^^^
[rank2]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1125, in get_rope_index
[rank2]:     max_position_ids = position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
[rank2]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: IndexError: max(): Expected reduction dim 1 to have non-zero size.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py", line 7, in <module>
[rank1]:     rlhf_main()
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank1]:     return SwiftRLHF(args).main()
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/base.py", line 49, in main
[rank1]:     result = self.run()
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/ray/base.py", line 170, in wrapper
[rank1]:     return func(self, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 209, in run
[rank1]:     return self.train(trainer)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 257, in train
[rank1]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/mixin.py", line 840, in train
[rank1]:     res = super().train(*args, **kwargs)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2325, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2674, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/utils.py", line 428, in wrapper
[rank1]:     return func(self, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 337, in training_step
[rank1]:     inputs = self._prepare_batch_inputs(inputs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 279, in _prepare_batch_inputs
[rank1]:     batch_encoded = to_device(template.data_collator(batch_encoded_inputs), self.model.device)
[rank1]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1453, in data_collator
[rank1]:     res = self._gkd_data_collator(batch, padding_to=padding_to)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1552, in _gkd_data_collator
[rank1]:     prompts_res = self._data_collator(prompts_batch, padding_to=padding_to)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 452, in _data_collator
[rank1]:     res['position_ids'] = self._get_position_ids(res)
[rank1]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 441, in _get_position_ids
[rank1]:     position_ids, _ = get_rope_index(
[rank1]:                       ^^^^^^^^^^^^^^^
[rank1]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1125, in get_rope_index
[rank1]:     max_position_ids = position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: IndexError: max(): Expected reduction dim 1 to have non-zero size.
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/output/v7-20251212-162037/images
[rank0]: Traceback (most recent call last):
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py", line 7, in <module>
[rank0]:     rlhf_main()
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank0]:     return SwiftRLHF(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/base.py", line 49, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/ray/base.py", line 170, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 209, in run
[rank0]:     return self.train(trainer)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 257, in train
[rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/mixin.py", line 840, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2325, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2674, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/utils.py", line 428, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 337, in training_step
[rank0]:     inputs = self._prepare_batch_inputs(inputs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 279, in _prepare_batch_inputs
[rank0]:     batch_encoded = to_device(template.data_collator(batch_encoded_inputs), self.model.device)
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1453, in data_collator
[rank0]:     res = self._gkd_data_collator(batch, padding_to=padding_to)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1552, in _gkd_data_collator
[rank0]:     prompts_res = self._data_collator(prompts_batch, padding_to=padding_to)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 452, in _data_collator
[rank0]:     res['position_ids'] = self._get_position_ids(res)
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 441, in _get_position_ids
[rank0]:     position_ids, _ = get_rope_index(
[rank0]:                       ^^^^^^^^^^^^^^^
[rank0]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1125, in get_rope_index
[rank0]:     max_position_ids = position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: IndexError: max(): Expected reduction dim 1 to have non-zero size.

Train:   0%|          | 0/124 [00:10<?, ?it/s]
[rank0]:[W1212 16:21:14.892425929 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank3]: Traceback (most recent call last):
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py", line 7, in <module>
[rank3]:     rlhf_main()
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank3]:     return SwiftRLHF(args).main()
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/base.py", line 49, in main
[rank3]:     result = self.run()
[rank3]:              ^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/ray/base.py", line 170, in wrapper
[rank3]:     return func(self, *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 209, in run
[rank3]:     return self.train(trainer)
[rank3]:            ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/train/sft.py", line 257, in train
[rank3]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/mixin.py", line 840, in train
[rank3]:     res = super().train(*args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2325, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/trainer.py", line 2674, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/utils.py", line 428, in wrapper
[rank3]:     return func(self, *args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 337, in training_step
[rank3]:     inputs = self._prepare_batch_inputs(inputs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/trainers/rlhf_trainer/gkd_trainer.py", line 279, in _prepare_batch_inputs
[rank3]:     batch_encoded = to_device(template.data_collator(batch_encoded_inputs), self.model.device)
[rank3]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1453, in data_collator
[rank3]:     res = self._gkd_data_collator(batch, padding_to=padding_to)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/base.py", line 1552, in _gkd_data_collator
[rank3]:     prompts_res = self._data_collator(prompts_batch, padding_to=padding_to)
[rank3]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 452, in _data_collator
[rank3]:     res['position_ids'] = self._get_position_ids(res)
[rank3]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/llm/template/template/qwen.py", line 441, in _get_position_ids
[rank3]:     position_ids, _ = get_rope_index(
[rank3]:                       ^^^^^^^^^^^^^^^
[rank3]:   File "/apdcephfs_private/qy/projects/zy/BeeLinear/transformers/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1125, in get_rope_index
[rank3]:     max_position_ids = position_ids.max(0, keepdim=False)[0].max(-1, keepdim=True)[0]
[rank3]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: IndexError: max(): Expected reduction dim 1 to have non-zero size.
W1212 16:21:15.710000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 158339 closing signal SIGTERM
W1212 16:21:15.711000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 158341 closing signal SIGTERM
W1212 16:21:15.713000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 158342 closing signal SIGTERM
E1212 16:21:16.697000 158266 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 1 (pid: 158340) of binary: /jizhicfs/leoyizhang/anaconda3/envs/beelinear/bin/python3.12
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/jizhicfs/leoyizhang/anaconda3/envs/beelinear/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/apdcephfs_private/qy/projects/zy/BeeLinear/ms-swift/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-12_16:21:15
  host      : TENCENT64.site
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 158340)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
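The four ranks above print the same call stack, so the log is easier to read once it is collapsed to one exception line per group of ranks. The following is a minimal sketch for doing that with the standard library only; the function names `final_errors_by_rank` and `group_ranks_by_error` are hypothetical helpers written for this report, not part of ms-swift or PyTorch, and the sketch assumes every worker line carries the `[rankN]:` prefix seen in the log.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Lines emitted by worker processes look like "[rank2]: <text>".
RANK_LINE = re.compile(r"^\[rank(\d+)\]:\s?(.*)$")
# An exception summary starts at column 0 of the text, e.g. "IndexError: ...".
EXC_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception)\b")


def final_errors_by_rank(log_text: str) -> Dict[int, str]:
    """Map each rank to the last exception summary line it printed."""
    last: Dict[int, str] = {}
    for line in log_text.splitlines():
        match = RANK_LINE.match(line)
        if match is None:
            continue  # launcher output, progress bars, etc.
        rank, body = int(match.group(1)), match.group(2)
        if EXC_LINE.match(body):
            last[rank] = body
    return last


def group_ranks_by_error(log_text: str) -> Dict[str, List[int]]:
    """Group ranks that failed with the same exception summary."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for rank, err in sorted(final_errors_by_rank(log_text).items()):
        groups[err].append(rank)
    return dict(groups)


if __name__ == "__main__":
    sample = "\n".join([
        "[rank2]: Traceback (most recent call last):",
        "[rank2]: IndexError: max(): Expected reduction dim 1 to have non-zero size.",
        "[rank0]: IndexError: max(): Expected reduction dim 1 to have non-zero size.",
    ])
    for err, ranks in group_ranks_by_error(sample).items():
        print(ranks, err)
```

Applied to the log in this report, all four ranks would fall into a single group with the same `IndexError`, which suggests the failure is systematic rather than tied to one shard of the data.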

Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version.

CUDA: 12.9

GPU: H20

Package                       Version      Editable project location
----------------------------- ------------ ------------------------------------------------------------------
abnf                          2.2.0
absl-py                       2.3.1
accelerate                    1.12.0
addict                        2.4.0
aiofiles                      24.1.0
aiohappyeyeballs              2.6.1
aiohttp                       3.13.2
aiosignal                     1.4.0
aliyun-python-sdk-core        2.16.0
aliyun-python-sdk-kms         2.16.5
annotated-doc                 0.0.4
annotated-types               0.7.0
antlr4-python3-runtime        4.9.3
anyio                         4.11.0
attrdict                      2.0.1
attrs                         25.4.0
av                            16.0.1
backoff                       2.2.1
binpacking                    1.5.2
bitsandbytes                  0.48.2
brotli                        1.2.0
causal_conv1d                 1.5.4
certifi                       2025.11.12
cffi                          2.0.0
chardet                       5.2.0
charset-normalizer            3.4.4
cint                          1.0.0
click                         8.3.1
contourpy                     1.3.3
cpm-kernels                   1.0.11
crcmod                        1.7
cryptography                  46.0.3
cut-cross-entropy             25.1.1
cycler                        0.12.1
dacite                        1.9.2
datasets                      3.6.0
deepspeed                     0.18.2
diffusers                     0.35.2
dill                          0.3.8
diskcache                     5.6.3
distro                        1.9.0
docstring_parser              0.17.0
einops                        0.8.1
fastapi                       0.122.0
ffmpy                         1.0.0
fickling                      0.1.5
filelock                      3.20.0
fla-core                      0.4.0
flash_attn                    2.8.0.post2
flash-linear-attention        0.4.0
fonttools                     4.60.1
frozenlist                    1.8.0
fsspec                        2025.3.0
future                        1.0.0
gitdb                         4.0.12
GitPython                     3.1.45
gql                           4.0.0
gradio                        6.0.1
gradio_client                 2.0.0
graphql-core                  3.2.7
graphviz                      0.21
groovy                        0.1.2
grpcio                        1.76.0
h11                           0.16.0
hf_transfer                   0.1.9
hf-xet                        1.2.0
hjson                         3.1.0
httpcore                      1.0.9
httpx                         0.28.1
huggingface-hub               0.36.0
idna                          3.11
importlib_metadata            8.7.0
intervaltree                  3.1.0
jieba                         0.42.1
Jinja2                        3.1.6
jiter                         0.12.0
jmespath                      0.10.0
joblib                        1.5.2
json_repair                   0.54.2
jsonschema                    4.25.1
jsonschema-specifications     2025.9.1
kaitaistruct                  0.11
kiwisolver                    1.4.9
liger_kernel                  0.6.4
Markdown                      3.10
markdown-it-py                4.0.0
MarkupSafe                    3.0.3
matplotlib                    3.10.7
mdurl                         0.1.2
modelscope                    1.32.0
mpi4py                        4.1.1
mpmath                        1.3.0
ms_swift                      3.11.0.dev0
msgpack                       1.1.2
msgspec                       0.20.0
multidict                     6.7.0
multiprocess                  0.70.16
networkx                      3.6
ninja                         1.13.0
nltk                          3.9.2
numpy                         2.3.5
nvidia-cublas-cu12            12.8.4.1
nvidia-cuda-cupti-cu12        12.8.90
nvidia-cuda-nvrtc-cu12        12.8.93
nvidia-cuda-runtime-cu12      12.8.90
nvidia-cudnn-cu12             9.10.2.21
nvidia-cufft-cu12             11.3.3.83
nvidia-cufile-cu12            1.13.1.3
nvidia-curand-cu12            10.3.9.90
nvidia-cusolver-cu12          11.7.3.90
nvidia-cusparse-cu12          12.5.8.93
nvidia-cusparselt-cu12        0.7.1
nvidia-ml-py                  13.580.82
nvidia-nccl-cu12              2.27.3
nvidia-nvjitlink-cu12         12.8.93
nvidia-nvtx-cu12              12.8.90
omegaconf                     2.3.0
openai                        2.8.1
orjson                        3.11.4
oss2                          2.19.1
packaging                     25.0
pandas                        2.3.3
pdfminer.six                  20250506
peft                          0.18.0
pillow                        12.0.0
pip                           25.3
platformdirs                  4.5.0
polyfile-weave                0.5.7
propcache                     0.4.1
protobuf                      6.33.1
psutil                        7.1.3
py-cpuinfo                    9.0.0
pyarrow                       22.0.0
pycparser                     2.23
pycryptodome                  3.23.0
pydantic                      2.12.4
pydantic_core                 2.41.5
pydub                         0.25.1
Pygments                      2.19.2
pyparsing                     3.2.5
python-dateutil               2.9.0.post0
python-multipart              0.0.20
pytz                          2025.2
PyYAML                        6.0.3
qwen-vl-utils                 0.0.14
referencing                   0.37.0
regex                         2025.11.3
requests                      2.32.5
rich                          14.2.0
rouge                         1.0.1
rpds-py                       0.29.0
safehttpx                     0.1.7
safetensors                   0.7.0
scipy                         1.16.3
semantic-version              2.10.0
sentencepiece                 0.2.1
sentry-sdk                    2.46.0
setuptools                    80.9.0
shellingham                   1.5.4
shtab                         1.8.0
simplejson                    3.20.2
six                           1.17.0
smmap                         5.0.2
sniffio                       1.3.1
sortedcontainers              2.4.0
starlette                     0.50.0
stdlib-list                   0.11.1
sympy                         1.14.0
tenacity                      9.1.2
tensorboard                   2.20.0
tensorboard-data-server       0.7.2
tiktoken                      0.12.0
tokenizers                    0.22.1
tomlkit                       0.13.3
torch                         2.8.0
torchao                       0.13.0
torchaudio                    2.8.0+cu128
torchvision                   0.23.0+cu128
tqdm                          4.67.1
transformers                  4.57.1
transformers-stream-generator 0.0.5
triton                        3.4.0
trl                           0.24.0
typeguard                     4.4.4
typer                         0.20.0
typing_extensions             4.15.0
typing-inspection             0.4.2
tyro                          0.9.35
tzdata                        2025.2
unsloth                       2025.11.4
unsloth_zoo                   2025.11.5
urllib3                       2.5.0
uvicorn                       0.38.0
wandb                         0.23.0
weave                         0.52.20
Werkzeug                      3.1.3
wheel                         0.45.1
xformers                      0.0.32.post2
xxhash                        3.6.0
yarl                          1.22.0
zipp                          3.23.0
zstandard                     0.25.0

Additional context
Add any other context about the problem here.
In the same environment, SFT training runs without error.
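Since SFT works and the failing call goes through `_gkd_data_collator` with `prompts_batch`, one thing worth checking is whether the prompt-only part of each sample is empty. In the failing expression, `position_ids.max(0)` drops the first axis and the following `max(-1)` then reduces dim 1, which the error reports as having size zero; that would be consistent with a padded sequence length of 0. The sketch below is a hypothetical pre-check, not ms-swift code: the key name `input_ids` and the list-of-dicts batch layout are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def padded_length(batch: Sequence[Dict[str, List[int]]], key: str = "input_ids") -> int:
    """Length the batch would be padded to: the longest token list, or 0."""
    return max((len(sample.get(key) or []) for sample in batch), default=0)


def empty_indices(batch: Sequence[Dict[str, List[int]]], key: str = "input_ids") -> List[int]:
    """Indices of samples whose token list is missing or empty."""
    return [i for i, sample in enumerate(batch) if not sample.get(key)]


if __name__ == "__main__":
    # Hypothetical prompt batch in which every prompt encoded to zero tokens.
    prompts_batch = [{"input_ids": []}, {"input_ids": []}]
    if padded_length(prompts_batch) == 0:
        print("all prompts are empty; samples:", empty_indices(prompts_batch))
```

A padded length of 0 requires every prompt in the batch to be empty, so if this check fires on the first step for all ranks, the prompt/completion split would be a more likely suspect than individual dataset rows.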

Labels: bug (Something isn't working)
