
failed (exitcode: -8) local_rank: 6 (pid: 58423) of binary: /opt/miniconda/bin/python when running GRPO #254

Open
lmx760581375 opened this issue Feb 9, 2025 · 6 comments


lmx760581375 commented Feb 9, 2025

[2025-02-09 21:54:30,386] [INFO] [config.py:988:print]   zero_enabled ................. True
[2025-02-09 21:54:30,386] [INFO] [config.py:988:print]   zero_force_ds_cpu_optimizer .. True
[2025-02-09 21:54:30,386] [INFO] [config.py:988:print]   zero_optimization_stage ...... 3
[2025-02-09 21:54:30,386] [INFO] [config.py:974:print_user_config]   json = {
    "train_batch_size": 256, 
    "train_micro_batch_size_per_gpu": 2, 
    "gradient_accumulation_steps": 16, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "bf16": {
        "enabled": true
    }, 
    "fp16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
[INFO|trainer.py:2369] 2025-02-09 21:54:30,388 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-02-09 21:54:30,388 >>   Num examples = 72,441
[INFO|trainer.py:2371] 2025-02-09 21:54:30,388 >>   Num Epochs = 1
[INFO|trainer.py:2372] 2025-02-09 21:54:30,388 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2375] 2025-02-09 21:54:30,388 >>   Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:2376] 2025-02-09 21:54:30,388 >>   Gradient Accumulation steps = 16
[INFO|trainer.py:2377] 2025-02-09 21:54:30,388 >>   Total optimization steps = 2,263
[INFO|trainer.py:2378] 2025-02-09 21:54:30,389 >>   Number of trainable parameters = 1,543,714,304
  0%|                                                                                                           | 0/2263 [00:00<?, ?it/s][WARNING|logging.py:328] 2025-02-09 21:54:30,733 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[the same `use_cache` warning was emitted once per rank (8 in total); duplicates omitted]
[2025-02-09 21:54:30,850] [WARNING] [parameter_offload.py:87:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58417 closing signal SIGTERM
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58418 closing signal SIGTERM
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58419 closing signal SIGTERM
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58420 closing signal SIGTERM
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58421 closing signal SIGTERM
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58422 closing signal SIGTERM
W0209 21:54:31.391000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 58424 closing signal SIGTERM
E0209 21:54:32.372000 139777117394752 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -8) local_rank: 6 (pid: 58423) of binary: /opt/miniconda/bin/python
Traceback (most recent call last):
  File "/opt/miniconda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/miniconda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/opt/miniconda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
src/open_r1/grpo.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-09_21:54:31
  host      : CENT64.site
  rank      : 6 (local_rank: 6)
  exitcode  : -8 (pid: 58423)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 58423
=====================================================

lmx760581375 commented Feb 9, 2025

When I tried to get more information about this exception, I added the following code to grpo.py:

# Enable anomaly detection for the backward pass
torch.autograd.set_detect_anomaly(True)

# Enable floating-point / CUDA error detection
torch._C._cuda_set_sync_debug_mode(1)  # surface asynchronous CUDA errors
torch._C._debug_set_autodiff_subgraph_inlining(False)  # disable inlining to preserve stack traces
torch.autograd.set_detect_anomaly(True)  # catch backward-pass anomalies
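
Not something enabled for the run below, but a possibly useful addition for pinning down the SIGFPE: Python's built-in faulthandler module prints a Python-level traceback when the interpreter receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), which can reveal the crashing frame before the elastic agent tears the rank down. A minimal sketch:

import faulthandler

faulthandler.enable(all_threads=True)  # dump tracebacks for all threads on fatal signals such as SIGFPE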

After adding the anomaly-detection code, the log became:

/opt/miniconda/lib/python3.10/site-packages/transformers/trainer.py:2431: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  tr_loss = torch.tensor(0.0).to(args.device)
  0%|                                                                                                           | 0/2263 [00:00<?, ?it/s]/opt/miniconda/lib/python3.10/site-packages/transformers/trainer.py:3608: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  return data.to(**kwargs)
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:1846: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  return torch.tensor(token, device=device, dtype=torch.long)
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:1881: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  eos_token_tensor is not None
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:1890: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  if eos_token_tensor is not None and (
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:2043: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  and torch.sum(inputs_tensor[:, -1] == generation_config._pad_token_tensor) > 0
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:2447: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(device)
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:2451: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  if this_peer_finished_flag.item() == 0.0:
/opt/miniconda/lib/python3.10/site-packages/transformers/generation/utils.py:388: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  or (is_torchdynamo_compiling() or cache_position[-1] >= input_ids.shape[1])  # Exception 3
[each of the synchronizing-CUDA-operation warnings above was emitted on every rank; duplicates omitted]
[WARNING|logging.py:328] 2025-02-09 22:57:46,111 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[repeated once per rank; duplicates omitted]
/opt/miniconda/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:614: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  if attention_mask is not None and (attention_mask == 0.0).any():
[repeated once per rank; duplicates omitted]
[2025-02-09 22:57:46,233] [WARNING] [parameter_offload.py:87:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
/opt/miniconda/lib/python3.10/site-packages/transformers/modeling_flash_attention_utils.py:330: UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
  max_length_q is not None or (query_length != 1 and not (torch.diff(position_ids, dim=-1) >= 0).all())
[repeated four times; duplicates omitted]
[2025-02-09 22:57:46,742] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106008 closing signal SIGTERM
[2025-02-09 22:57:46,743] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106009 closing signal SIGTERM
[2025-02-09 22:57:46,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106010 closing signal SIGTERM
[2025-02-09 22:57:46,749] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106011 closing signal SIGTERM
[2025-02-09 22:57:46,759] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106012 closing signal SIGTERM
[2025-02-09 22:57:46,762] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106013 closing signal SIGTERM
[2025-02-09 22:57:46,762] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 106014 closing signal SIGTERM
[2025-02-09 22:57:47,692] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 7 (pid: 106015) of binary: /opt/miniconda/bin/python
Traceback (most recent call last):
  File "/opt/miniconda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/miniconda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/opt/miniconda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/miniconda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
src/open_r1/grpo.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-09_22:57:46
  host      : 
  rank      : 7 (local_rank: 7)
  exitcode  : -8 (pid: 106015)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 106015
======================================================

lmx760581375 (Author) commented:

This is probably due to a CUDA/torch environment issue. I ran it on an A100 with CUDA 11.8 and GRPO worked fine using zero2.yaml, but when running on an H20 with CUDA 12.2 the problem above occurs.

lmx760581375 (Author) commented:

I solved this problem by upgrading GCC to 12
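
For anyone comparing setups in this thread, here is a minimal sketch (a hypothetical helper script, not part of open-r1) that prints the toolchain versions being discussed:

import platform
import subprocess

import torch

print("python:", platform.python_version())
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)  # CUDA version this torch build was compiled against
gcc_line = subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout.splitlines()[0]
print("gcc   :", gcc_line)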


troy12x commented Feb 12, 2025

I have the same problem on CUDA 12.4. gcc version:

(openr1) ubuntu@192-222-54-131:/open-quasar$ gcc --version
gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0

Still the same issue? (8x H100)


albertbou92 commented Feb 12, 2025

I have the same problem.

CUDA 12.4
gcc 12.3.0
8x H100

Any solutions?

The code works if I only use 1 GPU, though; when I try with more than one I get the error.


troy12x commented Feb 12, 2025

[quoting albertbou92's comment above]

same !!!!
