Your current environment
The output of `python collect_env.py`:
INFO 05-21 07:52:06 [__init__.py:239] Automatically detected platform rocm.
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version : version 3.31.6
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+gitf717b2a
Is debug build : False
CUDA used to build PyTorch : N/A
ROCM used to build PyTorch : 6.3.42133-1b9c17779
==============================
Python Environment
==============================
Python version : 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.15.0-138-generic-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version : Could not collect
cuDNN version : Could not collect
HIP runtime version : 6.3.42133
MIOpen runtime version : 3.3.0
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7K62 48-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+gitf717b2a
[pip3] torchvision==0.21.0+7af6987
[pip3] transformers==4.51.3
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : 6.3.42133-1b9c17779
Neuron SDK Version : N/A
vLLM Version : 0.8.6.dev3+gd60b5a337 (git sha: d60b5a337)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 15 15 15 72 72 72 72
GPU1 15 0 15 15 72 72 72 72
GPU2 15 15 0 15 72 72 72 72
GPU3 15 15 15 0 72 72 72 72
GPU4 72 72 72 72 0 15 15 15
GPU5 72 72 72 72 15 0 15 15
GPU6 72 72 72 72 15 15 0 15
GPU7 72 72 72 72 15 15 15 0
================================= Hops between two GPUs ==================================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 1 1 1 3 3 3 3
GPU1 1 0 1 1 3 3 3 3
GPU2 1 1 0 1 3 3 3 3
GPU3 1 1 1 0 3 3 3 3
GPU4 3 3 3 3 0 1 1 1
GPU5 3 3 3 3 1 0 1 1
GPU6 3 3 3 3 1 1 0 1
GPU7 3 3 3 3 1 1 1 0
=============================== Link Type between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 XGMI XGMI XGMI PCIE PCIE PCIE PCIE
GPU1 XGMI 0 XGMI XGMI PCIE PCIE PCIE PCIE
GPU2 XGMI XGMI 0 XGMI PCIE PCIE PCIE PCIE
GPU3 XGMI XGMI XGMI 0 PCIE PCIE PCIE PCIE
GPU4 PCIE PCIE PCIE PCIE 0 XGMI XGMI XGMI
GPU5 PCIE PCIE PCIE PCIE XGMI 0 XGMI XGMI
GPU6 PCIE PCIE PCIE PCIE XGMI XGMI 0 XGMI
GPU7 PCIE PCIE PCIE PCIE XGMI XGMI XGMI 0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 0
GPU[3] : (Topology) Numa Node: 0
GPU[3] : (Topology) Numa Affinity: 0
GPU[4] : (Topology) Numa Node: 1
GPU[4] : (Topology) Numa Affinity: 1
GPU[5] : (Topology) Numa Node: 1
GPU[5] : (Topology) Numa Affinity: 1
GPU[6] : (Topology) Numa Node: 1
GPU[6] : (Topology) Numa Affinity: 1
GPU[7] : (Topology) Numa Node: 1
GPU[7] : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================
==============================
Environment Variables
==============================
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Serving DeepSeek-R1-Channel-INT8 with tensor parallelism 2 and pipeline parallelism 8 over Ray on ROCm fails during weight loading with `KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'` raised in `deepseek_v2.py::load_weights` (full logs below, followed by a small checkpoint-inspection sketch).

Docker image: rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513

Command:
vllm serve /app/model/DeepSeek-R1-Channel-INT8/ --tensor-parallel-size 2 -pp 8 --gpu-memory-utilization 0.98 --max-model-len 128 --served_model_name qwen-base --port 8004 --distributed-executor-backend ray --cpu-offload-gb 48

Logs:
INFO 05-20 09:54:11 [__init__.py:239] Automatically detected platform rocm.
INFO 05-20 09:54:23 [api_server.py:1042] vLLM API server version 0.8.6.dev3+gd60b5a337
INFO 05-20 09:54:23 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='/app/model/DeepSeek-R1-Channel-INT8/', config='', host=None, port=8004, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/app/model/DeepSeek-R1-Channel-INT8/', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=128, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, enable_prompt_embeds=False, served_model_name=['qwen-base'], disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend='ray', pipeline_parallel_size=8, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.98, swap_space=128.0, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=48.0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, 
kv_events_config=None, additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f5f22168fe0>)
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
INFO 05-20 09:54:41 [config.py:753] This model supports multiple tasks: {'generate', 'embed', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 05-20 09:54:41 [arg_utils.py:1561] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 05-20 09:54:41 [config.py:1861] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-20 09:54:41 [config.py:1826] Disabled the custom all-reduce kernel because it is not working correctly on multi AMD MI250.
INFO 05-20 09:54:41 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.6.dev3+gd60b5a337) with config: model='/app/model/DeepSeek-R1-Channel-INT8/', speculative_config=None, tokenizer='/app/model/DeepSeek-R1-Channel-INT8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=8, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=qwen-base, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
2025-05-20 09:54:41,876 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.41.18.46:6377...
2025-05-20 09:54:41,897 INFO worker.py:1852 -- Connected to Ray cluster.
INFO 05-20 09:54:41 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 05-20 09:54:42 [ray_distributed_executor.py:176] use_ray_spmd_worker: False
(pid=1746) INFO 05-20 09:54:45 [__init__.py:239] Automatically detected platform rocm.
INFO 05-20 09:54:47 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 05-20 09:54:47 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 05-20 09:54:47 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 05-20 09:54:48 [rocm.py:165] Using Triton MLA backend.
(raylet, ip=10.41.18.47) [2025-05-20 09:54:51,828 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92095 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=2331) INFO 05-20 09:54:56 [rocm.py:165] Using Triton MLA backend.
(pid=13024, ip=10.41.18.47) INFO 05-20 09:54:46 [__init__.py:239] Automatically detected platform rocm. [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(raylet, ip=10.41.18.47) [2025-05-20 09:55:01,845 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92839 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=13019, ip=10.41.18.47) [W520 09:55:08.496288850 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(RayWorkerWrapper pid=13019, ip=10.41.18.47) INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
(RayWorkerWrapper pid=13019, ip=10.41.18.47) INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=13024, ip=10.41.18.47) INFO 05-20 09:54:58 [rocm.py:165] Using Triton MLA backend. [repeated 14x across cluster]
INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_d261451d'), local_subscribe_addr='ipc:///tmp/e7f31f48-6195-4be4-8168-6e2273b02182', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_8bb029b6'), local_subscribe_addr='ipc:///tmp/a51e627d-0f1e-4198-9bfe-6b966288be22', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 05-20 09:55:10 [parallel_state.py:1004] rank 0 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/...
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [parallel_state.py:1004] rank 2 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 0
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/...
INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(raylet, ip=10.41.18.47) [2025-05-20 09:55:11,859 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92839 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] return func(*args, **kwargs)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] self.model_runner.load_model()
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] loaded_weights = model.load_weights(
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] param = params_dict[name]
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ~~~~~~~~~~~^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
(RayWorkerWrapper pid=2331) INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1 [repeated 29x across cluster]
(RayWorkerWrapper pid=2331) INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 29x across cluster]
(RayWorkerWrapper pid=13024, ip=10.41.18.47) INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_3fdc3186'), local_subscribe_addr='ipc:///tmp/a7ad559d-6d6d-4f14-a50a-126b74c1af9b', remote_subscribe_addr=None, remote_addr_ipv6=False) [repeated 6x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [parallel_state.py:1004] rank 13 in world size 16 is assigned as DP rank 0, PP rank 6, TP rank 1 [repeated 14x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/... [repeated 14x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable [repeated 14x across cluster]
(RayWorkerWrapper pid=13021, ip=10.41.18.47) [W520 09:55:08.461267040 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3 [repeated 7x across cluster]
(raylet, ip=10.41.18.47) [2025-05-20 09:55:21,875 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92836 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 3/163 [00:00<00:07, 21.18it/s]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:00<00:07, 20.97it/s]
ERROR 05-20 09:55:25 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
ERROR 05-20 09:55:25 [worker_base.py:620] Traceback (most recent call last):
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 05-20 09:55:25 [worker_base.py:620] return run_method(self, method, args, kwargs)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
ERROR 05-20 09:55:25 [worker_base.py:620] return func(*args, **kwargs)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620] self.model_runner.load_model()
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 05-20 09:55:25 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620] loaded_weights = model.load_weights(
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
ERROR 05-20 09:55:25 [worker_base.py:620] param = params_dict[name]
ERROR 05-20 09:55:25 [worker_base.py:620] ~~~~~~~~~~~^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/bin/vllm", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 53, in main
[rank0]: args.dispatch_function(args)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
[rank0]: uvloop.run(run_server(args))
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
[rank0]: return __asyncio.run(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]: return await main
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1077, in run_server
[rank0]: async with build_async_engine_client(args) as engine_client:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]: async with build_async_engine_client_from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]: engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 661, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 616, in __init__
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 286, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 396, in _init_workers_ray
[rank0]: self._run_workers("load_model",
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]: self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]: raise e
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]: return run_method(self, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] Traceback (most recent call last): [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] return run_method(self, method, args, kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] return func(*args, **kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model [repeated 24x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] self.model_runner.load_model() [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] loaded_weights = model.load_weights( [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] param = params_dict[name] [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ~~~~~~~~~~~^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] KeyError: 'model.layers.53.mlp.experts.w2_weight_scale' [repeated 8x across cluster]
INFO 05-20 09:55:27 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
Loading safetensors checkpoint shards: 5% Completed | 8/163 [00:01<00:36, 4.22it/s]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W520 09:55:27.268350623 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
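For anyone triaging this: the KeyError is raised in `load_weights` when the remapped checkpoint tensor name `model.layers.3.mlp.experts.w2_weight_scale` is looked up in the model's parameter dict, so the mismatch appears to be between the per-channel INT8 scales shipped in the checkpoint and the parameters the fused-MoE quantized layer registers on this ROCm build. A quick way to see the checkpoint side is to dump the `*_weight_scale` tensor names from the safetensors shards. Below is a minimal sketch, assuming only the model path from the command above; the script and its variable names are illustrative and not part of vLLM:

```python
# Hypothetical diagnostic, not part of vLLM: list the weight-scale tensors the
# checkpoint shards actually contain and check whether the exact name the
# loader failed on exists under any spelling.
import glob
import os

from safetensors import safe_open  # from the `safetensors` package

ckpt_dir = "/app/model/DeepSeek-R1-Channel-INT8"        # path from the serve command
missing = "model.layers.3.mlp.experts.w2_weight_scale"  # name from the traceback

scale_names = set()
for shard in sorted(glob.glob(os.path.join(ckpt_dir, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "weight_scale" in name:
                scale_names.add(name)

print(f"{len(scale_names)} *_weight_scale tensors found in the checkpoint")
print("fused loader-side name present:", missing in scale_names)

# Show how layer 3's expert scales are actually named in the checkpoint, to
# compare against the fused `experts.w2_weight_scale` name the loader builds
# after remapping expert weights.
for name in sorted(n for n in scale_names if ".layers.3." in n)[:10]:
    print(name)
```

If the checkpoint does list per-expert scales for layer 3 while the fused name is absent from the model's parameters, the problem likely lies in how this build's quantization method registers fused-MoE scale parameters rather than in the checkpoint itself.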