
[Bug]: An error occurred when deploying DeepSeek-R1-Channel-INT8 on two MI250*8 machines using Ray #555

@zinodynn

Your current environment

The output of python collect_env.py:
INFO 05-21 07:52:06 [__init__.py:239] Automatically detected platform rocm.
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version                : version 3.31.6
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+gitf717b2a
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 6.3.42133-1b9c17779

==============================
      Python Environment
==============================
Python version               : 3.12.10 (main, Apr  9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-138-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 6.3.42133
MIOpen runtime version       : 3.3.0
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7K62 48-Core Processor
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   48
Socket(s):                            2
Stepping:                             0
Frequency boost:                      enabled
CPU max MHz:                          2600.0000
CPU min MHz:                          1500.0000
BogoMIPS:                             5200.23
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization:                       AMD-V
L1d cache:                            3 MiB (96 instances)
L1i cache:                            3 MiB (96 instances)
L2 cache:                             48 MiB (96 instances)
L3 cache:                             384 MiB (24 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-47,96-143
NUMA node1 CPU(s):                    48-95,144-191
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+gitf717b2a
[pip3] torchvision==0.21.0+7af6987
[pip3] transformers==4.51.3
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 6.3.42133-1b9c17779
Neuron SDK Version           : N/A
vLLM Version                 : 0.8.6.dev3+gd60b5a337 (git sha: d60b5a337)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           15           15           72           72           72           72           
GPU1   15           0            15           15           72           72           72           72           
GPU2   15           15           0            15           72           72           72           72           
GPU3   15           15           15           0            72           72           72           72           
GPU4   72           72           72           72           0            15           15           15           
GPU5   72           72           72           72           15           0            15           15           
GPU6   72           72           72           72           15           15           0            15           
GPU7   72           72           72           72           15           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            3            3            3            3            
GPU1   1            0            1            1            3            3            3            3            
GPU2   1            1            0            1            3            3            3            3            
GPU3   1            1            1            0            3            3            3            3            
GPU4   3            3            3            3            0            1            1            1            
GPU5   3            3            3            3            1            0            1            1            
GPU6   3            3            3            3            1            1            0            1            
GPU7   3            3            3            3            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         PCIE         PCIE         PCIE         PCIE         
GPU1   XGMI         0            XGMI         XGMI         PCIE         PCIE         PCIE         PCIE         
GPU2   XGMI         XGMI         0            XGMI         PCIE         PCIE         PCIE         PCIE         
GPU3   XGMI         XGMI         XGMI         0            PCIE         PCIE         PCIE         PCIE         
GPU4   PCIE         PCIE         PCIE         PCIE         0            XGMI         XGMI         XGMI         
GPU5   PCIE         PCIE         PCIE         PCIE         XGMI         0            XGMI         XGMI         
GPU6   PCIE         PCIE         PCIE         PCIE         XGMI         XGMI         0            XGMI         
GPU7   PCIE         PCIE         PCIE         PCIE         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: 0
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: 0
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: 0
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: 0
GPU[4]          : (Topology) Numa Node: 1
GPU[4]          : (Topology) Numa Affinity: 1
GPU[5]          : (Topology) Numa Node: 1
GPU[5]          : (Topology) Numa Affinity: 1
GPU[6]          : (Topology) Numa Node: 1
GPU[6]          : (Topology) Numa Affinity: 1
GPU[7]          : (Topology) Numa Node: 1
GPU[7]          : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

docker image:

rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513

command:

vllm serve /app/model/DeepSeek-R1-Channel-INT8/ --tensor-parallel-size 2 -pp 8 --gpu-memory-utilization 0.98 --max-model-len 128 --served_model_name qwen-base --port 8004 --distributed-executor-backend ray --cpu-offload-gb 48
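
Before looking at the logs, a quick way to confirm that the two-node Ray cluster actually exposes the 16 GPUs needed for tensor_parallel_size=2 x pipeline_parallel_size=8 is a minimal Python sketch like the one below (the cluster address is copied from the log line "Connecting to existing Ray cluster at address: 10.41.18.46:6377"; everything else is illustrative):

# Sketch: verify the Ray cluster resources before launching `vllm serve`.
# The address is taken from the connection log below; adjust as needed.
import ray

ray.init(address="10.41.18.46:6377")

resources = ray.cluster_resources()
print("GPUs visible to Ray:", resources.get("GPU", 0))  # expect 16 (2 nodes x 8)

for node in ray.nodes():
    print(node["NodeManagerAddress"],
          "alive:", node["Alive"],
          "GPUs:", node["Resources"].get("GPU", 0))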

logs:

INFO 05-20 09:54:11 [__init__.py:239] Automatically detected platform rocm.
INFO 05-20 09:54:23 [api_server.py:1042] vLLM API server version 0.8.6.dev3+gd60b5a337
INFO 05-20 09:54:23 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='/app/model/DeepSeek-R1-Channel-INT8/', config='', host=None, port=8004, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/app/model/DeepSeek-R1-Channel-INT8/', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=128, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, enable_prompt_embeds=False, served_model_name=['qwen-base'], disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend='ray', pipeline_parallel_size=8, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.98, swap_space=128.0, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=48.0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, 
kv_events_config=None, additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f5f22168fe0>)
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
INFO 05-20 09:54:41 [config.py:753] This model supports multiple tasks: {'generate', 'embed', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 05-20 09:54:41 [arg_utils.py:1561] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 05-20 09:54:41 [config.py:1861] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-20 09:54:41 [config.py:1826] Disabled the custom all-reduce kernel because it is not working correctly on multi AMD MI250.
INFO 05-20 09:54:41 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.6.dev3+gd60b5a337) with config: model='/app/model/DeepSeek-R1-Channel-INT8/', speculative_config=None, tokenizer='/app/model/DeepSeek-R1-Channel-INT8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=8, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=qwen-base, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
2025-05-20 09:54:41,876 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.41.18.46:6377...
2025-05-20 09:54:41,897 INFO worker.py:1852 -- Connected to Ray cluster.
INFO 05-20 09:54:41 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 05-20 09:54:42 [ray_distributed_executor.py:176] use_ray_spmd_worker: False
(pid=1746) INFO 05-20 09:54:45 [__init__.py:239] Automatically detected platform rocm.
INFO 05-20 09:54:47 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 05-20 09:54:47 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 05-20 09:54:47 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 05-20 09:54:48 [rocm.py:165] Using Triton MLA backend.
(raylet, ip=10.41.18.47) [2025-05-20 09:54:51,828 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92095 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=2331) INFO 05-20 09:54:56 [rocm.py:165] Using Triton MLA backend.
(pid=13024, ip=10.41.18.47) INFO 05-20 09:54:46 [__init__.py:239] Automatically detected platform rocm. [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(raylet, ip=10.41.18.47) [2025-05-20 09:55:01,845 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92839 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=13019, ip=10.41.18.47) [W520 09:55:08.496288850 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(RayWorkerWrapper pid=13019, ip=10.41.18.47) INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
(RayWorkerWrapper pid=13019, ip=10.41.18.47) INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=13024, ip=10.41.18.47) INFO 05-20 09:54:58 [rocm.py:165] Using Triton MLA backend. [repeated 14x across cluster]
INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_d261451d'), local_subscribe_addr='ipc:///tmp/e7f31f48-6195-4be4-8168-6e2273b02182', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_8bb029b6'), local_subscribe_addr='ipc:///tmp/a51e627d-0f1e-4198-9bfe-6b966288be22', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 05-20 09:55:10 [parallel_state.py:1004] rank 0 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/...
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [parallel_state.py:1004] rank 2 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 0
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/...
INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(raylet, ip=10.41.18.47) [2025-05-20 09:55:11,859 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92839 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     self.model_runner.load_model()
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     self.model = get_model(vllm_config=self.vllm_config)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     return loader.load_model(vllm_config=vllm_config)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     loaded_weights = model.load_weights(
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]                      ^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]     param = params_dict[name]
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620]             ~~~~~~~~~~~^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
(RayWorkerWrapper pid=2331) INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1 [repeated 29x across cluster]
(RayWorkerWrapper pid=2331) INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 29x across cluster]
(RayWorkerWrapper pid=13024, ip=10.41.18.47) INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_3fdc3186'), local_subscribe_addr='ipc:///tmp/a7ad559d-6d6d-4f14-a50a-126b74c1af9b', remote_subscribe_addr=None, remote_addr_ipv6=False) [repeated 6x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [parallel_state.py:1004] rank 13 in world size 16 is assigned as DP rank 0, PP rank 6, TP rank 1 [repeated 14x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/... [repeated 14x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable [repeated 14x across cluster]
(RayWorkerWrapper pid=13021, ip=10.41.18.47) [W520 09:55:08.461267040 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3 [repeated 7x across cluster]
(raylet, ip=10.41.18.47) [2025-05-20 09:55:21,875 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92836 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
Loading safetensors checkpoint shards:   0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 3/163 [00:00<00:07, 21.18it/s]
Loading safetensors checkpoint shards:   4% Completed | 6/163 [00:00<00:07, 20.97it/s]
ERROR 05-20 09:55:25 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
ERROR 05-20 09:55:25 [worker_base.py:620] Traceback (most recent call last):
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 05-20 09:55:25 [worker_base.py:620]     return run_method(self, method, args, kwargs)
ERROR 05-20 09:55:25 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
ERROR 05-20 09:55:25 [worker_base.py:620]     return func(*args, **kwargs)
ERROR 05-20 09:55:25 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620]     self.model_runner.load_model()
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 05-20 09:55:25 [worker_base.py:620]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 05-20 09:55:25 [worker_base.py:620]     return loader.load_model(vllm_config=vllm_config)
ERROR 05-20 09:55:25 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620]     loaded_weights = model.load_weights(
ERROR 05-20 09:55:25 [worker_base.py:620]                      ^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
ERROR 05-20 09:55:25 [worker_base.py:620]     param = params_dict[name]
ERROR 05-20 09:55:25 [worker_base.py:620]             ~~~~~~~~~~~^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/bin/vllm", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 53, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
[rank0]:     uvloop.run(run_server(args))
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
[rank0]:     return __asyncio.run(
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]:     return runner.run(main)
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]:     return self._loop.run_until_complete(task)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]:     return await main
[rank0]:            ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1077, in run_server
[rank0]:     async with build_async_engine_client(args) as engine_client:
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]:     async with build_async_engine_client_from_engine_args(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]:     engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 661, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 616, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 286, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 396, in _init_workers_ray
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]:     self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]:     return run_method(self, method, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
[rank0]:     loaded_weights = model.load_weights(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
[rank0]:     param = params_dict[name]
[rank0]:             ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] Traceback (most recent call last): [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     return run_method(self, method, args, kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     return func(*args, **kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model [repeated 24x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     self.model_runner.load_model() [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     self.model = get_model(vllm_config=self.vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     return loader.load_model(vllm_config=vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     loaded_weights = model.load_weights( [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]                      ^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]     param = params_dict[name] [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620]             ~~~~~~~~~~~^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] KeyError: 'model.layers.53.mlp.experts.w2_weight_scale' [repeated 8x across cluster]
INFO 05-20 09:55:27 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
Loading safetensors checkpoint shards:   5% Completed | 8/163 [00:01<00:36,  4.22it/s]

/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W520 09:55:27.268350623 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
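
Every failing worker hits the same kind of KeyError while mapping checkpoint tensors (model.layers.3.mlp.experts.w2_weight_scale on this node, model.layers.53.mlp.experts.w2_weight_scale on 10.41.18.47), and the engine config above reports quantization=None, so the per-channel INT8 scales stored in the checkpoint do not seem to be recognized. To see what the checkpoint itself declares, here is a minimal diagnostic sketch in Python; it assumes the standard Hugging Face layout with config.json and model.safetensors.index.json next to the weights and is not vLLM-specific:

# Sketch: print the quantization config declared by the checkpoint and list the
# weight_scale tensors it stores for one of the failing layers, so they can be
# compared against the fused parameter name in the KeyError above.
import json
from pathlib import Path

model_dir = Path("/app/model/DeepSeek-R1-Channel-INT8")

config = json.loads((model_dir / "config.json").read_text())
print("quantization_config:", json.dumps(config.get("quantization_config"), indent=2))

index = json.loads((model_dir / "model.safetensors.index.json").read_text())
scale_keys = sorted(k for k in index["weight_map"]
                    if "weight_scale" in k and k.startswith("model.layers.3.mlp.experts"))
print(len(scale_keys), "weight_scale tensors under model.layers.3.mlp.experts, e.g.:")
for key in scale_keys[:8]:
    print(" ", key)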

