Your current environment
The output of `python collect_env.py`:
INFO 05-21 07:52:06 [__init__.py:239] Automatically detected platform rocm.
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version : version 3.31.6
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+gitf717b2a
Is debug build : False
CUDA used to build PyTorch : N/A
ROCM used to build PyTorch : 6.3.42133-1b9c17779
==============================
Python Environment
==============================
Python version : 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.15.0-138-generic-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version : Could not collect
cuDNN version : Could not collect
HIP runtime version : 6.3.42133
MIOpen runtime version : 3.3.0
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7K62 48-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+gitf717b2a
[pip3] torchvision==0.21.0+7af6987
[pip3] transformers==4.51.3
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : 6.3.42133-1b9c17779
Neuron SDK Version : N/A
vLLM Version : 0.8.6.dev3+gd60b5a337 (git sha: d60b5a337)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 15 15 15 72 72 72 72
GPU1 15 0 15 15 72 72 72 72
GPU2 15 15 0 15 72 72 72 72
GPU3 15 15 15 0 72 72 72 72
GPU4 72 72 72 72 0 15 15 15
GPU5 72 72 72 72 15 0 15 15
GPU6 72 72 72 72 15 15 0 15
GPU7 72 72 72 72 15 15 15 0
================================= Hops between two GPUs ==================================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 1 1 1 3 3 3 3
GPU1 1 0 1 1 3 3 3 3
GPU2 1 1 0 1 3 3 3 3
GPU3 1 1 1 0 3 3 3 3
GPU4 3 3 3 3 0 1 1 1
GPU5 3 3 3 3 1 0 1 1
GPU6 3 3 3 3 1 1 0 1
GPU7 3 3 3 3 1 1 1 0
=============================== Link Type between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 XGMI XGMI XGMI PCIE PCIE PCIE PCIE
GPU1 XGMI 0 XGMI XGMI PCIE PCIE PCIE PCIE
GPU2 XGMI XGMI 0 XGMI PCIE PCIE PCIE PCIE
GPU3 XGMI XGMI XGMI 0 PCIE PCIE PCIE PCIE
GPU4 PCIE PCIE PCIE PCIE 0 XGMI XGMI XGMI
GPU5 PCIE PCIE PCIE PCIE XGMI 0 XGMI XGMI
GPU6 PCIE PCIE PCIE PCIE XGMI XGMI 0 XGMI
GPU7 PCIE PCIE PCIE PCIE XGMI XGMI XGMI 0
======================================= Numa Nodes =======================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 0
GPU[2] : (Topology) Numa Node: 0
GPU[2] : (Topology) Numa Affinity: 0
GPU[3] : (Topology) Numa Node: 0
GPU[3] : (Topology) Numa Affinity: 0
GPU[4] : (Topology) Numa Node: 1
GPU[4] : (Topology) Numa Affinity: 1
GPU[5] : (Topology) Numa Node: 1
GPU[5] : (Topology) Numa Affinity: 1
GPU[6] : (Topology) Numa Node: 1
GPU[6] : (Topology) Numa Affinity: 1
GPU[7] : (Topology) Numa Node: 1
GPU[7] : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================
==============================
Environment Variables
==============================
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
PYTORCH_ROCM_ARCH=gfx942
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Serving DeepSeek-R1-Channel-INT8 with tensor parallelism 2 and pipeline parallelism 8 over Ray on ROCm fails during weight loading with `KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'` raised in `deepseek_v2.py::load_weights` (full logs below, followed by a small checkpoint-inspection sketch).

Docker image: rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513

Command:
vllm serve /app/model/DeepSeek-R1-Channel-INT8/ --tensor-parallel-size 2 -pp 8 --gpu-memory-utilization 0.98 --max-model-len 128 --served_model_name qwen-base --port 8004 --distributed-executor-backend ray --cpu-offload-gb 48

Logs:
INFO 05-20 09:54:11 [__init__.py:239] Automatically detected platform rocm.
INFO 05-20 09:54:23 [api_server.py:1042] vLLM API server version 0.8.6.dev3+gd60b5a337
INFO 05-20 09:54:23 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='/app/model/DeepSeek-R1-Channel-INT8/', config='', host=None, port=8004, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/app/model/DeepSeek-R1-Channel-INT8/', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=128, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, enable_prompt_embeds=False, served_model_name=['qwen-base'], disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend='ray', pipeline_parallel_size=8, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.98, swap_space=128.0, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=48.0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, 
kv_events_config=None, additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f5f22168fe0>)
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
INFO 05-20 09:54:41 [config.py:753] This model supports multiple tasks: {'generate', 'embed', 'reward', 'classify', 'score'}. Defaulting to 'generate'.
INFO 05-20 09:54:41 [arg_utils.py:1561] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 05-20 09:54:41 [config.py:1861] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 05-20 09:54:41 [config.py:1826] Disabled the custom all-reduce kernel because it is not working correctly on multi AMD MI250.
INFO 05-20 09:54:41 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.6.dev3+gd60b5a337) with config: model='/app/model/DeepSeek-R1-Channel-INT8/', speculative_config=None, tokenizer='/app/model/DeepSeek-R1-Channel-INT8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=8, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=qwen-base, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
2025-05-20 09:54:41,876 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.41.18.46:6377...
2025-05-20 09:54:41,897 INFO worker.py:1852 -- Connected to Ray cluster.
INFO 05-20 09:54:41 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 05-20 09:54:42 [ray_distributed_executor.py:176] use_ray_spmd_worker: False
(pid=1746) INFO 05-20 09:54:45 [__init__.py:239] Automatically detected platform rocm.
INFO 05-20 09:54:47 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 05-20 09:54:47 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 05-20 09:54:47 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 05-20 09:54:48 [rocm.py:165] Using Triton MLA backend.
(raylet, ip=10.41.18.47) [2025-05-20 09:54:51,828 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92095 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=2331) INFO 05-20 09:54:56 [rocm.py:165] Using Triton MLA backend.
(pid=13024, ip=10.41.18.47) INFO 05-20 09:54:46 [__init__.py:239] Automatically detected platform rocm. [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(raylet, ip=10.41.18.47) [2025-05-20 09:55:01,845 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92839 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=13019, ip=10.41.18.47) [W520 09:55:08.496288850 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(RayWorkerWrapper pid=13019, ip=10.41.18.47) INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
(RayWorkerWrapper pid=13019, ip=10.41.18.47) INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=13024, ip=10.41.18.47) INFO 05-20 09:54:58 [rocm.py:165] Using Triton MLA backend. [repeated 14x across cluster]
INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_d261451d'), local_subscribe_addr='ipc:///tmp/e7f31f48-6195-4be4-8168-6e2273b02182', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_8bb029b6'), local_subscribe_addr='ipc:///tmp/a51e627d-0f1e-4198-9bfe-6b966288be22', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1
INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 05-20 09:55:10 [parallel_state.py:1004] rank 0 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/...
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [parallel_state.py:1004] rank 2 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 0
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/...
INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(RayWorkerWrapper pid=2272) INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(raylet, ip=10.41.18.47) [2025-05-20 09:55:11,859 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92839 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] return func(*args, **kwargs)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] self.model_runner.load_model()
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] loaded_weights = model.load_weights(
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] param = params_dict[name]
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] ~~~~~~~~~~~^^^^^^
(RayWorkerWrapper pid=2612) ERROR 05-20 09:55:21 [worker_base.py:620] KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
(RayWorkerWrapper pid=2331) INFO 05-20 09:55:09 [utils.py:1205] Found nccl from library librccl.so.1 [repeated 29x across cluster]
(RayWorkerWrapper pid=2331) INFO 05-20 09:55:09 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 29x across cluster]
(RayWorkerWrapper pid=13024, ip=10.41.18.47) INFO 05-20 09:55:09 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_3fdc3186'), local_subscribe_addr='ipc:///tmp/a7ad559d-6d6d-4f14-a50a-126b74c1af9b', remote_subscribe_addr=None, remote_addr_ipv6=False) [repeated 6x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [parallel_state.py:1004] rank 13 in world size 16 is assigned as DP rank 0, PP rank 6, TP rank 1 [repeated 14x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [model_runner.py:1161] Starting to load model /app/model/DeepSeek-R1-Channel-INT8/... [repeated 14x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) INFO 05-20 09:55:10 [utils.py:106] Hidden layers were unevenly partitioned: [7,7,8,8,8,8,8,7]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable [repeated 14x across cluster]
(RayWorkerWrapper pid=13021, ip=10.41.18.47) [W520 09:55:08.461267040 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3 [repeated 7x across cluster]
(raylet, ip=10.41.18.47) [2025-05-20 09:55:21,875 E 112 147] (raylet) file_system_monitor.cc:116: /tmp/ray/session_2025-05-20_09-18-04_347477_13 is over 95% full, available space: 6.92836 GB; capacity: 1756.86 GB. Object creation will fail if spilling is required.
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 3/163 [00:00<00:07, 21.18it/s]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:00<00:07, 20.97it/s]
ERROR 05-20 09:55:25 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
ERROR 05-20 09:55:25 [worker_base.py:620] Traceback (most recent call last):
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 05-20 09:55:25 [worker_base.py:620] return run_method(self, method, args, kwargs)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
ERROR 05-20 09:55:25 [worker_base.py:620] return func(*args, **kwargs)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620] self.model_runner.load_model()
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 05-20 09:55:25 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
ERROR 05-20 09:55:25 [worker_base.py:620] loaded_weights = model.load_weights(
ERROR 05-20 09:55:25 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
ERROR 05-20 09:55:25 [worker_base.py:620] param = params_dict[name]
ERROR 05-20 09:55:25 [worker_base.py:620] ~~~~~~~~~~~^^^^^^
ERROR 05-20 09:55:25 [worker_base.py:620] KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/bin/vllm", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 53, in main
[rank0]: args.dispatch_function(args)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
[rank0]: uvloop.run(run_server(args))
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
[rank0]: return __asyncio.run(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]: return await main
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1077, in run_server
[rank0]: async with build_async_engine_client(args) as engine_client:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]: async with build_async_engine_client_from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]: engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 661, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 616, in __init__
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 286, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 396, in _init_workers_ray
[rank0]: self._run_workers("load_model",
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]: self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]: raise e
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]: return run_method(self, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 231, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1164, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.3.mlp.experts.w2_weight_scale'
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] Traceback (most recent call last): [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] return run_method(self, method, args, kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2645, in run_method [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] return func(*args, **kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 456, in load_model [repeated 24x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] self.model_runner.load_model() [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] loaded_weights = model.load_weights( [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 791, in load_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] param = params_dict[name] [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] ~~~~~~~~~~~^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=13023, ip=10.41.18.47) ERROR 05-20 09:55:26 [worker_base.py:620] KeyError: 'model.layers.53.mlp.experts.w2_weight_scale' [repeated 8x across cluster]
INFO 05-20 09:55:27 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
Loading safetensors checkpoint shards: 5% Completed | 8/163 [00:01<00:36, 4.22it/s]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W520 09:55:27.268350623 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
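For anyone triaging this: the KeyError is raised in `load_weights` when the remapped checkpoint tensor name `model.layers.3.mlp.experts.w2_weight_scale` is looked up in the model's parameter dict, so the mismatch appears to be between the per-channel INT8 scales shipped in the checkpoint and the parameters the fused-MoE quantized layer registers on this ROCm build. A quick way to see the checkpoint side is to dump the `*_weight_scale` tensor names from the safetensors shards. Below is a minimal sketch, assuming only the model path from the command above; the script and its variable names are illustrative and not part of vLLM:

```python
# Hypothetical diagnostic, not part of vLLM: list the weight-scale tensors the
# checkpoint shards actually contain and check whether the exact name the
# loader failed on exists under any spelling.
import glob
import os

from safetensors import safe_open  # from the `safetensors` package

ckpt_dir = "/app/model/DeepSeek-R1-Channel-INT8"        # path from the serve command
missing = "model.layers.3.mlp.experts.w2_weight_scale"  # name from the traceback

scale_names = set()
for shard in sorted(glob.glob(os.path.join(ckpt_dir, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            if "weight_scale" in name:
                scale_names.add(name)

print(f"{len(scale_names)} *_weight_scale tensors found in the checkpoint")
print("fused loader-side name present:", missing in scale_names)

# Show how layer 3's expert scales are actually named in the checkpoint, to
# compare against the fused `experts.w2_weight_scale` name the loader builds
# after remapping expert weights.
for name in sorted(n for n in scale_names if ".layers.3." in n)[:10]:
    print(name)
```

If the checkpoint does list per-expert scales for layer 3 while the fused name is absent from the model's parameters, the problem likely lies in how this build's quantization method registers fused-MoE scale parameters rather than in the checkpoint itself.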