
[Bug]: Speculative decoding reports errors when loading the target model with distributed inference (vLLM's official Ray setup) #12841

Closed
Neo9061 opened this issue Feb 6, 2025 · 11 comments · Fixed by #13269
Labels
bug Something isn't working

Comments

@Neo9061

Neo9061 commented Feb 6, 2025

Your current environment

  1. vLLM: the latest vllm/vllm-openai container.
  2. Ray cluster: two nodes of 8 x H100. I set up the cluster following the official distributed-inference guide (a rough sketch of the launch commands follows this list), confirmed that ray status looked healthy, and ran the Python script below inside the container.
  3. I am doing offline distributed inference with Ray, per the official instructions.
  4. I can start the model with distributed inference via the LLM class successfully when speculative decoding is disabled.
  5. As soon as I pass the speculative decoding arguments to the LLM class, it reports the error below.
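
For reference, the cluster launch looked roughly like the following (paraphrasing vLLM's official run_cluster.sh guide from memory; the IPs and paths are placeholders, so double-check the guide for the exact arguments):

# On the head node:
bash run_cluster.sh vllm/vllm-openai <head_node_ip> --head /path/to/hf_cache -e VLLM_HOST_IP=<head_node_ip>
# On the worker node:
bash run_cluster.sh vllm/vllm-openai <head_node_ip> --worker /path/to/hf_cache -e VLLM_HOST_IP=<worker_node_ip>
# Then, inside the head container, confirm all 16 GPUs are registered:
ray status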

🐛 Describe the bug

The reproducible code is below. Note: if you remove the speculative decoding arguments, the model loads successfully.

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0, max_tokens=512)

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=16,
    # Speculative decoding arguments; removing these lets the model load successfully.
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    speculative_draft_tensor_parallel_size=1,
    num_speculative_tokens=5,
    disable_log_stats=False,
    enforce_eager=True,
    trust_remote_code=True,
)


import time
time.sleep(5)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


The error message it gave is the following.

INFO 02-06 08:03:56 model_runner.py:1111] Starting to load model /root/models/llama-3-3-70b/Llama-3.1-70B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/30 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   3% Completed | 1/30 [00:08<04:08,  8.58s/it]
Loading safetensors checkpoint shards:   7% Completed | 2/30 [00:17<04:10,  8.94s/it]
Loading safetensors checkpoint shards:  10% Completed | 3/30 [00:25<03:43,  8.29s/it]
Loading safetensors checkpoint shards:  13% Completed | 4/30 [00:35<03:57,  9.15s/it]
Loading safetensors checkpoint shards:  17% Completed | 5/30 [00:43<03:38,  8.75s/it]
Loading safetensors checkpoint shards:  20% Completed | 6/30 [00:53<03:38,  9.09s/it]
Loading safetensors checkpoint shards:  23% Completed | 7/30 [01:00<03:12,  8.38s/it]
Loading safetensors checkpoint shards:  27% Completed | 8/30 [01:08<03:03,  8.36s/it]
Loading safetensors checkpoint shards:  30% Completed | 9/30 [01:19<03:09,  9.00s/it]
Loading safetensors checkpoint shards:  33% Completed | 10/30 [01:27<02:56,  8.80s/it]
Loading safetensors checkpoint shards:  37% Completed | 11/30 [01:37<02:53,  9.11s/it]
Loading safetensors checkpoint shards:  40% Completed | 12/30 [01:45<02:37,  8.78s/it]
Loading safetensors checkpoint shards:  43% Completed | 13/30 [01:46<01:48,  6.35s/it]
Loading safetensors checkpoint shards:  47% Completed | 14/30 [01:56<01:58,  7.43s/it]
Loading safetensors checkpoint shards:  50% Completed | 15/30 [02:05<01:58,  7.92s/it]
Loading safetensors checkpoint shards:  53% Completed | 16/30 [02:12<01:48,  7.76s/it]
Loading safetensors checkpoint shards:  57% Completed | 17/30 [02:20<01:40,  7.71s/it]
Loading safetensors checkpoint shards:  60% Completed | 18/30 [02:29<01:38,  8.21s/it]
Loading safetensors checkpoint shards:  63% Completed | 19/30 [02:39<01:36,  8.75s/it]
Loading safetensors checkpoint shards:  67% Completed | 20/30 [02:48<01:29,  8.91s/it]
Loading safetensors checkpoint shards:  70% Completed | 21/30 [02:58<01:21,  9.02s/it]
Loading safetensors checkpoint shards:  73% Completed | 22/30 [03:07<01:12,  9.09s/it]
Loading safetensors checkpoint shards:  77% Completed | 23/30 [03:15<01:00,  8.71s/it]
Loading safetensors checkpoint shards:  80% Completed | 24/30 [03:24<00:53,  8.89s/it]
Loading safetensors checkpoint shards:  83% Completed | 25/30 [03:33<00:45,  9.01s/it]
Loading safetensors checkpoint shards:  87% Completed | 26/30 [03:41<00:33,  8.50s/it]
Loading safetensors checkpoint shards:  90% Completed | 27/30 [03:48<00:24,  8.06s/it]
Loading safetensors checkpoint shards:  93% Completed | 28/30 [03:57<00:16,  8.38s/it]
Loading safetensors checkpoint shards:  97% Completed | 29/30 [04:06<00:08,  8.55s/it]
Loading safetensors checkpoint shards: 100% Completed | 30/30 [04:14<00:00,  8.41s/it]
Loading safetensors checkpoint shards: 100% Completed | 30/30 [04:14<00:00,  8.48s/it]

INFO 02-06 08:08:11 model_runner.py:1116] Loading model weights took 8.4050 GB
INFO 02-06 08:08:11 model_runner.py:1111] Starting to load model /root/models/eagle-head/Llama-3.2-1B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:17<00:00, 17.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:17<00:00, 17.94s/it]

(RayWorkerWrapper pid=14182, ip=172.31.18.145) INFO 02-06 08:08:12 model_runner.py:1116] Loading model weights took 8.4050 GB
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572] Error executing method 'init_device'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572] Traceback (most recent call last):
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 564, in execute_method
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]     return run_method(target, method, args, kwargs)
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]     return func(*args, **kwargs)
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]   File "/usr/local/lib/python3.12/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 329, in init_device
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]     self.spec_decode_sampler.init_tensors(self.rank,
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/spec_decode_base_sampler.py", line 54, in init_tensors
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]     self.num_accepted_tokens = torch.tensor(0,
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]                                ^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=14182, ip=172.31.18.145) ERROR 02-06 08:08:12 worker_base.py:572]
(RayWorkerWrapper pid=14179, ip=172.31.18.145) WARNING 02-06 08:03:55 custom_all_reduce.py:82] Custom allreduce is disabled because this process group spans across nodes. [repeated 14x across cluster]
(RayWorkerWrapper pid=14179, ip=172.31.18.145) INFO 02-06 08:03:55 model_runner.py:1111] Starting to load model /root/models/llama-3-3-70b/Llama-3.1-70B-Instruct... [repeated 14x across cluster]
(RayWorkerWrapper pid=2384) INFO 02-06 08:08:12 spec_decode_worker.py:339] [Speculative Decoding] Use MQA scorer for scoring proposals.
INFO 02-06 08:08:29 model_runner.py:1116] Loading model weights took 2.3185 GB
INFO 02-06 08:08:29 spec_decode_worker.py:339] [Speculative Decoding] Use MQA scorer for scoring proposals.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/vllm-workspace/run2.py", line 10, in <module>
[rank0]:     llm = LLM(
[rank0]:           ^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1039, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py", line 240, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 482, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 271, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 260, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 49, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 88, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 343, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 469, in _run_workers
[rank0]:     ray_worker_outputs = ray.get(ray_worker_outputs)
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2772, in get
[rank0]:     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
[rank0]:                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 919, in get_objects
[rank0]:     raise value.as_instanceof_cause()
[rank0]: ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=14177, ip=172.31.18.145, actor_id=ac50405ac53b631dcd36345f12000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x70f0dc40c200>)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 573, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 564, in execute_method
[rank0]:     return run_method(target, method, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 329, in init_device
[rank0]:     self.spec_decode_sampler.init_tensors(self.rank,
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/spec_decode_base_sampler.py", line 54, in init_tensors
[rank0]:     self.num_accepted_tokens = torch.tensor(0,
[rank0]:                                ^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: invalid device ordinal
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                      
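
For context on the error itself: "invalid device ordinal" means the process requested a CUDA device index that is not visible to it. A minimal standalone sketch (the rank value here is a hypothetical worker rank on the second node) that raises the same error on a host exposing 8 GPUs:

import torch

rank = 8  # hypothetical: global rank of a worker on the second 8-GPU node
# Only cuda:0 .. cuda:7 exist locally, so this raises
# "RuntimeError: CUDA error: invalid device ordinal".
t = torch.tensor(0, dtype=torch.long, device=f"cuda:{rank}")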

@Neo9061
Author

Neo9061 commented Feb 8, 2025

Hi @zhuohan123 @youkaichao, can I get some help please? This is for deploying DeepSeek with speculative decoding across multiple nodes due to limited per-node resources. The code example above just reproduces the issue with smaller models.

@youkaichao
Member

cc @LiuXiaoxuanPKU for spec decode.

@Neo9061
Author

Neo9061 commented Feb 10, 2025

Hi @LiuXiaoxuanPKU, any chance you can help find the root cause?

I also want to use #12915 to deploy DeepSeek with speculative decoding, but I only have two nodes of 8 x H100, so I need this bug resolved first. The minimal reproducible code is provided above, and the issue is not specific to the DeepSeek model.

@yangsijia-serena
Contributor

Hi, I encountered the same error when trying to run #12915 with the DeepSeek-R1 model.

My env:

  1. vLLM: built from [Model][Speculative Decoding] Add EAGLE-style MTP module reference code for DeepSeek-R1 #12915
  2. Ray cluster: two nodes of 8 x H20; ray status looks correct.

The DeepSeek-R1 model runs fine without the MTP feature in the same distributed environment.

My startup command is

python3 \
  -m vllm.entrypoints.openai.api_server \
  --disable-log-requests \
  --gpu-memory-utilization 0.99 \
  --quantization fp8 \
  --max-model-len 131072 \
  --seed 0 \
  --tensor-parallel-size 16 \
  --swap-space 0 \
  --model {model-path} \
  --trust-remote-code \
  --num-speculative-tokens 2 \
  --speculative-model DeepSeekV3MTP \
  --enforce-eager

The error detail is

ERROR 02-12 18:39:01 engine.py:389] ray::RayWorkerWrapper.execute_method() (pid=5115, ip=192.168.12.6, actor_id=2c462b3a885dc17cfeeb289d03000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7fd2ce6f9330>)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/worker/worker_base.py", line 577, in execute_method
ERROR 02-12 18:39:01 engine.py:389]     raise e
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/worker/worker_base.py", line 568, in execute_method
ERROR 02-12 18:39:01 engine.py:389]     return run_method(target, method, args, kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/utils.py", line 2220, in run_method
ERROR 02-12 18:39:01 engine.py:389]     return func(*args, **kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/spec_decode/spec_decode_worker.py", line 331, in init_device
ERROR 02-12 18:39:01 engine.py:389]     self.spec_decode_sampler.init_tensors(self.rank,
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/model_executor/layers/spec_decode_base_sampler.py", line 56, in init_tensors
ERROR 02-12 18:39:01 engine.py:389]     self.num_accepted_tokens = torch.tensor(0,
ERROR 02-12 18:39:01 engine.py:389] RuntimeError: CUDA error: invalid device ordinal
ERROR 02-12 18:39:01 engine.py:389] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-12 18:39:01 engine.py:389] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-12 18:39:01 engine.py:389] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 02-12 18:39:01 engine.py:389] Traceback (most recent call last):
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine
ERROR 02-12 18:39:01 engine.py:389]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args
ERROR 02-12 18:39:01 engine.py:389]     return cls(ipc_path=ipc_path,
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 75, in __init__
ERROR 02-12 18:39:01 engine.py:389]     self.engine = LLMEngine(*args, **kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/engine/llm_engine.py", line 273, in __init__
ERROR 02-12 18:39:01 engine.py:389]     self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/executor/executor_base.py", line 262, in __init__
ERROR 02-12 18:39:01 engine.py:389]     super().__init__(*args, **kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/executor/executor_base.py", line 51, in __init__
ERROR 02-12 18:39:01 engine.py:389]     self._init_executor()
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/executor/ray_distributed_executor.py", line 90, in _init_executor
ERROR 02-12 18:39:01 engine.py:389]     self._init_workers_ray(placement_group)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/executor/ray_distributed_executor.py", line 355, in _init_workers_ray
ERROR 02-12 18:39:01 engine.py:389]     self._run_workers("init_device")
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/executor/ray_distributed_executor.py", line 481, in _run_workers
ERROR 02-12 18:39:01 engine.py:389]     ray_worker_outputs = ray.get(ray_worker_outputs)
ERROR 02-12 18:39:01 engine.py:389]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
ERROR 02-12 18:39:01 engine.py:389]     return fn(*args, **kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
ERROR 02-12 18:39:01 engine.py:389]     return func(*args, **kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2745, in get
ERROR 02-12 18:39:01 engine.py:389]     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
ERROR 02-12 18:39:01 engine.py:389]   File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 901, in get_objects
ERROR 02-12 18:39:01 engine.py:389]     raise value.as_instanceof_cause()
ERROR 02-12 18:39:01 engine.py:389] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=5115, ip=192.168.12.6, actor_id=2c462b3a885dc17cfeeb289d03000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7fd2ce6f9330>)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/worker/worker_base.py", line 577, in execute_method
ERROR 02-12 18:39:01 engine.py:389]     raise e
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/worker/worker_base.py", line 568, in execute_method
ERROR 02-12 18:39:01 engine.py:389]     return run_method(target, method, args, kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/utils.py", line 2220, in run_method
ERROR 02-12 18:39:01 engine.py:389]     return func(*args, **kwargs)
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/spec_decode/spec_decode_worker.py", line 331, in init_device
ERROR 02-12 18:39:01 engine.py:389]     self.spec_decode_sampler.init_tensors(self.rank,
ERROR 02-12 18:39:01 engine.py:389]   File "/root/vllm_0209/vllm/model_executor/layers/spec_decode_base_sampler.py", line 56, in init_tensors
ERROR 02-12 18:39:01 engine.py:389]     self.num_accepted_tokens = torch.tensor(0,
ERROR 02-12 18:39:01 engine.py:389] RuntimeError: CUDA error: invalid device ordinal
ERROR 02-12 18:39:01 engine.py:389] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-12 18:39:01 engine.py:389] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-12 18:39:01 engine.py:389] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    raise e
  File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/root/vllm_0209/vllm/engine/multiprocessing/engine.py", line 75, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/root/vllm_0209/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/root/vllm_0209/vllm/executor/executor_base.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/root/vllm_0209/vllm/executor/executor_base.py", line 51, in __init__
    self._init_executor()
  File "/root/vllm_0209/vllm/executor/ray_distributed_executor.py", line 90, in _init_executor
    self._init_workers_ray(placement_group)
  File "/root/vllm_0209/vllm/executor/ray_distributed_executor.py", line 355, in _init_workers_ray
    self._run_workers("init_device")
  File "/root/vllm_0209/vllm/executor/ray_distributed_executor.py", line 481, in _run_workers
    ray_worker_outputs = ray.get(ray_worker_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2745, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 901, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=5115, ip=192.168.12.6, actor_id=2c462b3a885dc17cfeeb289d03000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7fd2ce6f9330>)
  File "/root/vllm_0209/vllm/worker/worker_base.py", line 577, in execute_method
    raise e
  File "/root/vllm_0209/vllm/worker/worker_base.py", line 568, in execute_method
    return run_method(target, method, args, kwargs)
  File "/root/vllm_0209/vllm/utils.py", line 2220, in run_method
    return func(*args, **kwargs)
  File "/root/vllm_0209/vllm/spec_decode/spec_decode_worker.py", line 331, in init_device
    self.spec_decode_sampler.init_tensors(self.rank,
  File "/root/vllm_0209/vllm/model_executor/layers/spec_decode_base_sampler.py", line 56, in init_tensors
    self.num_accepted_tokens = torch.tensor(0,
RuntimeError: CUDA error: invalid device ordinal

Looking forward to any suggestions or guidance, thanks a lot!

@ShangmingCai
Contributor

@Neo9061

The root cause is these lines in vllm/model_executor/layers/spec_decode_base_sampler.py:

if isinstance(device, int):
    device = f"{device_type}:{device}"
self.num_accepted_tokens = torch.tensor(0,
                                        dtype=torch.long,
                                        device=device)
self.num_emitted_tokens = torch.tensor(0,
                                       dtype=torch.long,
                                       device=device)

When we run multi-node inference with a tensor parallel size bigger than 8, the device is not converted correctly from the int: the worker's global rank is used as the CUDA ordinal, so ranks on the second node (8-15) point at device indices that do not exist locally.

You can work around it by manually changing device=device to device="cuda" in that file on both nodes, as in the sketch below. If you are using the official run_cluster.sh solution, you need to enter the container on each node to make the change.
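
For illustration only (a minimal sketch of the idea, not the actual patch that later landed via #13269; the function name is made up), the difference between the failing mapping and the workaround looks like this:

import torch

def init_counters_sketch(rank: int, device_type: str = "cuda"):
    # Failing path: the global rank is used as the CUDA ordinal, so on the
    # second node rank 8-15 becomes "cuda:8".."cuda:15", which do not exist
    # on an 8-GPU host:
    # broken_device = f"{device_type}:{rank}"

    # Workaround from this thread: plain "cuda" resolves to the device already
    # selected for this worker (torch.cuda.current_device()), which is always local.
    device = "cuda"
    num_accepted_tokens = torch.tensor(0, dtype=torch.long, device=device)
    num_emitted_tokens = torch.tensor(0, dtype=torch.long, device=device)
    return num_accepted_tokens, num_emitted_tokens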

@Neo9061
Author

Neo9061 commented Feb 14, 2025

(Quoting @ShangmingCai's explanation above, which points to vllm/model_executor/layers/spec_decode_base_sampler.py, lines 54 to 62 at commit 84683fa.)

Got it thank you so much! Let me try and experiment.

@QualityGN

(Quoting @yangsijia-serena's comment above in full.)

Hi, have you been able to reproduce the inference acceleration after enabling MTP on multiple nodes? I have the same environment (2 x 8 x H20) and the same MTP implementation, but I am getting low throughput (about 8.5 on average) and a long scoring_time (100+ ms).

@yangchou19

(Quoting @yangsijia-serena's comment above in full.)

Hello, which speculative-model do you use for MTP, and how can I get it?

@ShangmingCai
Contributor

@yangchou19 "deepseek-ai/DeepSeek-R1".

@yangchou19

@yangchou19 "deepseek-ai/DeepSeek-R1".

Thank you, the command --speculative-model DeepSeekV3MTP \ indicates that only the MTP weights are being loaded.

@ShangmingCai
Contributor

@yangchou19 "deepseek-ai/DeepSeek-R1".

Thank you, the command --speculative-model DeepSeekV3MTP \ indicates that only the MTP weights are being loaded.

Actually, you don't need to set --speculative-model for DeepSeek MTP; please refer to the recommended usage in #12755 (see the sketch below).
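
For illustration, a hedged sketch of what that looks like: simply the earlier command from this thread with --speculative-model dropped (#12755 has the authoritative recommended usage and flags):

python3 \
  -m vllm.entrypoints.openai.api_server \
  --model {model-path} \
  --tensor-parallel-size 16 \
  --trust-remote-code \
  --num-speculative-tokens 2 \
  --enforce-eager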
