Data-parallel evaluation causes an error of no GPU available #245

Open
BaohaoLiao opened this issue Feb 8, 2025 · 0 comments
I tried to use the data-parallel evaluation setting, i.e.

NUM_GPUS=2
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

But it causes the following error, since it doesn't find any available GPUs. When I change data_parallel_size to 1, it works.

RayTaskError(ValueError): ray::run_inference_one_model() (pid=51424, ip=10.141.17.198)
  File "/opt/conda/lib/python3.10/site-packages/lighteval/models/vllm/vllm_model.py", line 339, in run_inference_one_model
    llm = LLM(**model_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 1028, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 210, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 583, in from_engine_args
    executor_class = cls._get_executor_cls(engine_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 558, in _get_executor_cls
    initialize_ray_cluster(engine_config.parallel_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_utils.py", line 296, in initialize_ray_cluster
    raise ValueError(
ValueError: Current node has no GPU available. current_node_resource={'node:__internal_head__': 1.0, 'node:10.141.17.198': 1.0, 'CPU': 6.0, 
'object_store_memory': 16320871833.0, 'memory': 182352875110.0, 'accelerator_type:A100': 1.0}. vLLM engine cannot start without GPU. Make 
sure you have at least 1 GPU available in a node current_node_id='3c39eadb1c3de59e429536907a2550edbafc34c7fb48dc2b02eedf35' 
current_ip='10.141.17.198'.
(run_inference_one_model pid=51426) INFO 02-08 16:04:12 ray_gpu_executor.py:134] use_ray_spmd_worker: False
(run_inference_one_model pid=51426) Calling ray.init() again after it has already been called.
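The "Calling ray.init() again after it has already been called" line suggests a Ray runtime was already up in this environment, possibly one started without GPU resources registered. Before digging into Ray itself, it can help to confirm the launching process can see the GPUs at all. Below is a small stdlib-only sketch (a hypothetical helper, not part of lighteval or vLLM) that counts the GPUs visible to the current process, respecting CUDA_VISIBLE_DEVICES the same way CUDA does:

```python
import os
import shutil
import subprocess


def visible_gpu_count() -> int:
    """Return the number of GPUs the current process can see.

    If CUDA_VISIBLE_DEVICES is set, CUDA restricts the process to the
    listed devices, so we count those entries (an empty value masks out
    all GPUs). Otherwise, fall back to counting the devices reported by
    `nvidia-smi --list-gpus`, returning 0 if the tool is unavailable.
    """
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is not None:
        return len([d for d in env.split(",") if d.strip() != ""])
    if shutil.which("nvidia-smi") is None:
        return 0
    out = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    return len(out.stdout.splitlines()) if out.returncode == 0 else 0
```

If this reports fewer GPUs than NUM_GPUS (for example, because a scheduler or container set CUDA_VISIBLE_DEVICES to a single device), Ray will also register fewer GPU resources than the data-parallel launch requests, which would match the error above. This is an assumption about the failure mode, not a confirmed diagnosis.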