Data-parallel evaluation causes an error of no GPU available #245

Open
BaohaoLiao opened this issue Feb 8, 2025 · 0 comments
I tried to use the data-parallel evaluation setting, i.e.

NUM_GPUS=2
MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=aime24
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

But it causes the following error, since it doesn't find any available GPUs. When I change data_parallel_size to 1, it works.

RayTaskError(ValueError): ray::run_inference_one_model() (pid=51424, ip=10.141.17.198)
  File "/opt/conda/lib/python3.10/site-packages/lighteval/models/vllm/vllm_model.py", line 339, in run_inference_one_model
    llm = LLM(**model_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 1028, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 210, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 583, in from_engine_args
    executor_class = cls._get_executor_cls(engine_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 558, in _get_executor_cls
    initialize_ray_cluster(engine_config.parallel_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_utils.py", line 296, in initialize_ray_cluster
    raise ValueError(
ValueError: Current node has no GPU available. current_node_resource={'node:__internal_head__': 1.0, 'node:10.141.17.198': 1.0, 'CPU': 6.0, 
'object_store_memory': 16320871833.0, 'memory': 182352875110.0, 'accelerator_type:A100': 1.0}. vLLM engine cannot start without GPU. Make 
sure you have at least 1 GPU available in a node current_node_id='3c39eadb1c3de59e429536907a2550edbafc34c7fb48dc2b02eedf35' 
current_ip='10.141.17.198'.
(run_inference_one_model pid=51426) INFO 02-08 16:04:12 ray_gpu_executor.py:134] use_ray_spmd_worker: False
(run_inference_one_model pid=51426) Calling ray.init() again after it has already been called.
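The "Calling ray.init() again after it has already been called" line suggests a Ray runtime was already up in this environment, possibly one started without GPU resources registered. Before digging into Ray itself, it can help to confirm the launching process can see the GPUs at all. Below is a small stdlib-only sketch (a hypothetical helper, not part of lighteval or vLLM) that counts the GPUs visible to the current process, respecting CUDA_VISIBLE_DEVICES the same way CUDA does:

```python
import os
import shutil
import subprocess


def visible_gpu_count() -> int:
    """Return the number of GPUs the current process can see.

    If CUDA_VISIBLE_DEVICES is set, CUDA restricts the process to the
    listed devices, so we count those entries (an empty value masks out
    all GPUs). Otherwise, fall back to counting the devices reported by
    `nvidia-smi --list-gpus`, returning 0 if the tool is unavailable.
    """
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is not None:
        return len([d for d in env.split(",") if d.strip() != ""])
    if shutil.which("nvidia-smi") is None:
        return 0
    out = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    return len(out.stdout.splitlines()) if out.returncode == 0 else 0
```

If this reports fewer GPUs than NUM_GPUS (for example, because a scheduler or container set CUDA_VISIBLE_DEVICES to a single device), Ray will also register fewer GPU resources than the data-parallel launch requests, which would match the error above. This is an assumption about the failure mode, not a confirmed diagnosis.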