Closed
Labels
bug: Something isn't working
Description
Your current environment
Repro command below.
🐛 Describe the bug
Attempting to serve meta-llama/Llama-3.2-11B-Vision-Instruct with a recent vLLM (>= v0.7.3) fails at startup with the error below, raised while determine_num_available_blocks() is executing:
$ vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-num-seqs 8
Traceback (most recent call last):
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 400, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 125, in from_engine_args
    return cls(ipc_path=ipc_path,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 77, in __init__
    self.engine = LLMEngine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 277, in __init__
    self._initialize_kv_caches()
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 426, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/executor_base.py", line 316, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 341, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/vllm/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 182, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 1392, in forward
    assert actual_len >= last_group_len
I have investigated this but do not have a fix yet. Here is what I have found:
- The error occurs because the dummy encoder sequences constructed for profiling are longer than the actual encoder length computed in mllama; for single-image requests, that means longer than 6404 tokens.
- Serving the model works as long as max_seq_len / max_num_seqs <= 6404; with the full sequence length, --max-num-seqs=21 works.
- I think this bug was introduced in "[VLM] Implement merged multimodal processor for Mllama" (#11427).
- Before this PR there was a dummy_encoder_data_for_mllama function responsible for constructing the dummy data.
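The condition in the second finding can be checked with a few lines of arithmetic. This is only a sketch of the observation above, not vLLM code: the helper names are hypothetical, 6404 is the single-image limit quoted in this report, and the full sequence length is assumed to be 131072 tokens (not stated above, but consistent with 21 being the smallest working value).

```python
import math

# Limit quoted in this report for single-image requests (taken as given).
MAX_ENCODER_LEN_SINGLE_IMAGE = 6404

# Assumption: the model's full sequence length; not stated in the report.
ASSUMED_FULL_SEQ_LEN = 131072


def profiling_fits(max_seq_len: int, max_num_seqs: int) -> bool:
    """Return True if max_seq_len / max_num_seqs stays within the limit."""
    return max_seq_len / max_num_seqs <= MAX_ENCODER_LEN_SINGLE_IMAGE


def min_max_num_seqs(max_seq_len: int) -> int:
    """Smallest max_num_seqs for which profiling_fits() holds."""
    return math.ceil(max_seq_len / MAX_ENCODER_LEN_SINGLE_IMAGE)


if __name__ == "__main__":
    # The repro command uses --max-num-seqs 8.
    print(profiling_fits(ASSUMED_FULL_SEQ_LEN, 8))
    print(profiling_fits(ASSUMED_FULL_SEQ_LEN, 21))
    print(min_max_num_seqs(ASSUMED_FULL_SEQ_LEN))
```

Under these assumptions, the repro setting of 8 sequences gives 16384 tokens per sequence, which exceeds the limit, while 21 gives roughly 6242 tokens, which fits.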