
vllm serve takes a very long time to load MiniCPM-V-2_6 with 100% GPU usage #12402

Answered by yp05327
yp05327 asked this question in Q&A

Thanks for your response!
I finally found that some options are required; otherwise you will run into this issue.
The required options are listed in the official documentation: https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf

After adding --dtype auto --max-model-len 2048, it worked!
I don't know why (maybe one of them, maybe both), but if you need more information, I can help :)
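
For reference, here is a minimal sketch of the full launch command. The model id (`openbmb/MiniCPM-V-2_6`) and the `--trust-remote-code` flag are assumptions not stated in this thread; the two flags that resolved the issue are the ones quoted above.

```bash
# Sketch of the serve command discussed in this thread.
# Assumptions: the Hugging Face model id and --trust-remote-code;
# --dtype auto and --max-model-len 2048 are the flags from the answer.
vllm serve openbmb/MiniCPM-V-2_6 \
    --dtype auto \
    --max-model-len 2048 \
    --trust-remote-code
```

A plausible explanation is that capping --max-model-len reduces the work vLLM does at startup (memory profiling and cache sizing for the full context window), but as noted above it is not confirmed which of the two flags actually mattered.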
