For developers, install `mlperf-inf-mm-vl2l` and the development tools with:

```bash
pip install multimodal/vl2l/[dev]
```

After installation, you can check the CLI flags that `mlperf-inf-mm-vl2l` accepts with:

```bash
mlperf-inf-mm-vl2l --help
```

You can enable shell autocompletion for `mlperf-inf-mm-vl2l` with:

```bash
mlperf-inf-mm-vl2l --install-completion
```

> NOTE: Shell autocompletion will take effect once you restart the terminal.

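If you prefer not to open a new terminal, replacing the current shell also works in most setups (a small aside, assuming an interactive bash or zsh session whose startup files load the completion script):

```bash
# Replace the current shell with a fresh instance so the newly
# installed completion script is sourced from your shell startup files.
exec "$SHELL"
```
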
### Start an inference endpoint on your local host machine with vLLM

Please refer to [this guide on how to launch vLLM for various Qwen3 VL MoE models](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html). For example, to serve `Qwen/Qwen3-VL-235B-A22B-Instruct` across 8 GPUs:

```bash
# Flag notes:
#   --gpus all                   : use all the GPUs on this host machine.
#   -v ~/.cache/huggingface:...  : reuse the HuggingFace cache from your host machine.
#   -p 8000:8000                 : this assumes the endpoint will use port 8000.
#   --ipc=host                   : let the container use the host's IPC mechanisms (e.g., shared memory).
#   vllm/vllm-openai:nightly     : you can also use the `:latest` container or a specific release.
#   --model                      : specifies the model for vLLM to deploy.
#   --tensor-parallel-size 8     : 8-way tensor-parallel inference across 8 GPUs.
#   --limit-mm-per-prompt.video 0: the input requests will contain images only (i.e., no videos).
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:nightly \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --tensor-parallel-size 8 \
    --limit-mm-per-prompt.video 0
```
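
Once the server finishes loading, you can sanity-check the endpoint through vLLM's OpenAI-compatible HTTP API. A minimal sketch, assuming the default port 8000 from the command above; the image URL in the second request is a hypothetical placeholder:

```bash
# List the models served by the endpoint; the response should include
# Qwen/Qwen3-VL-235B-A22B-Instruct once the weights have loaded.
curl http://localhost:8000/v1/models

# Send one image-plus-text request through the OpenAI-compatible chat API.
# The image URL below is a placeholder; substitute any reachable image.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }]
    }'
```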