Merged

Changes from all commits
629 commits
73ff872
[Bugfix] Fix typo in Qwen3 Next model executor (#28960)
Nepherpitou Nov 19, 2025
6a25ea5
[Docs] Update oneshot imports (#28188)
UranusSeven Nov 19, 2025
3d4e7d3
[Model][QwenVL] Simplify cos/sin rotary embedding indexing (#28962)
lgeiger Nov 19, 2025
71d0ae1
[Misc] Update embedding/cross encoder tests to use `mteb` v2 (#27329)
Samoed Nov 19, 2025
a4511e3
Speed up macOS smoke test (#28954)
mgoin Nov 19, 2025
7ed27f3
[Doc]: fix typos in various files (#28945)
didier-durand Nov 19, 2025
ae4821a
Add CPU support model (#28697)
louie-tsai Nov 19, 2025
d69062c
add support for --fully-sharded-loras in fused_moe (#28761)
gnovack Nov 19, 2025
fdf9348
[Docs] Clean up moe_kernel_features.md (#28530)
windsonsea Nov 19, 2025
8151609
refactor(cpu_types_scalar.hpp): Unify scalar loop implementations usi…
ihb2032 Nov 19, 2025
bbc6c2f
[CI/Build] Fix broken build on Apple M1 (#28999)
j20120307 Nov 19, 2025
97cfa99
[Docs] Take env var definition out of folded admonition (#29005)
hmellor Nov 19, 2025
ba558c0
[config] Expose `get_total_num_hidden_layers()` in ModelConfig (#28961)
ptovam Nov 19, 2025
da2f680
[Feat][Perf] Enable deepep-low-latency with round-robin expert placem…
cboss6 Nov 19, 2025
09540cd
[Doc]: fix typos in various files (#29010)
didier-durand Nov 19, 2025
4f5299f
Relax Transformers modeling backend MoE experts check (#28952)
hmellor Nov 19, 2025
2c8b918
[CI] Reorganize compile tests so new tests are automatically included…
gmagogsfm Nov 19, 2025
1ffe934
[torch.compile] caching of config fields should be opt-out by default…
vnadathur Nov 19, 2025
48fc8b1
[BugFix] Fix async-scheduling + FlashAttn MLA (#28990)
LucasWilkinson Nov 19, 2025
d44e9df
[Model][Mamba] Add selector for mamba attention backend and make it p…
shen-shanshan Nov 19, 2025
a8b7030
Update `rope_scaling` to `rope_parameters` in preparation for Transfo…
hmellor Nov 19, 2025
0c80efd
GLM-V video segmentation solution adjustment (#28941)
zRzRzRzRzRzRzR Nov 19, 2025
61728cd
Re-enable FlashInfer for Llama4 on Blackwell in e2e fusion tests (#28…
Copilot Nov 19, 2025
3319a49
[Core] Reuse created spec tokens lists to mitigate GC cost (#28917)
Jialin Nov 19, 2025
fe69f33
[Kernels] Improve H200 Fused MoE Config (#28992)
robertgshaw2-redhat Nov 19, 2025
9d2d561
[Bugfix] Fix precision corruption when shared_experts_stream=None (#…
zhyajie Nov 19, 2025
ac10fd3
Upstreaming aiter triton attention backend as a new backend (#28701)
maleksan85 Nov 19, 2025
02f5903
Eagle: MM Cuda Graphs with MRope (#28896)
IzzyPutterman Nov 19, 2025
2fd893b
[Feature] Prefill Context Parallel (PCP) basic support (#28718)
pisceskkk Nov 19, 2025
68d7231
[CI/Build] Fix test_prefix_prefill for AMD (#28905)
rjrock Nov 19, 2025
1607e66
[Bug] Fix Batch Invariant MLA test (#28967)
yewentao256 Nov 19, 2025
cdeec2e
[BugFix] Ray with multiple nodes (#28873)
juliendenize Nov 19, 2025
613abb5
[MoE] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#25990)
wenscarl Nov 19, 2025
88f5b19
[DeepSeek] Fix DeepSeek V3.2 Rope Embedding (#28968)
zyongye Nov 19, 2025
22e44ad
[ROCm][CI] Fix Weight Loading With Multiple GPU Tests on ROCm (#28984)
micah-wil Nov 19, 2025
8f4f77a
[BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (#29036)
LucasWilkinson Nov 19, 2025
cb0a7b4
[Bugfix] Move flashinfer kernel check into ```__init__``` function of…
maxyanghu Nov 19, 2025
0075bff
[CI] Fix precommit `rope_theta` issue (#29040)
yewentao256 Nov 19, 2025
8e38e99
[Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod (#28…
JartX Nov 19, 2025
3aaa94a
[Performance] Reduce DeepGEMM N dim restriction from 128 to 64 multip…
alexm-redhat Nov 19, 2025
5031cd5
[Refactor] Optimize `select_experts` (#28069)
yewentao256 Nov 19, 2025
537cc63
[GC Debugger] Simply and improve GC Debugger Utils (#29029)
Jialin Nov 20, 2025
9ccef8e
[Misc] Colorize logs (#29017)
njhill Nov 20, 2025
1d64287
[torchao] fix safetensors for sharding (#28169)
liangel-02 Nov 20, 2025
05c2dee
[DeepSeek + LMCache Multiprocess] handle MLA for deepseek model + LMC…
KuntaiDu Nov 20, 2025
3fb0d90
[AMD] Use Decoupled Kernel Block Size to Support AITER MLA block_size…
zq1997 Nov 20, 2025
3168285
[cpu][ci] Add initial set of tests for Arm CPUs (#28657)
fadara01 Nov 20, 2025
fcbcba6
[Feat] Iteration-level profiling for Torch and CUDA profiler (#28987)
benchislett Nov 20, 2025
a8c5368
Consolidate Nvidia ModelOpt quant config handling for all quantizatio…
shengliangxu Nov 20, 2025
0cca9b4
[Bugfix] Fix precision loss in LoRA-wrapped RowParallelLinear by fusi…
prashanth058 Nov 20, 2025
fe25772
[Bugfix] Handle broken frames in video loading (#29001)
gcanlin Nov 20, 2025
64192d5
[Bugfix] Revert custom attention mask for gemma3-mm (#28995)
Isotr0py Nov 20, 2025
a9705a2
[Model][QwenVL] Replace `torch.repeat_interleave` with faster `np.rep…
lgeiger Nov 20, 2025
1c7bcc5
[Frontend] Allow parsed tool arguments (#28820)
qgallouedec Nov 20, 2025
20e4497
[V0 Deprecation] Remove `num_lookahead_slots` (#29000)
DarkLight1337 Nov 20, 2025
7218f83
[ROCm][BugFix] Fix shared expert loading error when disable `VLLM_ROC…
ganyi1996ppo Nov 20, 2025
1e1c067
[ci][amd] fix EPLB execution test (#28742)
bradleyhd Nov 20, 2025
2c52c7f
[Bug] Fix torch dynamo warning Dynamo detected a call to a `functools…
yewentao256 Nov 20, 2025
322cb02
[CI/Build][AMD] Fix import errors in tests/kernels/attention (#29032)
rasmith Nov 20, 2025
a903d59
cleanup at::Tag::needs_fixed_stride_order (#28974)
BoyuanFeng Nov 20, 2025
fb8851f
[Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu (#28760)
Flink-ddd Nov 20, 2025
dc45efc
[BugFix] Fix Llama4 Pipeline Parallelism Assert Error (#28577)
River12 Nov 20, 2025
edfe867
[Misc] don't cache `CUTLASS_REVISION` var in CMakeLists.txt (#28518)
jinzhen-lin Nov 20, 2025
66483a9
[Chore] Update `xgrammar` version from 0.1.25 to 0.1.27 (#28221)
cjackal Nov 20, 2025
6eb745d
Add truncate arg to yarn to match openai implementation of gpt-oss (#…
ashors1 Nov 20, 2025
06c20c9
[ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA (#26670)
ganyi1996ppo Nov 20, 2025
c0c2dd1
[BugFix] kv_offloading: Fix bug in loading of partial cpu blocks (#28…
orozery Nov 20, 2025
c9e0931
[MODEL] Implement plamo3 (#28834)
Alnusjaponica Nov 20, 2025
371b1d4
[RL] Add Pause and Resume Generation for Asynchronous RL Training (#2…
SamitHuang Nov 20, 2025
93c8672
[Bugfix] Fix spec decode memory regression after #28549 (#28819)
zhewenl Nov 20, 2025
a2e9ebe
[BugFix] Fix flash_attn import in `siglip2navit.py` (#29082)
faaany Nov 20, 2025
82b05b1
[BugFix] [FEAT] Enable fastsafetensors for ROCm platform (#28225)
tjtanaa Nov 20, 2025
56f45ed
[Frontend] Optimize beam search loop by sorting and then splicing (#1…
zhanggzh Nov 20, 2025
2292438
Updating the mirror of test-amd.yaml as of 2025-11-18 (#29016)
Alexei-V-Ivanov-AMD Nov 20, 2025
e5bfcb6
[BugFix][PD]: make example proxy usable with P2pNcclConnector (#26628)
pandalee99 Nov 20, 2025
6474647
[KVConnector][Core] Support cross-layer KV blocks (#27743)
orozery Nov 20, 2025
114b0e2
[chore] Update annotate release scripts (#29077)
khluu Nov 20, 2025
4d01b64
[Bugfix] - Add Trace Headers to Beam Search Path (#29100)
dsuhinin Nov 20, 2025
3d84ef9
[CI/Build][AMD] Skip if flash_attn_varlen_func not available in test_…
rasmith Nov 20, 2025
5e5a7eb
[CI/Build] Make test_attention_selector.py run tests on correct platf…
rasmith Nov 20, 2025
3fd7418
Fixes bench (#29058)
drisspg Nov 20, 2025
8237ab8
[CI/Build] Skip lm-format-enforcer tests in test_struct_output_genera…
rasmith Nov 20, 2025
c7a29d2
[CI/Build] Remove skip global cleanup in test_struct_output_generate.…
rasmith Nov 20, 2025
dd39f91
[Doc] cleanup TPU documentation and remove outdated examples (#29048)
RobMulla Nov 21, 2025
986ab5d
[CI Bugfix] Fix Kernels DeepGEMM Test (H100) (#29106)
mgoin Nov 21, 2025
87cbbdf
Update model references for OLMo3 (#29099)
mgoin Nov 21, 2025
df44df0
[Feature] Shared Experts Overlap with FI deepgemm swap kernel, 2.2% t…
yewentao256 Nov 21, 2025
9875be6
[LoRA][2/2]Remove LoRA extra vocab (#28545)
jeejeelee Nov 21, 2025
ed6ae1e
[AITER] [ROCm] Fix crash when loading llama4 model with old aiter ver…
xli Nov 21, 2025
e1eefa4
[Bug] Fix torch warning of tf32 usage (#29112)
yewentao256 Nov 21, 2025
3f5f36d
[ROCm] Fix for import when building with upstream triton for gfx1100 …
hongxiayang Nov 21, 2025
56669c1
[CI] Fix mypy for `vllm/v1/worker` (#29037)
yewentao256 Nov 21, 2025
0e741c1
[Bugfix] Fix Plamo3 rope handling (#29092)
DarkLight1337 Nov 21, 2025
a982f5b
[kernel][perf] support uncontiguous input for rms_norm kernel (#28103)
izhuhaoran Nov 21, 2025
0730414
[Core] Add audio_embeds support to chat completions (#29059)
jeremyteboul Nov 21, 2025
698024e
[Doc] update installation guide regarding aarch64+cuda pytorch build …
soodoshll Nov 21, 2025
56e96b3
[V0 Deprecation] Remove `best_of` (#29090)
DarkLight1337 Nov 21, 2025
8c25f9c
[BugFix] skip combo kernel on cpu (#29129)
BoyuanFeng Nov 21, 2025
11857a0
[Attention] Add ROCM_AITER_MLA_SPARSE to attention backend registry (…
MatthewBonanni Nov 21, 2025
30b9c67
Revert "[Redo] #26368 (#28771)" (#29121)
Jialin Nov 21, 2025
b4734b9
[Bugfix] Fix default MM LoRA alignment for single str prompts (#29140)
alex-jw-brooks Nov 21, 2025
e4c3182
[Small] Capture AttributeError when checking ray dependency. (#29024)
huachenheli Nov 21, 2025
7d6da48
[Minor][Clean] Remove the legacy assertion in video (#29150)
gcanlin Nov 21, 2025
8ac3a41
[CI Failure] Fix Gemma3 RoPE configuration for sliding attention laye…
hl475 Nov 21, 2025
4d7231e
Revert #28875 (#29159)
DarkLight1337 Nov 21, 2025
b34129b
[Misc] remove useless v1 env (#29164)
david6666666 Nov 21, 2025
aab0102
[V0 deprecation] Remove more V0 references (#29088)
DarkLight1337 Nov 21, 2025
cca2d2c
[Core] Align whisper closer to other multimodal models (#27292)
russellb Nov 21, 2025
2b1b3df
Update Dockerfile to use gcc-toolset-14 and fix test case failures on…
bhagyashrigai Nov 21, 2025
9452863
Revert "Revert #28875 (#29159)" (#29179)
DarkLight1337 Nov 21, 2025
fc9f821
fix cross attention (#28346)
fsx950223 Nov 21, 2025
2092ce8
Tool Call Parser logs should not contain user input / model output ex…
sfbemerk Nov 21, 2025
434f3d3
Fix mistral config (#29172)
juliendenize Nov 21, 2025
f1805db
[Perf] These changes enhance the NUMA functionality of vllm for syste…
skaraban3807 Nov 21, 2025
4050bae
[Doc] Update plugin doc (#28532)
wangxiyuan Nov 21, 2025
d7219bc
[Misc] Move dynamic seed initialization to `EngineArgs` (#29165)
DarkLight1337 Nov 21, 2025
711241c
[CI/Build] Fix illegal memory access and unsupported test in kernels/…
rasmith Nov 21, 2025
1f400c5
[CI] Add batch invariant test to ci (#27842)
yewentao256 Nov 21, 2025
30b44a1
GPU Model Runner V2 (#25266)
WoosukKwon Nov 21, 2025
b7f1f49
Upstream triton fp4 weight preshuffle (#28888)
maleksan85 Nov 21, 2025
a42ab31
[Log] Optimize startup log (#28948)
yewentao256 Nov 21, 2025
e99e467
[CI/Build][Kernel][AMD] Move extra dim to after load in _fwd_kv_paral…
rasmith Nov 21, 2025
b4c8fba
Add TRTLLM MoE NVFP4 kernel to CompressedTensorsW4A4MoeMethod (#28892)
Victor49152 Nov 21, 2025
460d02a
[NIXL] Fix after virtual block_size for host_buffer with heter kv_lay…
xuechendi Nov 21, 2025
75648b1
[ROCm][CI] Fix config/test_config_generation.py (#29142)
charlifu Nov 21, 2025
ceca060
[Deprecation] Deprecate `seed=None` (#29185)
DarkLight1337 Nov 21, 2025
1bed891
[Chore] Fix pre-commit error after #25266 (#29190)
WoosukKwon Nov 21, 2025
1840c5c
[BugFix] Make sure to allocate worst case MoE workspace during profil…
LucasWilkinson Nov 21, 2025
53a1ba6
[log] add weights loading time log to sharded_state loader (#28628)
andyxning Nov 21, 2025
c68c7b4
[BugFix] Fix missing symbol triggering FA2 fallback on Hopper (#29107)
LucasWilkinson Nov 21, 2025
57430fc
Default model load/config/tokenizer to `mistral` format if relevant f…
juliendenize Nov 21, 2025
3137991
[BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor (…
varun-sundar-rabindranath Nov 21, 2025
c6fa389
[KV Connector] Fix async connector prefix cache metrics (#28585)
markmc Nov 21, 2025
e9af6ba
[Model Runner V2] Optimize Gumbel Sampling Kernel (#29210)
WoosukKwon Nov 21, 2025
30d6466
[BugFix] Fix Eagle `IndexError: list index out of range` for even `nu…
LucasWilkinson Nov 22, 2025
d5dbdbf
[docs] Fix cudagraph mode config (#29170)
angelayi Nov 22, 2025
9a3101b
[Rocm][CI] Fix DeekSeek V2-Lite Accuracy CI (#29135)
charlifu Nov 22, 2025
1d34eb1
[CI] Bug: Fix triton import issue (#29202)
yewentao256 Nov 22, 2025
d045e22
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s (#29217)
lgeiger Nov 22, 2025
ed8e684
[CI/Build] Add terratorch for AMD (#29205)
rjrock Nov 22, 2025
5c8f2ad
[Bugfix] Fix block size in block_table with PCP (#29094)
Livinfly Nov 22, 2025
1ef9c9e
[CI/Build] Disable test_gptoss_tp.py in 'LoRA TP Test' group for ROCm…
qli88 Nov 22, 2025
052950e
Add fused MoE config for H200 E160 N192 fp8 (#29182)
FlintyLemming Nov 22, 2025
6f40350
[CI/Build][AMD] Enable Entrypoints Integration Test (Pooling) to run …
rasmith Nov 22, 2025
77e1c03
[chore][LMCache connector] Remove useless logs from lmcache connector…
ApostaC Nov 22, 2025
fd65015
[CI/Build] Only use supported types and features on ROCm in MoE kerne…
rasmith Nov 22, 2025
933f67e
[Bugfix]Fix a conditional to not check zero value (#28754)
gmagogsfm Nov 22, 2025
1489902
[LoRA] Cleanup FusedMoEWithLoRA (#29187)
jeejeelee Nov 22, 2025
e905605
[Model Runner V2] Limit cudagraph size to max decode batch size (#29221)
WoosukKwon Nov 22, 2025
742e9ff
[responsesAPI] parse reasoning item input (#28248)
qandrew Nov 22, 2025
ea38474
[Frontend][Responses API] Multi-turn (with type: "output_text") suppo…
madskildegaard Nov 22, 2025
988ee66
Handle triton kernel import exception (#29062)
hjh0119 Nov 22, 2025
e6309ac
Simplify `from_blob` usage in `get_cuda_view_from_cpu_tensor` (#29027)
janeyx99 Nov 22, 2025
a4fdf24
[CI/Build] Skip tests that require libcudart in test_lmcache_integrat…
rasmith Nov 22, 2025
8e22da1
[CI/Build Don't add FLASHINFER backend in test_cpu_offloading.py (#29…
rasmith Nov 22, 2025
5a48025
[Misc] Further clean up chunked prefill and prefix caching init (#29186)
DarkLight1337 Nov 22, 2025
6965a39
Fix: Resolve circular import in model_loader/utils.py (#29189)
nandan2003 Nov 22, 2025
2d4978a
fix: clean up function never use in setup.py (#29061)
yihong0618 Nov 22, 2025
5f7209a
[tiny] Remove unsupported TRITON_MLA backend from batch invariance (#…
bwasti Nov 22, 2025
066209a
[Attention] Refactor FA `block_size` limitations to hybrid models onl…
NickLucche Nov 22, 2025
d44a63c
[BugFix] Fix returned logprobs with spec decode + prefill chunking (#…
njhill Nov 22, 2025
ae66818
[Misc] Fix pre-commit (#29238)
DarkLight1337 Nov 22, 2025
d84d8f4
Fix EVS crash when using `video_embeds` inputs in Qwen2.5-VL (#29232)
skyloevil Nov 22, 2025
f55c76c
chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning (#29240)
coval3nte Nov 22, 2025
730bd35
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs wit…
fadara01 Nov 22, 2025
d1cf821
[Bugfix] Use HF config fields as fallback when loading Mistral config…
DarkLight1337 Nov 22, 2025
eb5352a
[CI/build] Removes source compilation from runtime image (#26966)
bbartels Nov 22, 2025
7df331c
[BugFix] Fix chunked prompt logprobs + preemption (#29071)
njhill Nov 22, 2025
df78aee
Refactor: Move CUDA graph dispatch logic earlier (#27382)
yiz-liu Nov 22, 2025
472fdee
[Chore] Update batch invariant code owner (#29246)
yewentao256 Nov 22, 2025
4587063
Patch DeepEP when building docker image with CUDA 13 (#29154)
soodoshll Nov 22, 2025
5f96c00
[Fix] Add SM check to flashinfer MOE backend (#29144)
jiahanc Nov 23, 2025
3ed767e
docs: fixes distributed executor backend config for multi-node vllm (…
michaelact Nov 23, 2025
389aa1b
[Doc] Update more docs with respect to V1 (#29188)
DarkLight1337 Nov 23, 2025
20ee418
[Model Runner V2] Minor fix for cudagraph_utils (#29256)
WoosukKwon Nov 23, 2025
71362ff
[CI/Build][AMD] Skip test_multi_shared_storage_connector_consistency …
rasmith Nov 23, 2025
3999442
[CI/Build][AMD] Add check for flash_att_varlen_func to test_tree_atte…
rasmith Nov 23, 2025
55c21c8
[ROCm][CI] Fix "Cannot re-initialize CUDA in forked subprocess" in te…
micah-wil Nov 23, 2025
6fb0215
[Bugfix] Use lazy string reference for DeepseekV3Config in config reg…
yongming-qin Nov 23, 2025
7f12c82
[Model Runner V2] Change bookkeeping logic in preparation for spec de…
WoosukKwon Nov 23, 2025
b004c00
[Model Runner V2] Support spec decoding [1/N] (#29274)
WoosukKwon Nov 23, 2025
62d54ba
[Model Runner V2] Optimize CUDA graph capture time (#29275)
WoosukKwon Nov 23, 2025
3e1ad40
[Model Runner V2] Add apply_temperature option to gumbel_sample (#29276)
WoosukKwon Nov 23, 2025
c309bb5
[Bugfix] Update Gradio OpenAI Chatbot Webserver example to new Gradio…
joshiemoore Nov 24, 2025
1073ba6
[LoRA] Optimize 3D MoE logic (#29222)
jeejeelee Nov 24, 2025
3085478
[Model] Add OpenCUA-7B support (#29068)
lim4349 Nov 24, 2025
5253f42
[ROCm] Support for Whisper v1 with Aiter Unified Attention and Aiter …
apinge Nov 24, 2025
0ff7082
[Core] Deprecate `xformers` (#29262)
ywang96 Nov 24, 2025
ed40d85
[BugFix] Fix R-VL model loading error (#29299)
faaany Nov 24, 2025
68dfe28
[Feature][Benchmark] add --link-vars can filter when serve_param equa…
lengrongfu Nov 24, 2025
8005e60
[Bugfix][Rocm] Fix shared expert weight loading failure in DeepSeek-M…
zhyajie Nov 24, 2025
eca7a8f
[Doc]: fix typos in various files (#29230)
didier-durand Nov 24, 2025
4de8786
[CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x…
R3hankhan123 Nov 24, 2025
2601f18
[EPLB] Optimize EPLB for Async Rearrange Experts (#22179)
david6666666 Nov 24, 2025
f716a15
Update KServe guide link in documentation (#29258)
terrytangyuan Nov 24, 2025
7a228b5
Add option to use unbacked, and backed size obl dynamic shapes for mo…
laithsakka Nov 24, 2025
e48b2e6
[Bugfix] [ROCm] [UX] Reorganize ROCm Backend Selection Logic (#26980)
vllmellm Nov 24, 2025
656516c
[Bugfix] properly handle nested json with llama3 tool parser (#27701)
Aydin-ab Nov 24, 2025
e924bbb
[Build/CI][DP/EP] Add QWen/Qwen3-30B-A3B-FP8 + EPLB tests to Nightly …
varun-sundar-rabindranath Nov 24, 2025
26a4655
[NIXL] Use config to enable telemetry + NIXL version bump (#29305)
NickLucche Nov 24, 2025
cc313cb
[Model Runner V2] Implement Single-step Eagle 1 (#29300)
WoosukKwon Nov 24, 2025
cec418b
[Model Runner V2] Change Numba AoT to JIT (#29328)
WoosukKwon Nov 24, 2025
8f06614
[MoE][Refactor] Make select_experts a non-static method (#29067)
bnellnm Nov 24, 2025
839c6b7
[Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inp…
huachenheli Nov 24, 2025
97588c4
[Model Runner V2] Add minor clarification comments for Eagle (#29332)
WoosukKwon Nov 24, 2025
4d6afca
[CI/Build] Moves to cuda-base runtime image while retaining minimal J…
bbartels Nov 24, 2025
3cfa63a
[XPU]fix Kimi-VL-A3B-thinking on xpu (#29309)
yma11 Nov 24, 2025
f32c7d6
[Model Runner V2] Simplify Eagle bookkeeping with num_rejected (#29347)
WoosukKwon Nov 24, 2025
84371da
[Tests] Verify gpt_oss package is installed in harmony tests (#29336)
njhill Nov 24, 2025
4dd42db
Remove VLLM_SKIP_WARMUP tip (#29331)
tlrmchlsmth Nov 24, 2025
71df2a5
[Hybrid Allocator] Better layer padding strategy for gpt-oss eagle (#…
heheda12345 Nov 24, 2025
c17610e
[Bugfix] Only use triton_kernels for MXFP4 on SM90 and SM100 (#29339)
mgoin Nov 24, 2025
699bca7
[UX] Raise error for attn backend of batch invariant (#29348)
yewentao256 Nov 25, 2025
5f9679a
[Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden…
hjjq Nov 25, 2025
b8328b4
[XPU] upgrade torch & ipex 2.9 on XPU platform (#29307)
jikunshang Nov 25, 2025
a178a0b
[BugFix] Fix duplicate id tool-call race condition (#29355)
njhill Nov 25, 2025
a4ad43a
Scheduled removal of `ParallelConfig`'s direct child EPLB fields (#29…
hmellor Nov 25, 2025
6f1355a
[Perf] Disable DeepGEMM MoE by default when TP=8 is used (#29346)
mgoin Nov 25, 2025
77e10c9
[Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's per…
ganyi1996ppo Nov 25, 2025
cb7214d
[ROCm][MLA] enable fp8 MLA decode on ROCm (#28032)
gbyu-amd Nov 25, 2025
22b42b5
[CI][ROCm] Install arctic-inference on ROCm tests (#29344)
divakar-amd Nov 25, 2025
7012d8b
[Docker] Optimize Dockerfile: consolidate apt-get and reduce image si…
princepride Nov 25, 2025
9cf4eda
[Metrics] Scheduled removal of deprecated metrics (#29330)
markmc Nov 25, 2025
87185c8
[Bugfix] Make deprecated `--task embedding` consistent with `--runner…
maryamtahhan Nov 25, 2025
92effb0
[Model] Add HunyuanOCR support (#29327)
Isotr0py Nov 25, 2025
81db702
[Attention] add `_cudagraph_support` for linear attention (#28934)
ZJY0516 Nov 25, 2025
2d9ee28
[CI/Test Fix] Fix CP tests on Blackwell (#29338)
LucasWilkinson Nov 25, 2025
316c849
Scheduled removal of `guided_*` config fields (#29326)
hmellor Nov 25, 2025
a21256c
Add TP CLI argument to multimodal inference examples (#29301)
faaany Nov 25, 2025
ce58fdc
Fix PoolingParams.skip_reading_prefix_cache type (#29364)
kflu Nov 25, 2025
40a6f53
Display warning only when ROCm version is less than Pytorch required …
Inokinoki Nov 25, 2025
eec9037
sync upstream
kliuae Nov 25, 2025
7992324
[BugFix] Use unique ids for different transcription prompts (#29372)
njhill Nov 25, 2025
64deead
[Bugfix] [ROCm] [UX]: revert Flex attention backend (#29371)
vllmellm Nov 25, 2025
98caead
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up we…
fadara01 Nov 25, 2025
fe3a4f5
[CI/Build] Pin torchgeo dependency for AMD (#29353)
rjrock Nov 25, 2025
888152b
Allow oot custom compiler extension via CompilerInterface (#28623)
wxsIcey Nov 25, 2025
f242cfc
[Perf] use cpu all reduce to avoid sync when async_scheduling & dp > …
izhuhaoran Nov 25, 2025
12c007e
EAGLE Support DP>1 (#26086)
Flechman Nov 25, 2025
ef1f703
[ROCm][CI] Fix test_cudagraph_mode failure in AMD CI (#29367)
micah-wil Nov 25, 2025
6330f94
[Bugfix] Fix GPT-OSS AR+NORM fusion (#28841)
elvischenv Nov 25, 2025
67fc16c
[Bugfix] If chunked_prefill is disabled, end the scheduling early. (#…
noooop Nov 25, 2025
db29061
[Misc] Streamline unique id generation (#29375)
njhill Nov 25, 2025
9961a6e
sync upstream
kliuae Nov 25, 2025
b651d16
sync and resolve conflicts
kliuae Nov 25, 2025
3217a4c
pre-commit
kliuae Nov 26, 2025
d8214bd
[bugfx mxpf4] Infer mxfp4 quantmethod from layer
ZhiweiYan-96 Nov 27, 2025
43c8727
Use aw4a16 config
ZhiweiYan-96 Nov 28, 2025
0b7be71
lint
ZhiweiYan-96 Nov 28, 2025
e733c1d
Add comments
ZhiweiYan-96 Nov 28, 2025
This file was deleted.

2 changes: 1 addition & 1 deletion .buildkite/release-pipeline.yaml
@@ -132,7 +132,7 @@ steps:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
30 changes: 19 additions & 11 deletions .buildkite/scripts/annotate-release.sh
@@ -2,22 +2,29 @@

set -ex

# Get release version and strip leading 'v' if present
RELEASE_VERSION=$(buildkite-agent meta-data get release-version | sed 's/^v//')

if [ -z "$RELEASE_VERSION" ]; then
echo "Error: RELEASE_VERSION is empty. 'release-version' metadata might not be set or is invalid."
exit 1
# Get release version, default to 1.0.0.dev for nightly/per-commit builds
RELEASE_VERSION=$(buildkite-agent meta-data get release-version 2>/dev/null | sed 's/^v//')
if [ -z "${RELEASE_VERSION}" ]; then
RELEASE_VERSION="1.0.0.dev"
fi

buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
To download the wheel:
To download the wheel (by commit):
\`\`\`
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .

aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
\`\`\`

To download the wheel (by version):
\`\`\`
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .

aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu126/vllm-${RELEASE_VERSION}+cu126-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu129/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu130/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux1_x86_64.whl .
\`\`\`

To download and upload the image:
Expand All @@ -38,9 +45,10 @@ docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker push vllm/vllm-openai:latest-aarch64
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64

docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64 --amend
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 --amend
docker manifest rm vllm/vllm-openai:latest
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
docker manifest push vllm/vllm-openai:latest
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
\`\`\`
EOF
EOF
18 changes: 7 additions & 11 deletions .buildkite/scripts/hardware_ci/run-amd-test.sh
@@ -59,7 +59,7 @@ while true; do
fi
done

echo "--- Pulling container"
echo "--- Pulling container"
image_name="rocm/vllm-ci-private:${BUILDKITE_COMMIT}"
container_name="rocm_${BUILDKITE_COMMIT}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
docker pull "${image_name}"
@@ -78,17 +78,13 @@ HF_MOUNT="/root/.cache/huggingface"
commands=$@
echo "Commands:$commands"

if [[ $commands == *"pytest -v -s basic_correctness/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s basic_correctness/test_basic_correctness.py"}
fi
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"pytest -v -s basic_correctness/test_basic_correctness.py"}

if [[ $commands == *"pytest -v -s models/test_registry.py"* ]]; then
commands=${commands//"pytest -v -s models/test_registry.py"/"pytest -v -s models/test_registry.py -k 'not BambaForCausalLM and not GritLM and not Mamba2ForCausalLM and not Zamba2ForCausalLM'"}
fi

if [[ $commands == *"pytest -v -s compile/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s compile/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s compile/test_basic_correctness.py"}
fi
commands=${commands//"pytest -v -s compile/test_basic_correctness.py"/"pytest -v -s compile/test_basic_correctness.py"}

if [[ $commands == *"pytest -v -s lora"* ]]; then
commands=${commands//"pytest -v -s lora"/"VLLM_ROCM_CUSTOM_PAGED_ATTN=0 pytest -v -s lora"}
@@ -181,13 +177,13 @@ if [[ -z "$render_gid" ]]; then
exit 1
fi

# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
# assign job count as the number of shards used
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
# assign job count as the number of shards used
commands=$(echo "$commands" | sed -E "s/--num-shards[[:blank:]]*=[[:blank:]]*[0-9]*/--num-shards=${PARALLEL_JOB_COUNT} /g" | sed 's/ \\ / /g')
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
# assign shard-id for each shard
commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands_gpu=$(echo "$commands" | sed -E "s/--shard-id[[:blank:]]*=[[:blank:]]*[0-9]*/--shard-id=${GPU} /g" | sed 's/ \\ / /g')
echo "Shard ${GPU} commands:$commands_gpu"
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \
64 changes: 64 additions & 0 deletions .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh
@@ -0,0 +1,64 @@
#!/bin/bash

# This script build the CPU docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex

# allow to bind to different cores
CORE_RANGE=${CORE_RANGE:-0-16}
OMP_CORE_RANGE=${OMP_CORE_RANGE:-0-16}
NUMA_NODE=${NUMA_NODE:-0}

export CMAKE_BUILD_PARALLEL_LEVEL=32

# Setup cleanup
remove_docker_container() {
set -e;
docker rm -f cpu-test-"$NUMA_NODE" || true;
}
trap remove_docker_container EXIT
remove_docker_container

# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE" --target vllm-test -f docker/Dockerfile.cpu .

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=16 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"

function cpu_tests() {
set -e
export NUMA_NODE=$2

docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pip list"

# offline inference
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"

# Run kernel tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -v -s tests/kernels/test_onednn.py
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py"

# basic online serving
docker exec cpu-test-"$NUMA_NODE" bash -c '
set -e
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS vllm serve meta-llama/Llama-3.2-3B-Instruct --max-model-len 2048 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--endpoint /v1/completions
kill -s SIGTERM $server_pid &'
}

# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 2h bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
10 changes: 6 additions & 4 deletions .buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh
@@ -25,20 +25,22 @@ function cpu_tests() {

# offline inference
podman exec -it "$container_id" bash -c "
export TORCH_COMPILE_DISABLE=1
set -xve
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m" >> $HOME/test_basic.log

# Run basic model test
podman exec -it "$container_id" bash -c "
export TORCH_COMPILE_DISABLE=1
set -evx
pip install pytest pytest-asyncio einops peft Pillow soundfile transformers_stream_generator matplotlib
pip install sentence-transformers datamodel_code_generator
pip install sentence-transformers datamodel_code_generator tblib
# Note: disable Bart until supports V1
# pytest -v -s tests/models/language/generation/test_bart.py -m cpu_model
pytest -v -s tests/models/language/generation/test_common.py::test_models[False-5-32-openai-community/gpt2]
pytest -v -s tests/models/language/generation/test_common.py::test_models[False-5-32-facebook/opt-125m]
pytest -v -s tests/models/language/generation/test_common.py::test_models[False-5-32-google/gemma-1.1-2b-it]
pytest -v -s tests/models/language/generation/test_common.py::test_models[False-False-5-32-openai-community/gpt2]
pytest -v -s tests/models/language/generation/test_common.py::test_models[False-False-5-32-facebook/opt-125m]
pytest -v -s tests/models/language/generation/test_common.py::test_models[False-False-5-32-google/gemma-1.1-2b-it]
pytest -v -s tests/models/language/pooling/test_classification.py::test_models[float-jason9693/Qwen2.5-1.5B-apeach]
# TODO: Below test case tests/models/language/pooling/test_embedding.py::test_models[True-ssmits/Qwen2-7B-Instruct-embed-base] fails on ppc64le. Disabling it for time being.
# pytest -v -s tests/models/language/pooling/test_embedding.py -m cpu_model" >> $HOME/test_rest.log
Expand Down
14 changes: 7 additions & 7 deletions .buildkite/scripts/hardware_ci/run-cpu-test.sh
@@ -49,6 +49,7 @@ function cpu_tests() {
# Run kernel tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/test_onednn.py"

# Run basic model test
@@ -72,12 +73,11 @@ function cpu_tests() {
pytest -x -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs"

# Note: disable it until supports V1
# Run AWQ test
# docker exec cpu-test-"$NUMA_NODE" bash -c "
# set -e
# VLLM_USE_V1=0 pytest -x -s -v \
# tests/quantization/test_ipex_quant.py"
# Run AWQ/GPTQ test
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -s -v \
tests/quantization/test_cpu_wna16.py"

# Run multi-lora tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
@@ -116,4 +116,4 @@ function cpu_tests() {

# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 2h bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
timeout 2.5h bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
2 changes: 1 addition & 1 deletion .buildkite/scripts/hardware_ci/run-xpu-test.sh
@@ -46,6 +46,6 @@ docker run \
pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
pytest -v -s v1/structured_output
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_shared_storage_connector.py
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_shared_storage_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py
pytest -v -s v1/test_serial_utils.py
'
@@ -17,7 +17,17 @@ wait_for_server() {
}

MODEL="deepseek-ai/DeepSeek-V2-lite"
BACKENDS=("deepep_high_throughput" "deepep_low_latency")

# Set BACKENDS based on platform
if command -v rocm-smi &> /dev/null || [[ -d /opt/rocm ]] || [[ -n "${ROCM_PATH:-}" ]]; then
# ROCm platform
BACKENDS=("allgather_reducescatter")
# Disable MOE padding for ROCm since it is causing eplb to fail
export VLLM_ROCM_MOE_PADDING=0
else
# Non-ROCm platform (CUDA/other)
BACKENDS=("deepep_high_throughput" "deepep_low_latency")
fi

cleanup() {
if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
@@ -1,10 +1,12 @@
#!/usr/bin/env bash
set -euxo pipefail

# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT]
# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT] [DATA_PARALLEL_SIZE] [TENSOR_PARALLEL_SIZE]
THRESHOLD=${1:-0.8}
NUM_Q=${2:-1319}
PORT=${3:-8020}
DATA_PARALLEL_SIZE=${4:-2}
TENSOR_PARALLEL_SIZE=${5:-2}
OUT_DIR=${OUT_DIR:-/tmp/vllm-scheduled}
mkdir -p "${OUT_DIR}"

@@ -17,7 +19,16 @@ wait_for_server() {
}

MODEL="QWen/Qwen3-30B-A3B-FP8"
BACKENDS=("deepep_high_throughput" "deepep_low_latency")
# Set BACKENDS based on platform
if command -v rocm-smi &> /dev/null || [[ -d /opt/rocm ]] || [[ -n "${ROCM_PATH:-}" ]]; then
# ROCm platform
BACKENDS=("allgather_reducescatter")
# Disable MOE padding for ROCm since it is causing eplb to fail
export VLLM_ROCM_MOE_PADDING=0
else
# Non-ROCm platform (CUDA/other)
BACKENDS=("deepep_high_throughput" "deepep_low_latency")
fi

cleanup() {
if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
@@ -36,8 +47,10 @@ for BACK in "${BACKENDS[@]}"; do
VLLM_ALL2ALL_BACKEND=$BACK \
vllm serve "$MODEL" \
--enforce-eager \
--tensor-parallel-size 2 \
--data-parallel-size 2 \
--enable-eplb \
--eplb-config '{"window_size":10, "step_interval":100, "num_redundant_experts":0, "log_balancedness":true}' \
--tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
--data-parallel-size ${DATA_PARALLEL_SIZE} \
--enable-expert-parallel \
--trust-remote-code \
--max-model-len 2048 \