The performance of small batch sizes with MoE models is not very good in the CUDA backend. There is a fast GEMV implementation, but it only works with bs=1; as the MUL_MAT_ID benchmark results below show, throughput drops sharply as soon as n > 1:

  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=1,k=7168):                14058 runs -    73.25 us/run - 234.88 MFLOP/run -   3.21 TFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=2,k=7168):                 2343 runs -   432.78 us/run - 469.76 MFLOP/run -   1.09 TFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=3,k=7168):                 1420 runs -   711.23 us/run - 704.64 MFLOP/run - 990.74 GFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2…
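The MFLOP/run and TFLOPS figures above follow directly from the shapes: each MUL_MAT_ID run routes the input through n_used of the n_mats expert matrices, and each selected expert performs an m×k by k×n matmul, so the work is roughly 2·m·n·k·n_used floating-point operations. A minimal sketch reproducing the first row's numbers (function names are illustrative, not from the benchmark code):

```python
# Sketch: recompute the MFLOP/run and TFLOPS figures reported for the
# MUL_MAT_ID rows above. Names here are illustrative, not llama.cpp APIs.

def mul_mat_id_flops(m: int, n: int, k: int, n_used: int) -> float:
    # Each of the n_used selected experts does an (m x k) @ (k x n) matmul:
    # 2 ops (multiply + add) per output element per k step.
    return 2.0 * m * n * k * n_used

def tflops(flops_per_run: float, us_per_run: float) -> float:
    # Convert FLOP per run and microseconds per run into TFLOPS.
    return flops_per_run / (us_per_run * 1e-6) / 1e12

# n=1 row: expect ~234.88 MFLOP/run and ~3.21 TFLOPS at 73.25 us/run.
f1 = mul_mat_id_flops(m=2048, n=1, k=7168, n_used=8)
print(f1 / 1e6, tflops(f1, 73.25))
```

Note that the FLOP count scales linearly with n while the runtime grows faster than linearly here (73.25 us at n=1 vs. 432.78 us at n=2), which is why the measured TFLOPS collapse once the fast bs=1 GEMV path no longer applies.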

Answer selected by tylike