Llama-batched-bench shows nice speedsup increasing the number of parallel requests, but doesn't translate to llama-server #15290

Mushoz · 2025-08-13T08:48:33Z

Mushoz
Aug 13, 2025

I am benching gpt-oss-120b with llama-batched-bench and I am seeing very nice speedups all to way to batchsize of 32. However, I am not seeing those same improvements with llama-server. As a matter of fact, the aggregate token generation speed actually drops as I increase concurrency and only recovers at batchsize=16 or higher. I am using the following tool to test concurrent requests: https://github.com/Yoosu-L/llmapibenchmark

I am using LLAMA_SET_ROWS=1 for the split KV-cache. I AM getting a warning about not using swa-full, so I am not sure if that is related, but llama-batched bench didn't require any changes in that department to see nice speedups.

I am using the Vulkan backend on a gfx1151 Strix Halo API. I am testing with the pro, radv and amdvlk drivers on Linux with similar results. Any ideas?

ggerganov · 2025-08-13T09:02:07Z

ggerganov
Aug 13, 2025
Maintainer

Can you post the llama-batched-bench and llama-server commands that you used and the results that you got?

3 replies

Mushoz Aug 13, 2025
Author

Sure!

llama-batched-bench -m .cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_Q8_0_gpt-oss-120b-Q8_0-00001-of-00002.gguf -fa --no-mmap -npp 512 -ntg 128 -npl 1,2,4,8,16,32,64 -c 0

This gives me the following result:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	128	1	640	0.822	622.55	2.826	45.30	3.648	175.44
512	128	2	1280	1.478	692.87	9.729	26.31	11.207	114.21
512	128	4	2560	2.894	707.67	11.515	44.46	14.409	177.66
512	128	8	5120	5.871	697.71	14.072	72.77	19.943	256.74
512	128	16	10240	11.781	695.38	10.267	199.48	22.047	464.46
512	128	32	20480	23.669	692.22	11.161	367.00	34.829	588.01
512	128	64	40960	47.419	691.03	13.326	614.72	60.746	674.29

Then these are the two commands that I use to bench llama-server:

LLAMA_SET_ROWS=1 llama-server -m .cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_Q8_0_gpt-oss-120b-Q8_0-00001-of-00002.gguf -fa --no-mmap -ngl 999 --host 0.0.0.0 --port 9000 -c 0 -np 64 --slots --metrics --no-context-shift --jinja

./llmapibenchmark_linux_amd64 -apikey empty -base_url http://127.0.0.1:9000/v1 -concurrency 1,2,4,8,16,32,64 -prompt "Please write a minesweeper game using html, css and js in three separate files." -max_tokens 128

Which gives me the following result (Note, generation throughput is estimated, but matches closely with what I see in the server's logs. Prompt processing numbers can be off completely, probably due to prompt caching):

Concurrency	Generation Throughput (tokens/s)	Prompt Throughput (tokens/s)	Min TTFT (s)	Max TTFT (s)
1	36.86	3665.97	0.03	0.03
2	43.04	279.67	0.03	0.70
4	21.62	253.88	0.75	1.54
8	27.96	484.67	1.61	1.62
16	53.21	469.51	1.36	3.34
32	80.74	499.83	0.85	6.27
64	126.12	7315.49	0.70	0.86

I have also repeated the above setup, but with --swa-full added to the server command, but scaling remained the same as what it was before.

I am especially puzzled with discrepancies between how each benchmark scales. For example, I see a big drop in token generation performance at batchsize 2 with llama-batched-bench which recovers at batchsize 4, yet with llama-server batchsize=2 benches slightly higher than the single request, but it has a drop at 4 and 8 instead, which only recovers at 16. Could it be that both implementations are using different code pathways entirely? Or they batch requests differently between both implementations?

ggerganov Aug 13, 2025
Maintainer

The commands look OK and the llama-batched-bench numbers look reasonable.

I am not sure how the bench tool that you use works - there are many details that could have a large impact on the measurements and without knowing how it works, it's hard to say if these numbers make sense.

You can look into the recent work by @JohannesGaessler for benchmarking the parallel performance of llama-server and try to adapt the methodology for you use case:

But in general keep in mind that with MoE models we still haven't optimized the MUL_MAT_ID performance for small batch sizes larger than 1. There are large gains that are currently missing for MoE. Not sure what is the state of the Vulkan backend, but most likely it is also not very optimized for that, similar to the rest of the backends (e.g. CUDA, Metal). So whatever numbers you get for more than 1 concurrent request, these should get significantly improved in the future.

Mushoz Aug 13, 2025
Author

Using the server-bench.py script from @JohannesGaessler now. Using the following command. Hopefully it's okay how I am using it, as I am not familiar with the script :)

LLAMA_ARG_N_PARALLEL=1 LLAMA_ARG_PORT=9000 LLAMA_SET_ROWS=1 LLAMA_ARG_MODEL=/home/docker/.cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_Q8_0_gpt-oss-120b-Q8_0-00001-of-00002.gguf ./scripts/server-bench.py --path_server llama-server --n_prompts 2

I then doubled the LLAMA_ARG_N_PARALLEL environment variable and the n_prompts (I know ideally you pick one big n_prompts number to works for all values for LLAMA_ARG_N_PARALLEL to keep the prompts identical between each run, but that would take super long. Just wanted to do a quick check what the scaling looks like roughly) switch each time to get the following results:

1 thread/2 prompts:

Benchmark duration:                75.49 s
Request throughput:                0.03 requests/s = 1.59 requests/min
Total prompt length:               3095 tokens
Average prompt length:             1547.50 tokens
Average prompt latency:            3736.30 ms
Average prompt speed:              414.18 tokens/s
Total generated tokens:            2919
Average generation depth:          2257.49 tokens
Average total generation speed:    38.67 tokens/s
Average generation speed per slot: 38.67 tokens/s / slot

2 threads/4 prompts:

Benchmark duration:                318.20 s
Request throughput:                0.01 requests/s = 0.75 requests/min
Total prompt length:               6203 tokens
Average prompt length:             1550.75 tokens
Average prompt latency:            4626.65 ms
Average prompt speed:              335.18 tokens/s
Total generated tokens:            5720
Average generation depth:          2253.70 tokens
Average total generation speed:    17.98 tokens/s
Average generation speed per slot: 8.99 tokens/s / slot

4 threads/8 prompts:

Benchmark duration:                580.22 s
Request throughput:                0.01 requests/s = 0.83 requests/min
Total prompt length:               12209 tokens
Average prompt length:             1526.12 tokens
Average prompt latency:            8014.77 ms
Average prompt speed:              190.41 tokens/s
Total generated tokens:            11431
Average generation depth:          2241.18 tokens
Average total generation speed:    19.70 tokens/s
Average generation speed per slot: 4.93 tokens/s / slot

8 threads/ 16 prompts:

Benchmark duration:                797.89 s
Request throughput:                0.02 requests/s = 1.20 requests/min
Total prompt length:               24980 tokens
Average prompt length:             1561.25 tokens
Average prompt latency:            12282.26 ms
Average prompt speed:              127.11 tokens/s
Total generated tokens:            22544
Average generation depth:          2263.88 tokens
Average total generation speed:    28.25 tokens/s
Average generation speed per slot: 3.53 tokens/s / slot

16 threads/32 prompts:

Benchmark duration:                893.36 s
Request throughput:                0.04 requests/s = 2.15 requests/min
Total prompt length:               49647 tokens
Average prompt length:             1551.47 tokens
Average prompt latency:            22592.04 ms
Average prompt speed:              68.67 tokens/s
Total generated tokens:            46324
Average generation depth:          2274.77 tokens
Average total generation speed:    51.85 tokens/s
Average generation speed per slot: 3.24 tokens/s / slot

32 threads/64 prompts:

Benchmark duration:                1632.91 s
Request throughput:                0.04 requests/s = 2.35 requests/min
Total prompt length:               100881 tokens
Average prompt length:             1576.27 tokens
Average prompt latency:            42016.21 ms
Average prompt speed:              37.52 tokens/s
Total generated tokens:            93864
Average generation depth:          2323.87 tokens
Average total generation speed:    57.48 tokens/s
Average generation speed per slot: 1.80 tokens/s / slot

64 threads/128 prompts:

Benchmark duration:                2680.69 s
Request throughput:                0.05 requests/s = 2.86 requests/min
Total prompt length:               196350 tokens
Average prompt length:             1533.98 tokens
Average prompt latency:            73619.68 ms
Average prompt speed:              20.84 tokens/s
Total generated tokens:            192932
Average generation depth:          2303.71 tokens
Average total generation speed:    71.97 tokens/s
Average generation speed per slot: 1.12 tokens/s / slot

So as can be seen, there is some scaling, but it only kicks in at high-ish backsizes and is not even comparable to the scaling what I am seeing with llama-batched-bench. Why is that? Is the llama-batched-bench doing the batching differently? Or is it running a different codepath? Or is llama-batched-bench flawed, and does it always select identical inputs/outputs per batch, resulting in identical experts being selected at each layer, making it easier to batch those calculations as well? That's the only plausible explanation I can come up with?

Djip007 · 2025-10-07T23:33:47Z

Djip007
Oct 7, 2025

some bench on IA-MAX 395 with 128Go:
⬢ [zzzz@toolbx LLM]$ ./llama-batched-bench --ctx-size 524288 -fa on --no-mmap -b 8192 -ub 4096 -m ./openai_gpt-oss-120b-MXFP4.gguf -npp 512 -ntg 128 -npl 1,2,4,8,16,32,64,128,256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
build: 6586 (835b2b9) with cc (GCC) 15.2.1 20250924 (Red Hat 15.2.1-2) for x86_64-redhat-linux

main: n_kv_max = 524288, n_batch = 8192, n_ubatch = 4096, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	128	1	640	0.682	750.55	3.026	42.30	3.708	172.58
512	128	2	1280	1.161	881.73	4.669	54.83	5.830	219.55
512	128	4	2560	1.831	1118.41	6.086	84.13	7.917	323.34
512	128	8	5120	3.239	1264.73	8.500	120.47	11.738	436.18
512	128	16	10240	6.368	1286.46	13.676	149.75	20.044	510.87
512	128	32	20480	12.727	1287.30	24.977	163.99	37.705	543.17
512	128	64	40960	25.499	1285.05	44.604	183.66	70.104	584.28
512	128	128	81920	50.903	1287.46	72.243	226.79	123.146	665.23
512	128	256	163840	101.998	1285.05	124.538	263.12	226.535	723.24

on this config don't mise the --ubatch=4096 the default 512 is way to small.

For the server, did we not need to use "--parallele 256" for batching?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Llama-batched-bench shows nice speedsup increasing the number of parallel requests, but doesn't translate to llama-server #15290

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Llama-batched-bench shows nice speedsup increasing the number of parallel requests, but doesn't translate to llama-server #15290

Uh oh!

Uh oh!

Mushoz Aug 13, 2025

Replies: 2 comments · 3 replies

Uh oh!

ggerganov Aug 13, 2025 Maintainer

Uh oh!

Uh oh!

Mushoz Aug 13, 2025 Author

Uh oh!

ggerganov Aug 13, 2025 Maintainer

Uh oh!

Mushoz Aug 13, 2025 Author

Uh oh!

Djip007 Oct 7, 2025

Mushoz
Aug 13, 2025

Replies: 2 comments 3 replies

ggerganov
Aug 13, 2025
Maintainer

Mushoz Aug 13, 2025
Author

ggerganov Aug 13, 2025
Maintainer

Mushoz Aug 13, 2025
Author

Djip007
Oct 7, 2025