Replies: 2 comments 3 replies
- Can you post the
- Some benchmarks on an AI Max 395 with 128 GB: main: n_kv_max = 524288, n_batch = 8192, n_ubatch = 4096, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
  On this config, don't miss setting --ubatch-size 4096; the default of 512 is way too small. For the server, don't we also need to use "--parallel 256" for batching?
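  For reference, a run matching that config would look roughly like the sketch below; the model path, prompt/generation lengths, and parallel levels are placeholders, and exact flag spellings can vary a bit between llama.cpp builds.

  ```sh
  # llama-batched-bench, approximating the reported config (placeholder model path)
  llama-batched-bench -m /path/to/model.gguf \
    -c 524288 -b 8192 -ub 4096 \
    -fa \
    -ngl 99 -t 16 \
    -npp 512 -ntg 128 \
    -npl 1,2,4,8,16,32   # parallel levels to sweep; -fa may be "-fa on" in newer builds

  # llama-server needs explicit slots for concurrent requests; the context is
  # shared across slots, so -c should be the per-request context times --parallel
  llama-server -m /path/to/model.gguf \
    -c 524288 -b 8192 -ub 4096 \
    -fa \
    -ngl 99 \
    --parallel 256
  ```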
- I am benching gpt-oss-120b with llama-batched-bench and I am seeing very nice speedups all the way up to a batch size of 32. However, I am not seeing those same improvements with llama-server. As a matter of fact, the aggregate token generation speed actually drops as I increase concurrency and only recovers at a batch size of 16 or higher. I am using the following tool to test concurrent requests: https://github.com/Yoosu-L/llmapibenchmark
  I am using LLAMA_SET_ROWS=1 for the split KV-cache. I am getting a warning about not using --swa-full, so I am not sure if that is related, but llama-batched-bench didn't require any changes in that department to see nice speedups. I am using the Vulkan backend on a gfx1151 Strix Halo APU. I am testing with the pro, radv, and amdvlk drivers on Linux with similar results. Any ideas?
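  For reference, the server launch I am comparing against looks roughly like this; it is only a sketch, the model path and port are placeholders, and I am not certain every flag matches the current build.

  ```sh
  # Vulkan llama-server with the set_rows KV-cache path opted in and a
  # full-size SWA cache; slot count matches the concurrency being tested
  LLAMA_SET_ROWS=1 llama-server \
    -m /path/to/gpt-oss-120b.gguf \
    -ngl 99 -fa \
    --swa-full \
    -c 65536 -ub 2048 \
    --parallel 32 \
    --host 127.0.0.1 --port 8080
  ```

  With --parallel, the total context is divided among the slots, so -c generally has to be scaled up to keep a usable per-request context.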