cuda : fix multi-seq, quantized FA #14820


Closed
wants to merge 1 commit into from

Conversation

ggerganov
Member

target #14756

Relax the requirement for contiguously allocated K/V buffers in the quantized case.

I am not 100% sure this is the optimal solution in terms of memory usage, but at least the results are correct now.

@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jul 22, 2025
@ggerganov ggerganov mentioned this pull request Jul 22, 2025
23 tasks
@JohannesGaessler
Collaborator

I have a WIP change that fixes this by adding non-contiguous support to the dequantization kernels. There is still a bug somewhere; I'll try to open a PR this evening.

> I am not 100% sure this is the optimal solution in terms of memory usage, but at least the results are correct now.

This is more of an issue with kernel launch overhead: you're launching one kernel per sequence, and each kernel will have poor hardware utilization.

@ggerganov
Member Author

Replaced by #14822

@ggerganov ggerganov closed this Jul 23, 2025