cuda : fix multi-seq, quantized FA #14820


Closed
wants to merge 1 commit into from

Conversation

ggerganov
Member

target #14756

Relax the requirement for contiguously allocated K/V buffers in the quantized case.

I am not 100% sure this is the optimal solution in terms of memory usage, but at least the results are correct now.

@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jul 22, 2025
@ggerganov ggerganov mentioned this pull request Jul 22, 2025
23 tasks
@JohannesGaessler
Collaborator

I have a WIP change that fixes this by adding non-contiguous support to the dequantization kernels. There is still a bug somewhere; I'll try to open a PR this evening.

> I am not 100% sure this is the optimal solution in terms of memory usage, but at least the results are correct now.

This is more of an issue with kernel launch overhead: you're launching one kernel per sequence, and each kernel will have poor hardware utilization.

@ggerganov
Member Author

Replaced by #14822

@ggerganov ggerganov closed this Jul 23, 2025