
Fix loading KV quantization scale; Enable modelopt kv cache #4686

Open
wants to merge 22 commits into base: main
Conversation

@yundai424 (Collaborator) commented on Mar 23, 2025

Motivation

Currently, KV cache quantization is not usable because (1) layer.k_scale / layer.v_scale are never passed to the attention backend, and (2) the precomputed KV scales are never loaded from the checkpoint.

TODO: the KV cache scale is currently only supported in the flashinfer backend.

Modifications

  1. Add logic to load the precomputed KV scales when they are present in a modelopt checkpoint.
  2. Pass RadixAttention's k/v scales to the flashinfer backend (see the sketch below).
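The following is a minimal, self-contained sketch of the two changes, not the actual SGLang code paths: all names (RadixAttentionStub, load_kv_scales, dummy_backend, the checkpoint key layout) are hypothetical and only illustrate the idea of attaching k_scale/v_scale to the attention layer during weight loading and forwarding them to the backend call.

```python
import torch


class RadixAttentionStub:
    """Stand-in for an attention layer that carries optional KV scales."""

    def __init__(self) -> None:
        # A default scale of 1.0 means "no rescaling" when the checkpoint
        # does not ship precomputed KV-cache scales.
        self.k_scale = torch.tensor(1.0)
        self.v_scale = torch.tensor(1.0)


def load_kv_scales(layer: RadixAttentionStub,
                   weights: dict[str, torch.Tensor],
                   prefix: str) -> None:
    """Copy precomputed KV scales from the checkpoint, if present."""
    k_name, v_name = f"{prefix}.k_scale", f"{prefix}.v_scale"
    if k_name in weights:
        layer.k_scale = weights[k_name].float()
    if v_name in weights:
        layer.v_scale = weights[v_name].float()


def attention_forward(layer, q, k, v, backend):
    # Forward the layer's scales to the backend; only backends that
    # understand quantized KV caches (e.g. flashinfer) would use them.
    return backend(q, k, v, k_scale=layer.k_scale, v_scale=layer.v_scale)


if __name__ == "__main__":
    layer = RadixAttentionStub()
    # Hypothetical checkpoint entries produced by a modelopt FP8 export.
    ckpt = {
        "model.layers.0.self_attn.k_scale": torch.tensor(0.023),
        "model.layers.0.self_attn.v_scale": torch.tensor(0.031),
    }
    load_kv_scales(layer, ckpt, "model.layers.0.self_attn")

    def dummy_backend(q, k, v, k_scale, v_scale):
        # A real backend would use these scales to dequantize the FP8 KV cache.
        return torch.softmax(q @ (k * k_scale).T, dim=-1) @ (v * v_scale)

    q, k, v = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
    out = attention_forward(layer, q, k, v, dummy_backend)
    print(out.shape)
```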

Checklist

@yundai424 yundai424 marked this pull request as ready for review March 27, 2025 09:02
@yundai424 yundai424 changed the title [WIP] fix loading KV quantization scale; Enable modelopt kv cache Fix loading KV quantization scale; Enable modelopt kv cache Mar 27, 2025
@yundai424 yundai424 changed the title Fix loading KV quantization scale; Enable modelopt kv cache [WIP] Fix loading KV quantization scale; Enable modelopt kv cache Mar 27, 2025
@yundai424 yundai424 changed the title [WIP] Fix loading KV quantization scale; Enable modelopt kv cache Fix loading KV quantization scale; Enable modelopt kv cache Mar 28, 2025