
Fix loading KV quantization scale; Enable modelopt kv cache #4686

Open
wants to merge 22 commits into base: main
Conversation

@yundai424 (Collaborator) commented on Mar 23, 2025

Motivation

Currently, KV cache quantization is not usable because (1) layer.k_scale / layer.v_scale are never passed to the attention backend, and (2) the precomputed KV scales are never loaded from the checkpoint.

TODO: the KV cache scale is currently only supported in the flashinfer backend.

Modifications

  1. Add logic to load the precomputed KV scales when they are present in a modelopt checkpoint.
  2. Pass RadixAttention's k/v scales to the flashinfer backend (see the sketch below).
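The following is a minimal, self-contained sketch of the two changes, not the actual SGLang code paths: all names (RadixAttentionStub, load_kv_scales, dummy_backend, the checkpoint key layout) are hypothetical and only illustrate the idea of attaching k_scale/v_scale to the attention layer during weight loading and forwarding them to the backend call.

```python
import torch


class RadixAttentionStub:
    """Stand-in for an attention layer that carries optional KV scales."""

    def __init__(self) -> None:
        # A default scale of 1.0 means "no rescaling" when the checkpoint
        # does not ship precomputed KV-cache scales.
        self.k_scale = torch.tensor(1.0)
        self.v_scale = torch.tensor(1.0)


def load_kv_scales(layer: RadixAttentionStub,
                   weights: dict[str, torch.Tensor],
                   prefix: str) -> None:
    """Copy precomputed KV scales from the checkpoint, if present."""
    k_name, v_name = f"{prefix}.k_scale", f"{prefix}.v_scale"
    if k_name in weights:
        layer.k_scale = weights[k_name].float()
    if v_name in weights:
        layer.v_scale = weights[v_name].float()


def attention_forward(layer, q, k, v, backend):
    # Forward the layer's scales to the backend; only backends that
    # understand quantized KV caches (e.g. flashinfer) would use them.
    return backend(q, k, v, k_scale=layer.k_scale, v_scale=layer.v_scale)


if __name__ == "__main__":
    layer = RadixAttentionStub()
    # Hypothetical checkpoint entries produced by a modelopt FP8 export.
    ckpt = {
        "model.layers.0.self_attn.k_scale": torch.tensor(0.023),
        "model.layers.0.self_attn.v_scale": torch.tensor(0.031),
    }
    load_kv_scales(layer, ckpt, "model.layers.0.self_attn")

    def dummy_backend(q, k, v, k_scale, v_scale):
        # A real backend would use these scales to dequantize the FP8 KV cache.
        return torch.softmax(q @ (k * k_scale).T, dim=-1) @ (v * v_scale)

    q, k, v = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
    out = attention_forward(layer, q, k, v, dummy_backend)
    print(out.shape)
```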

Checklist

@yundai424 yundai424 marked this pull request as ready for review March 27, 2025 09:02
@yundai424 yundai424 changed the title [WIP] fix loading KV quantization scale; Enable modelopt kv cache Fix loading KV quantization scale; Enable modelopt kv cache Mar 27, 2025
@yundai424 yundai424 changed the title Fix loading KV quantization scale; Enable modelopt kv cache [WIP] Fix loading KV quantization scale; Enable modelopt kv cache Mar 27, 2025
@yundai424 yundai424 changed the title [WIP] Fix loading KV quantization scale; Enable modelopt kv cache Fix loading KV quantization scale; Enable modelopt kv cache Mar 28, 2025