
Support NexusQuant KV cache compression for memory reduction #9267

@jagmarques

Description

KV cache memory is a bottleneck at long context lengths, especially for self-hosted deployments on consumer hardware. NexusQuant offers training-free 7–10x KV cache compression via E8 lattice quantization combined with attention-aware token eviction (up to 17x with additional token merging).

Integration points:

  • After prefill, compress the KV cache in-place
  • Use attention mask to exclude evicted tokens during generation
  • API: `with nexusquant_evict(model): model.generate(...)`
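Assuming the context-manager API above, the integration flow could look roughly like this minimal sketch. The `generate_step` hook and the body of `nexusquant_evict` are placeholders for illustration, not NexusQuant's or LocalAI's real internals:

```python
from contextlib import contextmanager

@contextmanager
def nexusquant_evict(model):
    """Hypothetical sketch: compress the KV cache after prefill and
    exclude evicted tokens from attention during generation."""
    original_step = model.generate_step  # assumed hook point, not a real LocalAI API
    def wrapped_step(*args, **kwargs):
        # 1. After prefill: quantize the KV cache in place (E8 lattice).
        # 2. Run attention-aware eviction and update the attention mask
        #    so evicted tokens are skipped during decoding.
        return original_step(*args, **kwargs)
    model.generate_step = wrapped_step
    try:
        yield model
    finally:
        model.generate_step = original_step  # restore the uncompressed path
```

Usage would then match the proposed API: `with nexusquant_evict(model): model.generate(...)`, with the original generation path restored on exit.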

Why this matters for LocalAI:
LocalAI users often run large models on hardware with limited VRAM. At 10x KV compression, a model that normally supports 8K context could handle 80K+ context in the same memory budget. This directly enables longer conversations, larger documents, and RAG over more chunks without GPU upgrades.
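To make the memory budget concrete, here is a back-of-envelope KV cache sizing calculation. The model dimensions are assumptions for a Mistral-7B-like architecture (32 layers, 8 KV heads with GQA, head dim 128, fp16), not figures from the issue:

```python
# Assumed dims for a Mistral-7B-like model with grouped-query attention.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
ctx = 8192

print(bytes_per_token)                    # 131072 bytes = 128 KiB per token
print(ctx * bytes_per_token / 1024**3)    # 1.0 GiB of KV cache at 8K context
# At 10x compression, 10x the context (80K+) fits in that same ~1 GiB.
print(10 * ctx * bytes_per_token / 10 / 1024**3)
```

Under these assumptions, an uncompressed 8K-context cache costs about 1 GiB, which is exactly the budget that 80K of 10x-compressed cache would occupy.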

Validated results:

  • Mistral-7B: 7x compression, -2.26% PPL
  • Llama-3-8B: 5.3x compression, -0.002% PPL
  • Training-free, no calibration data required

Would you be interested in exploring this as an optional compression backend? It could be exposed as a backend option (e.g., `kv_compression: nexusquant`) in the model YAML config. Happy to help with the integration.
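If exposed as suggested, the model YAML might look like this sketch. The `kv_compression` key is the proposal from this issue and not an existing LocalAI option; the model name and file are illustrative:

```yaml
# model.yaml — sketch only; kv_compression is a proposed key, not yet supported
name: mistral-7b
parameters:
  model: mistral-7b-instruct.gguf
kv_compression: nexusquant
```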
