KV cache memory is a bottleneck at long context lengths, especially for self-hosted deployments on consumer hardware. NexusQuant offers training-free 7–10x KV compression via E8 lattice quantization + attention-aware token eviction (up to 17x with token merging).
Integration points:
- After prefill, compress the KV cache in-place
- Use attention mask to exclude evicted tokens during generation
- API: `with nexusquant_evict(model): model.generate(...)`
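To make the proposed integration concrete, here is a minimal stub of the context-manager pattern; `nexusquant_evict` and everything inside it is a hypothetical sketch of the API shape, not the library's actual implementation:

```python
from contextlib import contextmanager

@contextmanager
def nexusquant_evict(model):
    """Hypothetical sketch: wrap model.generate so that the KV cache
    could be compressed after prefill and evicted tokens masked out."""
    original_generate = model.generate

    def patched_generate(*args, **kwargs):
        # The real library would compress the KV cache in-place here and
        # build an attention mask excluding evicted tokens before decoding.
        return original_generate(*args, **kwargs)

    model.generate = patched_generate
    try:
        yield model
    finally:
        # Restore the unpatched generate on exit.
        model.generate = original_generate
```

Used exactly as in the API line above (`with nexusquant_evict(model): model.generate(...)`); the compression and masking steps are placeholders.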
Why this matters for LocalAI:
LocalAI users often run large models on hardware with limited VRAM. At 10x KV compression, a model that normally supports 8K context could handle 80K+ context in the same memory budget. This directly enables longer conversations, larger documents, and RAG over more chunks without GPU upgrades.
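The arithmetic behind that claim can be sanity-checked; the dimensions below (32 layers, 8 KV heads, head dim 128, fp16) are assumed Mistral-7B-style GQA values, not figures stated in this proposal:

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store n_layers * n_kv_heads * head_dim elements per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

budget = 8192 * kv_bytes_per_token()   # KV memory for an 8K context
print(budget / 2**30)                  # 1.0 (GiB)

# Same budget with a 10x-compressed per-token cost:
tokens = budget / (kv_bytes_per_token() / 10)
print(int(tokens))                     # 81920, i.e. 80K context
```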
Validated results:
- Mistral-7B: 7x compression, -2.26% PPL
- Llama-3-8B: 5.3x compression, -0.002% PPL
- Training-free, no calibration data required
Library details:
- Install: `pip install nexusquant-kv`
Would you be interested in exploring this as an optional compression backend? It could be exposed as a backend option (e.g., kv_compression: nexusquant) in the model YAML config. Happy to help with the integration.
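For illustration, the option could look like this in a model YAML; the key placement and the model filename are a hypothetical sketch, not an existing LocalAI schema:

```yaml
# Hypothetical model config sketch; only kv_compression is the proposed new key.
name: mistral-7b
parameters:
  model: mistral-7b-instruct.Q4_K_M.gguf   # placeholder model file
kv_compression: nexusquant                 # omit for current uncompressed behavior
```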