Request to Add CacheCraft: Related Work on Chunk-Aware KV Cache Reuse for RAG

skejriwal44 · web-flow · commit c69a0a0a6da1 · 2025-03-03T23:44:08.000+05:30
Thanks for this great list! We’d love to add CacheCraft, a chunk-aware KV cache reuse approach for RAG that minimizes redundant computation while preserving generation quality. Our work is concurrent with CacheBlend. The key differences are chunk-level reuse, selective recomputation planning, and optimizations designed for real-world production systems. CacheCraft has been accepted to SIGMOD 2025.

We’re also open-sourcing a vLLM-based extension soon. Results on real production RAG traces show strong efficiency gains. Recent works such as CacheFocus and EPIC build on related ideas, highlighting the growing relevance of this research direction.
diff --git a/README.md b/README.md
@@ -86,6 +86,7 @@ description: >-
 | [CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving](https://arxiv.org/pdf/2310.07240) | KV Cache compression | University of Chicage | sigcomm |
 | [SCOPE:OptimizingKey-Value Cache Compression in Long-context Generation](https://arxiv.org/pdf/2412.13649) | Separate handling of prefill and decoding KV Cache | SEU | arxiv 2024 |
 | [FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines](https://arxiv.org/pdf/2403.11421) | Heterogeneous pipelines | THU | arxiv 2024 |
+| [Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation](https://arxiv.org/pdf/2502.15734) | Approximate Chunked KV Reuse | Adobe Research | SIGMOD 2025 |