Awesome-KV-Cache-Management
📢 New Benchmark Released (2025-02-18): "Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models [PDF] [Dataset]" — introducing NumericBench to assess LLMs' fundamental numerical abilities! 🚀
A Survey on Large Language Model Acceleration based on KV Cache Management [PDF]
Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2
1 Hong Kong Polytechnic University, 2 Hong Kong University of Science and Technology, 3 The Chinese University of Hong Kong, 4 Huazhong University of Science and Technology, 5 Nanyang Technological University.
This repository collects papers on KV cache management for LLM acceleration. The survey will be updated regularly. If you find it helpful for your work, please consider citing it.
@article{li2024surveylargelanguagemodel,
title={A Survey on Large Language Model Acceleration based on KV Cache Management},
author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
journal={arXiv preprint arXiv:2412.19442},
year={2024}
}
If you would like your paper to be included in this survey and repository, or to suggest any modifications, please feel free to send an email to [email protected] or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | Link | |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | Link | Link |
| 2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | Link | |
Dynamic Selection with Permanent Eviction (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | Link | |
| 2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | Link | Link |
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Dynamic Selection with Permanent Eviction | NeurIPS | Link | Link |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | Link | |
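Most entries in this category follow a common recipe: accumulate each cached token's attention mass as an importance score and permanently drop the lowest-scoring entries once the cache exceeds its budget. The Python sketch below illustrates only that generic recipe under assumed names (`HeavyHitterCache`, `budget`); it is not the algorithm of any specific paper above, most of which additionally protect a window of recent tokens.

```python
import numpy as np

class HeavyHitterCache:
    """Minimal sketch of score-based permanent KV eviction (illustrative only)."""

    def __init__(self, budget: int, head_dim: int):
        self.budget = budget                       # max number of cached tokens
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))
        self.scores = np.empty(0)                  # accumulated attention mass per token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        self.scores = np.append(self.scores, 0.0)

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Standard scaled dot-product attention over the cached tokens.
        logits = self.keys @ q / np.sqrt(self.keys.shape[1])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        self.scores += weights                     # importance = accumulated attention
        out = weights @ self.values
        # Permanently evict the lowest-scoring tokens once over budget.
        if len(self.scores) > self.budget:
            keep = np.argsort(self.scores)[-self.budget:]
            keep.sort()                            # preserve token order
            self.keys, self.values, self.scores = (
                self.keys[keep], self.values[keep], self.scores[keep])
        return out

cache = HeavyHitterCache(budget=4, head_dim=8)
for _ in range(6):
    k = v = np.random.randn(8)
    cache.append(k, v)
    _ = cache.attend(np.random.randn(8))
print(cache.keys.shape)                            # at most (4, 8): evicted entries are gone
```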
Dynamic Selection without Permanent Eviction (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | Link | |
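In contrast to permanent eviction, the methods above keep the full cache (possibly offloaded or indexed) and dynamically select a small, query-dependent subset of tokens to attend to at each decoding step. The sketch below shows the simplest form of such query-aware top-k selection; the dot-product scoring and the `k_budget` parameter are illustrative assumptions, not any listed paper's retrieval scheme.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k_budget: int):
    """Attend only to the k_budget most relevant cached tokens for this query.

    Illustrative sketch of query-aware selection: nothing is evicted, the full
    cache stays available and a different subset may be chosen at each step.
    """
    scores = K @ q / np.sqrt(K.shape[1])              # relevance of every cached token
    idx = np.argpartition(scores, -k_budget)[-k_budget:]
    w = np.exp(scores[idx] - scores[idx].max())       # softmax over the selected subset
    w /= w.sum()
    return w @ V[idx]

K = np.random.randn(4096, 128)                        # the full KV cache is retained
V = np.random.randn(4096, 128)
q = np.random.randn(128)
out = topk_sparse_attention(q, K, V, k_budget=256)
print(out.shape)                                      # (128,)
```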
KV Cache Budget Allocation
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | Findings | Link | Link |
| 2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | Link | Link |
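Layer-wise allocation gives different layers different cache budgets instead of a uniform one, e.g., larger budgets for shallow layers whose attention tends to be spread more broadly. Below is a minimal sketch of a decaying per-layer budget schedule; the linear decay and the `min_ratio` parameter are illustrative assumptions rather than the policy of any listed paper.

```python
def layerwise_budgets(total_budget: int, num_layers: int, min_ratio: float = 0.2) -> list[int]:
    """Split a total KV-cache budget across layers with a linearly decaying share.

    Illustrative only: shallow layers get larger budgets, deep layers smaller ones,
    and the shares are rescaled so they sum to roughly `total_budget`.
    """
    # Raw per-layer weights decaying from 1.0 down to `min_ratio`.
    weights = [1.0 - (1.0 - min_ratio) * l / max(num_layers - 1, 1) for l in range(num_layers)]
    scale = total_budget / sum(weights)
    return [max(1, round(w * scale)) for w in weights]

# Example: 32 layers sharing roughly 4096 cached tokens in total.
budgets = layerwise_budgets(4096, 32)
print(budgets[:4], "...", budgets[-4:])
```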
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | Link | Link |
| 2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | Link | Link |
| 2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | Link | |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | Link | |
| 2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | Link | Link |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP | Link | Link |
| 2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | Link | |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | Link | Link |
| 2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | Link | Link |
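Fixed-precision methods store all K/V entries at one low bit-width. The snippet below sketches asymmetric per-token INT8 quantization and dequantization of a cached tensor, a common building block; the granularity and function names are illustrative, not the exact scheme of any paper above.

```python
import numpy as np

def quantize_per_token(x: np.ndarray, n_bits: int = 8):
    """Asymmetric per-token quantization of a (tokens, dim) KV tensor (illustrative)."""
    qmax = 2 ** n_bits - 1
    x_min = x.min(axis=1, keepdims=True)
    x_max = x.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / qmax
    scale = np.where(scale == 0, 1e-8, scale)          # avoid division by zero
    q = np.clip(np.round((x - x_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, x_min                              # stored instead of the fp values

def dequantize_per_token(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min

kv = np.random.randn(16, 128).astype(np.float32)        # 16 cached tokens, head_dim = 128
q, scale, zero = quantize_per_token(kv)
print("max abs error:", np.abs(dequantize_per_token(q, scale, zero) - kv).max())
```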
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | Link | |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | Link | Link |
| 2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | Link | Link |
| 2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | Link | Link |
| 2024 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | Link | Link |
| 2024 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | Link | Link |
| 2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | Link | Link |
| 2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | Link | Link |
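A recurring outlier-redistribution idea is to rescale problematic activation channels and fold the inverse scale into an adjacent weight matrix, leaving the matrix product unchanged while making the activations easier to quantize. The NumPy sketch below only verifies that equivalence; the per-channel max scale is an illustrative simplification of the calibrated scales these papers use.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                        # an "outlier" activation channel
W = rng.normal(size=(8, 16))

# Per-channel smoothing scale (illustrative choice: channel-wise max magnitude).
s = np.abs(X).max(axis=0)
X_smooth = X / s                       # outliers are flattened before quantization
W_folded = W * s[:, None]              # ...and migrated into the weights

# The matrix product is mathematically unchanged.
print(np.allclose(X @ W, X_smooth @ W_folded))   # True
```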
KV Cache Low-rank Decomposition
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | Link | |
| 2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | Link | |
| 2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | Link | |
| 2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | Link | Link |
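These works exploit the approximately low-rank structure of cached keys and values, storing thin factors instead of the full matrices. The sketch below compresses a cached key matrix to rank r with a truncated SVD to show the general mechanism; real methods choose ranks from calibration data and often keep residual or landmark information as well.

```python
import numpy as np

def compress_lowrank(K: np.ndarray, rank: int):
    """Store a (tokens, dim) key matrix as two thin factors via truncated SVD (illustrative)."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (tokens, rank) per-token coefficients
    B = Vt[:rank]                     # (rank, dim) shared basis
    return A, B

K = np.random.randn(1024, 128)        # 1024 cached tokens, head_dim = 128
A, B = compress_lowrank(K, rank=32)
print("compression ratio:", K.size / (A.size + B.size))
print("relative error:", np.linalg.norm(K - A @ B) / np.linalg.norm(K))
```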
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | Link | Link |
Learned Low-rank Approximation (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | Link | Link |
| 2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | Link | |
Attention Grouping and Sharing
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | Link | |
| 2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | Link | Link |
| 2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | Link | |
| 2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | Link | |
| 2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Non-official Link |
| 2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Link |
| 2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | Link | |
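Intra-layer grouping (MQA/GQA and their variants) lets several query heads share one cached key/value head, shrinking the cache by the group factor. The PyTorch sketch below shows only the head bookkeeping at attention time, assuming the number of query heads is a multiple of the number of KV heads; it is a generic illustration, not any listed paper's design.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, num_q_heads, q_len, dim); k, v: (batch, num_kv_heads, kv_len, dim)."""
    group = q.shape[1] // k.shape[1]          # query heads per shared KV head
    k = k.repeat_interleave(group, dim=1)     # broadcast each KV head to its group
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 32, 4, 128)                # 32 query heads
k = torch.randn(1, 8, 16, 128)                # only 8 KV heads are cached
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)
print(out.shape)                              # torch.Size([1, 32, 4, 128])
```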
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | Link | Non-official Link |
| 2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | Link | Link |
| 2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | Link | |
| 2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | Link | |
| 2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | Link | |
| 2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | Link | |
| 2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | Link | Link |
| 2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | Link | |
| 2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architectures | arXiv | Link | Link |
| 2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architectures | ACL | Link | Link |
| 2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architectures | Findings | Link | |
| 2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architectures | arXiv | Link | Link |
Non-transformer Architecture
Adaptive Sequence Processing Architecture (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | Findings | Link | Link |
| 2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | IOS Press | Link | |
| 2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | Link | Link |
| 2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | Link | |
System-level Optimization
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | Link | Link |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | Link | |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | Link | Link |
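PagedAttention-style memory management stores the KV cache in fixed-size physical blocks and keeps a per-sequence block table mapping logical token positions to those blocks, so memory is allocated on demand and returned without fragmentation. The sketch below covers only this block-table bookkeeping (no attention kernel) and is not vLLM's actual implementation; the names are illustrative.

```python
class BlockAllocator:
    """Minimal sketch of paged KV-cache bookkeeping (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> number of cached tokens

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or first token)
            table.append(self.free_blocks.pop())     # allocate a new physical block
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=4)
for _ in range(6):
    block, offset = alloc.append_token(seq_id=0)
print(alloc.block_tables[0], len(alloc.free_blocks))  # two blocks in use, six still free
```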
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | Link | Link |
| 2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | Link | |
| 2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | Link | Link |
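The prefix-aware designs and schedulers above exploit the fact that requests sharing a prompt prefix (for example, a common system prompt) can reuse the same cached KV entries, and that requests can be co-located or reordered to maximize such reuse. The sketch below uses a token-level trie to measure how many cached tokens a new request could reuse; it illustrates the idea only and is not any listed system's data structure.

```python
class PrefixTrie:
    """Token-level trie over cached prompts (illustrative sketch of prefix reuse)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def shared_prefix_len(self, tokens: list[int]) -> int:
        """How many leading tokens already have KV entries cached."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = PrefixTrie()
cache.insert([1, 2, 3, 4, 5])                  # an already-served request
print(cache.shared_prefix_len([1, 2, 3, 9]))   # 3 cached tokens can be reused
```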
Preemptive and Fairness-oriented Scheduling (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
Layer-specific and Hierarchical Scheduling (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | Link | Link |
| 2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | Link | |
| 2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | Link | |
| 2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | Link | Link |
| 2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | Link | Link |
| 2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | Link | Link |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | Link | Link |
| 2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | Link | Link |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | Link | |
| 2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | Link | |
| 2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | Link | |
| 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | Link | |
| 2024 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | Link | |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | Link | |
| 2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | Link | |
| 2024 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | Link | |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | Link | |
| 2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | Link | Link |
Please refer to our paper for detailed information on this section.