Awesome-KV-Cache-Management
📢 New Benchmark Released (2025-02-18): "Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models [PDF] [Dataset]" — introducing NumericBench to assess LLMs' fundamental numerical abilities! 🚀
A Survey on Large Language Model Acceleration based on KV Cache Management [PDF]
Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2
1 Hong Kong Polytechnic University, 2 Hong Kong University of Science and Technology, 3 The Chinese University of Hong Kong, 4 Huazhong University of Science and Technology, 5 Nanyang Technological University.
This repository collects papers on KV cache management for LLM acceleration. The survey will be updated regularly. If you find it helpful for your work, please consider citing it.
@article{li2024surveylargelanguagemodel,
title={A Survey on Large Language Model Acceleration based on KV Cache Management},
author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
journal={arXiv preprint arXiv:2412.19442},
year={2024}
}
If you would like your paper to be included in this survey and repository, or to suggest any modifications, please feel free to send an email to [email protected] or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | Link | |
| 2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | Link | Link |
| 2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | Link | |
Dynamic Selection with Permanent Eviction (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | Link | |
| 2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
| 2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | Link | Link |
| 2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Dynamic Selection with Permanent Eviction | NeurIPS | Link | Link |
| 2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | Link | |
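Most entries in this category follow a common recipe: accumulate each cached token's attention mass as an importance score and permanently drop the lowest-scoring entries once the cache exceeds its budget. The Python sketch below illustrates only that generic recipe under assumed names (`HeavyHitterCache`, `budget`); it is not the algorithm of any specific paper above, most of which additionally protect a window of recent tokens.

```python
import numpy as np

class HeavyHitterCache:
    """Minimal sketch of score-based permanent KV eviction (illustrative only)."""

    def __init__(self, budget: int, head_dim: int):
        self.budget = budget                       # max number of cached tokens
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))
        self.scores = np.empty(0)                  # accumulated attention mass per token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        self.scores = np.append(self.scores, 0.0)

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Standard scaled dot-product attention over the cached tokens.
        logits = self.keys @ q / np.sqrt(self.keys.shape[1])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        self.scores += weights                     # importance = accumulated attention
        out = weights @ self.values
        # Permanently evict the lowest-scoring tokens once over budget.
        if len(self.scores) > self.budget:
            keep = np.argsort(self.scores)[-self.budget:]
            keep.sort()                            # preserve token order
            self.keys, self.values, self.scores = (
                self.keys[keep], self.values[keep], self.scores[keep])
        return out

cache = HeavyHitterCache(budget=4, head_dim=8)
for _ in range(6):
    k = v = np.random.randn(8)
    cache.append(k, v)
    _ = cache.attend(np.random.randn(8))
print(cache.keys.shape)                            # at most (4, 8): evicted entries are gone
```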
Dynamic Selection without Permanent Eviction (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
| 2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | Link | |
| 2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | Link | |
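In contrast to permanent eviction, the methods above keep the full cache (possibly offloaded or indexed) and dynamically select a small, query-dependent subset of tokens to attend to at each decoding step. The sketch below shows the simplest form of such query-aware top-k selection; the dot-product scoring and the `k_budget` parameter are illustrative assumptions, not any listed paper's retrieval scheme.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k_budget: int):
    """Attend only to the k_budget most relevant cached tokens for this query.

    Illustrative sketch of query-aware selection: nothing is evicted, the full
    cache stays available and a different subset may be chosen at each step.
    """
    scores = K @ q / np.sqrt(K.shape[1])              # relevance of every cached token
    idx = np.argpartition(scores, -k_budget)[-k_budget:]
    w = np.exp(scores[idx] - scores[idx].max())       # softmax over the selected subset
    w /= w.sum()
    return w @ V[idx]

K = np.random.randn(4096, 128)                        # the full KV cache is retained
V = np.random.randn(4096, 128)
q = np.random.randn(128)
out = topk_sparse_attention(q, K, V, k_budget=256)
print(out.shape)                                      # (128,)
```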
KV Cache Budget Allocation
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | Findings | Link | Link |
| 2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | Link | Link |
| 2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | Link | Link |
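Layer-wise allocation gives different layers different cache budgets instead of a uniform one, e.g., larger budgets for shallow layers whose attention tends to be spread more broadly. Below is a minimal sketch of a decaying per-layer budget schedule; the linear decay and the `min_ratio` parameter are illustrative assumptions rather than the policy of any listed paper.

```python
def layerwise_budgets(total_budget: int, num_layers: int, min_ratio: float = 0.2) -> list[int]:
    """Split a total KV-cache budget across layers with a linearly decaying share.

    Illustrative only: shallow layers get larger budgets, deep layers smaller ones,
    and the shares are rescaled so they sum to roughly `total_budget`.
    """
    # Raw per-layer weights decaying from 1.0 down to `min_ratio`.
    weights = [1.0 - (1.0 - min_ratio) * l / max(num_layers - 1, 1) for l in range(num_layers)]
    scale = total_budget / sum(weights)
    return [max(1, round(w * scale)) for w in weights]

# Example: 32 layers sharing roughly 4096 cached tokens in total.
budgets = layerwise_budgets(4096, 32)
print(budgets[:4], "...", budgets[-4:])
```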
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | Link | |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | Link | |
| 2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | Link | Link |
| 2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | Link | Link |
| 2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | Link | |
| 2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | Link | Link |
| 2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | Link | |
| 2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | Link | Link |
| 2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP | Link | Link |
| 2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | Link | |
| 2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | Link | Link |
| 2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | Link | Link |
| 2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | Link | Link |
| 2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | Link | Link |
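Fixed-precision methods store all K/V entries at one low bit-width. The snippet below sketches asymmetric per-token INT8 quantization and dequantization of a cached tensor, a common building block; the granularity and function names are illustrative, not the exact scheme of any paper above.

```python
import numpy as np

def quantize_per_token(x: np.ndarray, n_bits: int = 8):
    """Asymmetric per-token quantization of a (tokens, dim) KV tensor (illustrative)."""
    qmax = 2 ** n_bits - 1
    x_min = x.min(axis=1, keepdims=True)
    x_max = x.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / qmax
    scale = np.where(scale == 0, 1e-8, scale)          # avoid division by zero
    q = np.clip(np.round((x - x_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, x_min                              # stored instead of the fp values

def dequantize_per_token(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min

kv = np.random.randn(16, 128).astype(np.float32)        # 16 cached tokens, head_dim = 128
q, scale, zero = quantize_per_token(kv)
print("max abs error:", np.abs(dequantize_per_token(q, scale, zero) - kv).max())
```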
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | Link | |
| 2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | Link | |
| 2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | Link | Link |
| 2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | Link | Link |
| 2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | Link | Link |
| 2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | Link | Link |
| 2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | Link | Link |
| 2024 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | Link | Link |
| 2024 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | Link | Link |
| 2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | Link | Link |
| 2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | Link | Link |
| 2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
| 2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | Link | Link |
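A recurring outlier-redistribution idea is to rescale problematic activation channels and fold the inverse scale into an adjacent weight matrix, leaving the matrix product unchanged while making the activations easier to quantize. The NumPy sketch below only verifies that equivalence; the per-channel max scale is an illustrative simplification of the calibrated scales these papers use.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                        # an "outlier" activation channel
W = rng.normal(size=(8, 16))

# Per-channel smoothing scale (illustrative choice: channel-wise max magnitude).
s = np.abs(X).max(axis=0)
X_smooth = X / s                       # outliers are flattened before quantization
W_folded = W * s[:, None]              # ...and migrated into the weights

# The matrix product is mathematically unchanged.
print(np.allclose(X @ W, X_smooth @ W_folded))   # True
```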
KV Cache Low-rank Decomposition
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | Link | |
| 2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | Link | |
| 2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | Link | |
| 2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | Link | Link |
| 2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | Link | Link |
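These works exploit the approximately low-rank structure of cached keys and values, storing thin factors instead of the full matrices. The sketch below compresses a cached key matrix to rank r with a truncated SVD to show the general mechanism; real methods choose ranks from calibration data and often keep residual or landmark information as well.

```python
import numpy as np

def compress_lowrank(K: np.ndarray, rank: int):
    """Store a (tokens, dim) key matrix as two thin factors via truncated SVD (illustrative)."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (tokens, rank) per-token coefficients
    B = Vt[:rank]                     # (rank, dim) shared basis
    return A, B

K = np.random.randn(1024, 128)        # 1024 cached tokens, head_dim = 128
A, B = compress_lowrank(K, rank=32)
print("compression ratio:", K.size / (A.size + B.size))
print("relative error:", np.linalg.norm(K - A @ B) / np.linalg.norm(K))
```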
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | Link | Link |
Learned Low-rank Approximation (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | Link | Link |
| 2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | Link | |
Attention Grouping and Sharing
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | Link | |
| 2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | Link | Link |
| 2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | Link | |
| 2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | Link | |
| 2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Non-official Link |
| 2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Link |
| 2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | Link | |
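Intra-layer grouping (MQA/GQA and their variants) lets several query heads share one cached key/value head, shrinking the cache by the group factor. The PyTorch sketch below shows only the head bookkeeping at attention time, assuming the number of query heads is a multiple of the number of KV heads; it is a generic illustration, not any listed paper's design.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, num_q_heads, q_len, dim); k, v: (batch, num_kv_heads, kv_len, dim)."""
    group = q.shape[1] // k.shape[1]          # query heads per shared KV head
    k = k.repeat_interleave(group, dim=1)     # broadcast each KV head to its group
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 32, 4, 128)                # 32 query heads
k = torch.randn(1, 8, 16, 128)                # only 8 KV heads are cached
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)
print(out.shape)                              # torch.Size([1, 32, 4, 128])
```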
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | Link | Non-official Link |
| 2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | Link | Link |
| 2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | Link | Link |
| 2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | Link | |
| 2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | Link | |
| 2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | Link | |
| 2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | Link | |
| 2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | Link | Link |
| 2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | Link | |
| 2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architectures | arXiv | Link | Link |
| 2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architectures | ACL | Link | Link |
| 2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architectures | Findings | Link | |
| 2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architectures | arXiv | Link | Link |
Non-transformer Architecture
Adaptive Sequence Processing Architecture (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | Findings | Link | Link |
| 2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
| 2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | IOS Press | Link | |
| 2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | Link | Link |
| 2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | Link | |
System-level Optimization
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | Link | Link |
| 2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | Link | |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | Link | Link |
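PagedAttention-style memory management stores the KV cache in fixed-size physical blocks and keeps a per-sequence block table mapping logical token positions to those blocks, so memory is allocated on demand and returned without fragmentation. The sketch below covers only this block-table bookkeeping (no attention kernel) and is not vLLM's actual implementation; the names are illustrative.

```python
class BlockAllocator:
    """Minimal sketch of paged KV-cache bookkeeping (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> number of cached tokens

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or first token)
            table.append(self.free_blocks.pop())     # allocate a new physical block
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=4)
for _ in range(6):
    block, offset = alloc.append_token(seq_id=0)
print(alloc.block_tables[0], len(alloc.free_blocks))  # two blocks in use, six still free
```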
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | Link | Link |
| 2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | Link | |
| 2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | Link | Link |
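The prefix-aware designs and schedulers above exploit the fact that requests sharing a prompt prefix (for example, a common system prompt) can reuse the same cached KV entries, and that requests can be co-located or reordered to maximize such reuse. The sketch below uses a token-level trie to measure how many cached tokens a new request could reuse; it illustrates the idea only and is not any listed system's data structure.

```python
class PrefixTrie:
    """Token-level trie over cached prompts (illustrative sketch of prefix reuse)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def shared_prefix_len(self, tokens: list[int]) -> int:
        """How many leading tokens already have KV entries cached."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = PrefixTrie()
cache.insert([1, 2, 3, 4, 5])                  # an already-served request
print(cache.shared_prefix_len([1, 2, 3, 9]))   # 3 cached tokens can be reused
```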
Preemptive and Fairness-oriented Scheduling (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
Layer-specific and Hierarchical Scheduling (To Top👆🏻 )
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | Link | Link |
| 2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | Link | |
| 2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | Link | |
| 2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | Link | Link |
| 2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | Link | Link |
| 2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | Link | |
| 2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | Link | Link |
| 2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | Link | Link |
| 2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | Link | Link |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | Link | |
| 2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | Link | |
| 2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | Link | |
| 2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | Link | |
| 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | Link | Link |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | Link | |
| 2024 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | Link | |
| 2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | Link | |
| 2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | Link | |
| 2024 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | Link | |
| 2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | Link | |
| 2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | Link | |
| Year | Title | Type | Venue | Paper | Code |
|------|-------|------|-------|-------|------|
| 2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | Link | |
| 2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | Link | Link |
Please refer to our paper for detailed information on this section.