Summary of some awesome works for optimizing LLM inference
This summary includes three parts:
- some repositories that you can follow
- some representative researchers and labs that you can follow
- some important works in different research directions
For example, LLMSys-PaperList contains many excellent articles and is actively updated (which I believe is the most important quality for a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.
Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.
The blog "Large Transformer Model Inference Optimization" helped me a lot at the beginning.
The post OpenAI Keynote on Building Scalable AI Infrastructure seems to offer leading guidance.
Follow others' research, and find your own ideas.
It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people.
If you have a different opinion, please feel free to reach out to me through an issue.
In no particular order!!
Damn, I'm bad at remembering foreign names, so please forgive any omissions or misspellings.
Zhihao JIA: FlexFlow and other impressive works, an important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive works, an important role in machine learning systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML, including sparsity and quantization. Btw, the class TinyML and Efficient Deep Learning Computing is highly recommended, affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous for efficient MLSys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, etc., affiliated with UCB
SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh
IPADS: focuses more on PURE systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU
Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important MLSys works at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng MIAO: SpotServe, SpecInfer, HET, etc.
Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database research combined with MLSys, affiliated with HKUST
Lei CHEN: database research combined with MLSys; many papers, so I recommend focusing on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: works in systems and MLSys, affiliated with HKUST
I hope to summarize these impressive works based on their research directions.
But my summary is surely not comprehensive enough, and I am looking forward to your additions.
Perhaps someone should write a detailed survey.
Periodically checking the "cited by" lists of the papers marked with ⭐ will be helpful.
Paragraphs marked with 💡 are still imperfect.
- ⭐ Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models: evaluations help you find the bottleneck
- ⭐ Full Stack Optimization of Transformer Inference: a Survey: a survey by UCB
- ⭐ Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: worth a read
- ⭐ Deep Learning Workload Scheduling in GPU Datacenters: A Survey: survey for GPU Datacenters DL Workload Scheduling
- ⭐ Towards Efficient and Reliable LLM Serving: A Real-World Workload Study: a benchmark for LLM serving
- ⭐ LLM Inference Unveiled: Survey and Roofline Model Insights: both survey and analysis
- A Survey of Resource-Efficient LLM and Multimodal Foundation Models: worth reading
- Training and Serving System of Foundation Models: A Comprehensive Survey
- Model Compression and Efficient Inference for Large Language Models: A Survey
- ⭐ Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
- ⭐ A Survey on Efficient Inference for Large Language Models: worth reading
- Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
- ⭐ Navigating Challenges and Technical Debt in Large Language Models Deployment: important
- The CAP Principle for LLM Serving: another angle
- Demystifying Data Management for Large Language Models: talking about databases in LLMs, by Xupeng MIAO, accepted by SIGMOD'24
- Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI: with code
- A Survey on Mixture of Experts
- Analyzing LLM performance: The impact of high-bandwidth memory on model inference: analysis of inference
- Inference Optimization of Foundation Models on AI Accelerators
- LLM Inference Serving: Survey of Recent Advances and Opportunities: newest
- Contemporary Model Compression on Large Language Models Inference: survey in model compression
- ⭐ Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning: brings insights for MLSys
- Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- ⭐ A Survey on Inference Optimization Techniques for Mixture of Experts Models: a survey on MoE models
- Deploying Foundation Model Powered Agent Services: A Survey: survey for AI agent service
- A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey: survey on PEFT
Making useful benchmarks or evaluations is helpful.
- MLPerf Inference Benchmark: inference GitHub repo, a well-known benchmark
- llmperf: evaluate both performance and correctness, but based on Ray
- The Importance of Workload Choice in Evaluating LLM Inference Systems: important angles in LLM inference systems
- Vidur: A Large-Scale Simulation Framework For LLM Inference: test the performance of LLM inference
- Metron: Holistic Performance Evaluation Framework for LLM Inference Systems: an evaluation framework
- LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale: a simulator
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators: inference + hardware
- Towards Efficient Large Multimodal Model Serving: a survey on multimodal serving, and a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference: a performance evaluation framework, can be used to estimate the time cost
- Predicting LLM Inference Latency: A Roofline-Driven ML Method: predict inference performance based on the Roofline model
- GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments: a work for predicting LLMSys performance
- TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems: a simulator providing some performance analysis
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning: deepseek: mla + moe
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs: moe training with lower-specification hardware
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf
prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding
Both frameworks use parallel decoding, and deserve more detailed research.
There are some interesting papers about parallel decoding.
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding: how to make it auto-parallel?
In fact, I'm not so familiar with this topic. But perhaps OpenAI o1 used this...
Spend more time on inference than on pre-training.
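To make the idea concrete, here is a minimal best-of-N repeated-sampling sketch (the core pattern behind works like Large Language Monkeys below); `generate` and `score` are hypothetical stand-ins for a real LLM and a verifier/reward model.

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled completion from an LLM.
    return f"candidate-{random.randint(0, 9)}"

def score(prompt: str, answer: str) -> float:
    # Hypothetical stand-in for a verifier / reward model.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    # Spend extra inference-time compute: draw n samples, keep the best-scored one.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

print(best_of_n("What is 17 * 24?"))
```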
- ⭐ Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: Starter material, apply repeated sampling
- ⭐ Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters: Starter material, scaling LLM Test-Time to improve accuracy
- Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation: it seems few people have explored the efficiency of CoT; the two-stage method gives me some thoughts
- Fast Best-of-N Decoding via Speculative Rejection: optimize alignment in inference, accepted by NIPS'24
- S*: Test Time Scaling for Code Generation: perhaps can do some acceleration on Test Time Scaling
- Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately: manage the CoT
This topic is about GPT-o1, aka the strawberry.
- ⭐ Reverse engineering OpenAI’s o1: a leading blog for introduction in OpenAI’s o1
- ⭐ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: base work
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models: an improvement based on CoT
- Large Language Model Guided Tree-of-Thought: also a ToT
- Let's Verify Step by Step: verify by step can be helpful
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: what is Language Agent Tree Search (LATS)? accepted by ICML'24
- Critique-out-Loud Reward Models
- Generative Verifiers: Reward Modeling as Next-Token Prediction: a verifier, by DeepMind
Also known as speculative sampling, or model collaboration.
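Before the paper list, a toy sketch of the draft-then-verify loop with the standard accept/reject rule. It is a simplification: real systems verify all drafted positions in one batched forward pass and sample an extra token when everything is accepted; `draft_dist` and `target_dist` are hypothetical stand-ins for the two models.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_dist(prefix):
    # Hypothetical small draft model: returns a next-token distribution.
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def target_dist(prefix):
    # Hypothetical large target model: returns a next-token distribution.
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def speculative_step(prefix, k=4):
    """Draft k tokens with the cheap model, then verify them with the
    target model using the accept/reject rule of speculative sampling."""
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    accepted = []
    for tok, q in zip(drafted, q_dists):
        p = target_dist(prefix + accepted)           # target dist at this position
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                     # accept: output still follows the target dist
        else:
            residual = np.maximum(p - q, 0.0)        # reject: resample from the residual dist
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            break
    return accepted

print(speculative_step([1, 2, 3]))
```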
- ⭐ Accelerating Large Language Model Decoding with Speculative Sampling: the opening work of speculative decoding, by DeepMind
- ⭐ Fast inference from transformers via speculative decoding: work from the same period as the one above, by Google, accepted by ICML'23
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification: paper under guidance of Zhihao JIA, use Tree decoding and a set of draft models
- LLMCad: Fast and Scalable On-device Large Language Model Inference: paper under guidance of Xin JIN, speculative decoding for on-device LLM inference based on tree decoding and other optimizations
- Speculative Decoding with Big Little Decoder: similar to speculative decoding, accepted in NIPS'23
- Online Speculative Decoding: update draft model online
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding: the trade-off analysis deserves a read
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models: analysis of combining spec decoding with batching
- REST: Retrieval-Based Speculative Decoding: use retrieval for spec decoding, some familiar names in the authors list
- Cascade Speculative Drafting for Even Faster LLM Inference: by UIUC
- Multi-Candidate Speculative Decoding: multiple draft models
- ⭐ Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding: survey for Speculative Decoding
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding: a work with Yang YOU's name
- Decoding Speculative Decoding: provide some insight into the selection of draft models
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting: perhaps tree speculative decoding?
- ⭐ Speculative Streaming: Fast LLM Inference without Auxiliary Models: a promising method for speculative decoding
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding: accelerating spec decoding
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens: accelerate spec decoding with Fusing all tokens
- Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding: using several SSMs, adaptive SSM prediction length, pipelining SSM decode and LLM verify
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
- Accelerating LLM Inference with Staged Speculative Decoding: token tree and a second stage of speculative decoding
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding: combine KV cache with spec decoding
- EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models: algorithm optimization in spec decoding
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices: any difference with specinfer?
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput: model the speculative decoding length
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding: spec decoding for long-context
- QSpec: Speculative Decoding with Complementary Quantization Schemes: spec decoding with quantization, a novel A+B
- Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement: optimization on Medusa
- The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation: use learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context
- EdgeLLM: Fast On-device LLM Inference with Speculative Decoding: seems an extended work of LLMCad
- AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding: a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints
- SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding: dynamically adjusts speculative strategies according to real-time request loads and system configurations
- ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts: combining multi-level speculative decoding with MXFP4 quantized drafts, simple but it works
- SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models: using multiple heterogeneous SSMs with a learning-based algorithm for SSM selection, request decomposition method to minimize batching overhead during LLM verification, pipelining speculation and verification phases on GPU
- ⭐ SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning: targets reasoning models and CoT, under guidance of Zhihao JIA; maybe refer to multi-agent?
- SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models: multi-level speculative decoding, under guidance of Jidong ZHAI
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding: use both LLM and SLM
- Adaptive Skeleton Graph Decoding: successor of Skeleton-of-Thought
Some knowledge about data parallelism, tensor (model) parallelism, and pipeline parallelism will help in this track.
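A quick numpy sketch of why tensor (model) parallelism works for a single linear layer: column-parallel shards are gathered, row-parallel partial sums are all-reduced. This is illustrative only, with no real devices or communication.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))        # a batch of activations
W = rng.normal(size=(16, 32))       # full weight of one linear layer

# Column-parallel: each of 4 "devices" holds a slice of W's output columns.
shards = np.split(W, 4, axis=1)
partial_outputs = [x @ w for w in shards]       # computed independently per device
y_tp = np.concatenate(partial_outputs, axis=1)  # gather the results (all-gather)
assert np.allclose(y_tp, x @ W)

# Row-parallel: shard W's input rows and the activations; sum partial results (all-reduce).
w_rows = np.split(W, 4, axis=0)
x_cols = np.split(x, 4, axis=1)
y_rp = sum(xc @ wr for xc, wr in zip(x_cols, w_rows))
assert np.allclose(y_rp, x @ W)

print("both sharded results match the full matmul")
```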
- ⭐ Efficiently Scaling Transformer Inference: use model parallelism to accelerate inference, by Google, in MLSys'23
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment: a distributed inference engine that supports asymmetric partitioning of the inference computation
- InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding: Efficient Long-sequence training
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference: accepted by PPoPP'24
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: full-stack approach of LLM training
- DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers: sequence parallel by Yang YOU
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism: Elastic Sequence Parallelism?
- GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism: this could be potential in inference
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models: pipeline parallelism
- QUART: Latency-Aware FaaS System for Pipelining Large Model Inference: pipeline in serving and fast expanding
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: optimize sequence parallel
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts: optimize sequence parallel
- ⭐ PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation: pipeline parallelism and speculation, accepted by SC'24
- HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment: algorithmic analysis of resource allocation, parallel strategy, and KV transfer in a disaggregated LLM system
- ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput: explores design spaces to suggest architectures that meet the requirements of both vendors and users
- Seesaw: High-throughput LLM Inference via Model Re-sharding: dynamic model re-sharding, facilitates the dynamic reconfiguration of parallelization strategies across prefill-decode stages, accepted by MLSYS'25
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training: fill the bubbles with other GPU workload
- ⭐ gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling: fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens, to balance the pipeline stages in PP
- Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: Sequence Pipeline Parallelism (SPP) to reduce time-to-first-token by pipelining prefill chunks, and KV-Cache Parallelism (KVP) to lower time-per-output-token by distributing decoding across servers
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models: overlap comm with comp, similar to Liger
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning: accepted by ASPLOS'24
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives: many work about overlap in LLM, accepted by ASPLOS'24
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion: Fine-grained decomposition, perhaps provide some experiment result
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference: modify the model design for fast decoding, based on comm-comp overlapping
- NanoFlow: Towards Optimal Large Language Model Serving Throughput: overlapping based on nano-batches, with some interesting engineering implementation
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping: overlapping, provided by Deepspeed team
- PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving: overlap communication with model-weights/KV-cache prefetch
- Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning: use compilation to schedule overlap, accepted by ASPLOS'25
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives: TileLink to efficiently generate overlapped kernels for LLMs using tile-centric primitives and mappings, accepted by MLSYS'25
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation: a novel signaling mechanism to identify tile-wise data dependency without interrupting the computation process, and reorders data to contiguous addresses, enabling communication by simply calling NCCL APIs
Here I ignore some of the earliest papers and focus on the latest work optimizing this.
- ⭐ Seesaw: High-throughput LLM Inference via Model Re-sharding: dynamic model re-sharding to facilitate the dynamic reconfiguration of parallelization strategies across stages, reducing the overhead caused by frequent stage transitions (seems like elastic scheduling)
- DynamicAttention: Dynamic KV Cache for Disaggregate LLM Inference: DynamicAttention, it allocates a continuous virtual GPU memory space at startup, but does not actually allocate physical GPU memory?
- ⭐ semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage: disaggregated computation and unified storage, a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases
- ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads: requests grouping, disaggregation and resource scheduling
- Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture: dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics
- DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving: arbitrarily splits each request at any token boundary into at most two cooperating segments, then uses a two-level scheduling framework to balance micro-request load across unified GPU instances
An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computing.
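For reference, a small numpy sketch of 2:4 (N:M) semi-structured pruning, the pattern that Sparse Tensor Cores accelerate; the simple magnitude-based selection here is only a stand-in for real pruning pipelines.

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Apply 2:4 semi-structured sparsity: in every group of 4 consecutive
    weights along the last axis, keep the 2 with the largest magnitude."""
    groups = w.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude weights in each group get zeroed.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.default_rng(0).normal(size=(4, 8))
w_sparse = prune_2_4(w)
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) <= 2).all()
print(w_sparse)
```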
- ⭐ Accelerating Sparse Deep Neural Networks: use N:M sparsity to fully utilize the hardware for acceleration, by Nvidia
- ⭐ Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time: interesting paper on using sparsity, under guidance of Tri DAO and Ce ZHANG, accepted in ICML'23
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism: accepted by PPoPP'23
- ⭐ PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation: a novel way to deal with dynamic sparsity, may be used for GNN and MoE, accepted by SOSP'23
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving: seems a follow-up work of Deja Vu, also focuses on KV-Cache
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference: sparsity in FFN
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models: a simple and effective sparsification method named "ProSparse"
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters: work for PowerInfer
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations: pruning for LLM
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention: inference framework based on sparse attention, by Microsoft
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models: use ReLU to improve sparsity, just like PowerInfer
- CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation: algorithm optimization that can utilize sparsity to accelerate inference
- Star Attention: Efficient LLM Inference over Long Sequences: a two-phase block-sparse approximation
- Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries: use sparse coding over universal dictionaries to compress the KV cache, which is novel
- SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters: algorithm to replace a layer with the previous adjacent layer and recovery parameters (based on fine-tuning), to decrease memory overhead
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking: accepted by MLSys'25
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs: Tensor-Core-Aware Bitmap Encoding (TCA-BME) and a sparse GEMM kernel, making unstructured pruning's theoretical advantages translate into practical performance gains, accepted by EuroSys'25
- Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores: accepted by EuroSys'25
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention: efficient long-context LLM serving with unified block sparse attention, up to 3.3x faster decoding than TensorRT-LLM, accepted by MLSys'25
Low-precision for memory and computing efficiency.
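As a reference point, a minimal sketch of symmetric per-group weight-only INT8 quantization; production kernels (e.g., in the AWQ/QServe line of work below) pack bits and fuse dequantization into the GEMM, which this sketch does not attempt.

```python
import numpy as np

def quantize_int8(w: np.ndarray, group_size: int = 64):
    """Symmetric per-group weight-only INT8 quantization (simplified sketch)."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0   # one scale per group
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(0).normal(size=(128, 128)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s, w.shape)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```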
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- ⭐ LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: by UW
- ⭐ SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models: paper under guidance of Song HAN
- ⭐ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration: paper under guidance of Song HAN
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving: paper under guidance of Tianqi CHEN, quantization is not important, designing how to quantify is important, in review of MLSys'24
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
- Understanding the Impact of Post-Training Quantization on Large Language Models: tech report will help
- ⭐ LLM-FP4: 4-Bit Floating-Point Quantized Transformers: by HKUST, accepted in EMNLP'23
- ⭐ Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization: by SJTU, accepted in DAC'24
- INT4 Weight + FP8 KV-Cache: optimization for LLM inference: INT4 weight + FP8 KV-Cache + continuous batching
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization: quant KV cache
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference: simple and crude optimization work
- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization: for heterogeneous clusters and adaptive quantization, under guidance of Chuan WU, accepted by PPoPP'24 (poster)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact: use pivot token
- QAQ: Quality Adaptive Quantization for LLM KV Cache
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving: quantization in inference, under guidance of Song HAN
- Does compressing activations help model parallel training?: analysis of compression (including pruning and quantization) in MP training, accepted by MLSys'24
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression: compress KV cache with quantization
- Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs: with targeted activate function
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design: FPx quantization, accepted by ATC'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: combine quantization with MoE
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference: apply quantization and Maximum Inner-Product Search for KV Cache compression
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs: provide efficient kernels for lookup quantization
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation: a computation optimization for Low-Precision
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs: a computation optimization for 6-bit LLM
- Mixture of Experts with Mixture of Precisions for Tuning Quality of Service: quantization on MoE models
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference: compress the KV Cache
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models: quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents
- Progressive Mixed-Precision Decoding for Efficient LLM Inference: gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers
- COMET: Towards Practical W4A4KV4 LLMs Serving: provides a quantization algorithm, quantization kernel, and SM schedule method
- MixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction: quantization with outliers, optimization on AWQ, accepted by SC'24
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference: low-bit compression to accelerate communication
- Unifying KV Cache Compression for Large Language Models with LeanKV: combine quantization and sparsity to compress the KV cache
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design: mix quantization, effectively assigning the larger bit-width to output features that need it most to achieve good accuracy with low memory consumption
- KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference: KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference
- HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference: quantization to decrease kvc transfer overhead in disaggregation and eliminate kv dequantization
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models: Mixed-precision Auto-Regressive LINear kernels, accepted by PPoPP'25
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators: augments highly quantized MoEs with a mixture of low-rank compensators, provide 3-bit tensorcore kernels, accepted by MLSYS'25
- PacQ: A SIMT Microarchitecture for Efficient Dataflow in Hyper-asymmetric GEMMs: accelerator design, but may be helpful
- Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference: applies mixed-precision quantization to the key-value (KV) cache at token/chunk granularity
- Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization: employs an online-offline hybrid approach, setting outlier thresholds offline, which are then used to determine the quantization scale online
- SQuat: Subspace-orthogonal KV Cache Quantization: a more efficient quantization algorithm(?)
- Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving: support Arbitrary Low-Precision computation with high performance
Perhaps the most important way to improve throughput in LLM inference.
The blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.
Update 2023/12/12: I'd like to use Continuous Batching in place of the Dynamic Batching I used before; the name Dynamic Batching is more commonly used in Triton.
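A toy sketch of the continuous (iteration-level) batching idea: admission and eviction happen at every decoding step instead of once per batch, so short requests never wait for the longest one; request lengths here are random placeholders.

```python
import random
from collections import deque

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(1, 6)} for i in range(8))
running, MAX_BATCH, step = [], 4, 0

# Continuous batching: refill freed slots and evict finished requests every step.
while waiting or running:
    while waiting and len(running) < MAX_BATCH:      # admit new requests immediately
        running.append(waiting.popleft())
    for req in running:                              # one decode step for the whole batch
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished requests {finished}")
```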
- ⭐ Orca: A Distributed Serving System for Transformer-Based Generative Models: continuous batch processing without redundant computing, accepted in OSDI'22
- Fast Distributed Inference Serving for Large Language Models: considering Job Completion Time(JCT) in LLM serving, paper under guidance of Xin JIN
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline: schedule based on response length prediction by LLM, paper under guidance of Yang YOU
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput: idea similar to above, by Harvard University
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills: chunking the prefill phase and reducing pipeline bubbles, by MSR India
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference: accepted by HiPC'23
- Handling heavy-tailed input of transformer inference on GPUs: accepted by ICS'22
- CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system: Some form of inference service
- TCB: Accelerating Transformer Inference Services with Request Concatenation: perhaps similar to ByteTransformer, accepted by ICPP'22
- Fairness in Serving Large Language Models: under guidance of Ion Stoica, accepted by OSDI'24
- Characterizing and understanding deep neural network batching systems on GPUs: benchmarking is important
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts: think about the memory access of KV cache
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve: follow-up work of sarathi
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction: predict length
- LiveMind: Low-latency Large Language Models with Simultaneous Inference: perform inferences with incomplete prompts, to take advantage of streaming prompt
- A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length: theoretical analysis of latency
- ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models: seems similar to ORCA or bytetransformer?
- BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching: optimization on ORCA, dynamic re-batching
- EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving: A fusion monster with a variety of optimization techniques
- ⭐ AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality: what's Redundancy
- Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching: formalize as an optimization problem and adjust the batch size based on this
- Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration: fine-grained chunked prefill with decode, but what is SM-masked stream?
This part includes some impressive works optimizing LLM computation by observing the underlying computing properties, such as FlashAttention and its successors.
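A small numpy sketch of the online-softmax trick that FlashAttention and Flash-Decoding build on: attention for one query is computed block by block over K/V with a running max and denominator, never materializing the full score vector. It is a single-query, CPU-side illustration, not a kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, block = 16, 128, 32
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Online-softmax attention over K/V blocks with running max m and denominator l.
m, l, o = -np.inf, 0.0, np.zeros(d)
for start in range(0, n, block):
    s = K[start:start + block] @ q                 # scores for this block only
    m_new = max(m, s.max())
    corr = np.exp(m - m_new)                       # rescale previously accumulated partials
    p = np.exp(s - m_new)
    l = l * corr + p.sum()
    o = o * corr + p @ V[start:start + block]
    m = m_new
out_blocked = o / l

# Reference: standard softmax attention computed in one shot.
s = K @ q
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(out_blocked, w @ V)
print("blocked attention matches the reference")
```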
- ⭐ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: one of the most important work these years, both simple and easy to use, by Tri DAO
- ⭐ FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning: you'd better not ignore it
- ⭐ Flash-Decoding for long-context inference: you'd better not ignore it, too
- ⭐ Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: successor to FlashAttention in inference, accepted by VLDB'24
- ⭐ FlashDecoding++: Faster Large Language Model Inference on GPUs: worth reading, a FlashDecoding follow-up
- SubGen: Token Generation in Sublinear Time and Memory
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers: modification in self-attention
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Flex Attention: A Programming Model for Generating Optimized Attention Kernels: auto-generated attention kernel
- Splitwise: Efficient generative LLM inference using phase splitting: splitting prefill and decode in a map-reduce style, by UW and Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: also split the prefill and decode, accepted by OSDI'24
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads: seems a combination of SARATHI and Splitwise
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference: similar to splitwise, accepted by ASPLOS'24
- Splitwiser: Efficient LLM Inference with Constrained Resources
- ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit: Token-adaptive Early Exit
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures: compilation optimization on the computation graph
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference: optimize attention kernel in mix-batching
- Focus: High-Performant and Customizable Attention Engine for LLM Serving: flexible attention engine, advised by Chen Tianqi and accepted by MLSYS'25
- ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming: Multi-level Triton
This part is inspired by PagedAttention of vLLM. And there are many top-conference papers discussing memory management in DL computing on GPUs.
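A toy block-table allocator in the spirit of PagedAttention (greatly simplified: no actual tensors, swapping, or copy-on-write), just to show how logical token positions map to physical KV blocks and why allocation stays fragmentation-free.

```python
class PagedKVCache:
    """Toy paged KV cache: memory is carved into fixed-size blocks, and each
    sequence owns a block table mapping its logical positions to physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # seq_id -> list of physical block ids
        self.seq_lens = {}                # seq_id -> number of tokens stored

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks: preempt or swap")
            table.append(self.free_blocks.pop())     # allocate exactly one more block
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                                   # a 40-token sequence uses 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])
cache.free(0)
```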
- ⭐ Efficient Memory Management for Large Language Model Serving with PagedAttention: memory page management for the KV-Cache in Attention-type models, accepted by SOSP'23 (many papers cite the vLLM project instead of this paper, which makes it harder to track its "cited by")
- ⭐ AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs: cache management for inference, accepted by MLSys'23
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs: block-based data layout, accepted by TACO, October 2023
- AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems: a unique observation that there is rich similarity in attention computation across inference sequences
- BPIPE: memory-balanced pipeline parallelism for training large language models: memory balance perhaps can also work well in inference, by SNU, accepted by ICML'23
- Improving Large Language Model Throughput with Efficient Long-Term Memory Management: perhaps a new view
- CacheGen: Fast Context Loading for Language Model Applications
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models: considers the memory consumption in fine-tuning
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference: compress KV Cache
- LLM as a System Service on Mobile Devices: LLM as a service on mobile devices
- DistMind: Efficient Resource Disaggregation for Deep Learning Workloads: by Xin JIN, accepted by ToN, Jan 2024
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching: sparsity in KV Cache, accepted by ISCA'24
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving: a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention: improve PagedAttention
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models: only computes and caches the KVs of a small number of layers
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models: compress KV cache
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion: a very popular idea recently
- Block Transformer: Global-to-Local Language Modeling for Fast Inference: build KV Cache blocks from many tokens' KV Cache
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool: KV Cache management in a P/D disaggregation architecture
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention: multi-round chat and memory management, accepted by ATC'24
- Stateful Large Language Model Serving with Pensieve: similar to CachedAttention
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving: P/D disaggregation architecture and KV Cache management
- P/D-Serve: Serving Disaggregated Large Language Model at Scale: a P/D based system, with D2D access optimization
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management: offload KV Cache
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption: a survey on optimizing the KV Cache
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving: tensor management especially for LLM inference
- Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation: remove unimportant tokens from the KV Cache
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving: compression and streaming transfer of the KV Cache, accepted by SIGCOMM'24
- Compute Or Load KV Cache? Why Not Both?: recompute and load together for long context
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management: manage the KV Cache by layer
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching: compress KV cache and multi-level memory
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models: better prefix caching
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference: low-rank KV cache and dynamic KV cache rebuilding
- ⭐ VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration: the first work I have seen that optimizes the KV cache in vision models
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction: KV cache page eviction and recall, accepted by NIPS'24
- SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation: optimization on ZeRO? redesigns the data flow of heterogeneous hardware and sharded model training to minimize excessive communication overhead, accepted by NIPS'24
- ⭐ KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management: memory management for KV cache and parameters, seems a novel work considering weight migration
- SYMPHONY: Improving Memory Management for LLM Inference Workloads: dynamically migrates K,V caches to enable fine-grained scheduling of inference requests
- Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management: efficiently migrate requests and their KV cache among GPUs
- Efficient LLM Inference with Activation Checkpointing and Hybrid Caching: recompute + cache for KV cache management, only recomputes attention (no projection)
- Memory Offloading for Large Language Model Inference with Latency SLO Guarantees: offload KV cache to CPU memory
- Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving: sparse attention is hot recently; dynamic KV cache budget and efficient KV cache loading from the CPU
- Efficient and scalable huge embedding model training via distributed cache management: caching based on staleness and skewed popularity distributions
- BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference: different KV heads have different importance, then offload and compress
- Fast State Restoration in LLM Serving with HCache: cache for offloading the KV cache to the CPU, accepted by EuroSys'25
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference: use model replication to improve serving throughput and GPU utilization?
- Characterizing the Behavior and Impact of KV Caching on Transformer Inferences under Concurrency: instruments vLLM to measure and analyze fine-grained metrics (token throughput, KV cache memory access patterns, load balancing of the forward passes) during different inference stages (prefill, decode, batching, and KV cache eviction policies) in several scenarios
- Mitigating KV Cache Competition to Enhance User Experience in LLM Inference: mitigating KV Cache competition with several techniques
- Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache: KV cache reuse is able to save cloud cost across a range of workloads with long context
- KVSort: Drastically Improving LLM Inference Performance via KV Cache Compression: error-bounded lossy compression on sorted KV vectors
- FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework: dynamic batching and a KV cache pool for multimodal KV cache compression, guided by Jidong ZHAI
- Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management: multi-level KV cache management (an idea lacking innovation) and request reordering, accepted by ASPLOS'25
- Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains: memory management framework for a sudden increase in the number of inference requests to a cloud-hosted LLM, accepted by ASPLOS'25
- ⭐ Jenga: Effective Memory Management for Serving LLM with Heterogeneity: optimization on PagedAttention, targeted at heterogeneous embeddings in LLMs
- Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching: KV cache load/offload?
- Hardware-based Heterogeneous Memory Management for Large Language Model Inference: an asymmetric memory architecture consisting of capacity-centric and bandwidth-centric memory with computation units attached to each memory device, more like a hardware paper
- Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving: survey and analysis of KV cache compression techniques for LLM serving, accepted by MLSys'25
- FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference: FastTree introduces GPU kernels tailored for efficiently processing queries that share contexts through a radix tree
Note: some papers about prefix sharing are not in this section.
- LLM Query Scheduling with Prefix Reuse and Latency Constraints: balancing prefix reuse and fairness in query scheduling
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching target at State Space Models
- Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective: a helpful survey
- ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput: balances throughput and latency under different hardware
- Understanding and Optimizing Multi-Stage AI Inference Pipelines: a Heterogeneous Multi-stage LLM inference Execution Simulator
- Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library: implement some APIs to reduce the shared memory footprint, accepted in HPC Asia'23
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture: help us understand GPUs
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving: optimizing energy consumption by lowering GPU frequency
- Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference: similar to cutlass, optimization on intel GPU
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels: for Ascend GPU (perhaps also work for NVIDIA?)
- MEPipe: Democratizing LLM Training with Memory-Efficient Slice-Level Pipeline Scheduling on Cost-Effective Accelerators: maybe inference on RTX4090?
- PASK: Cold Start Mitigation for Inference with Proactive and Selective Kernel Loading on GPUs: DAC's paper for hardware
- ⭐ Hardware Compute Partitioning on NVIDIA GPUs: spatially partition the computing units of NVIDIA GPUs transparently, worth reading
- Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing: Bless leverages precise computing resource management and fine-grained kernel scheduling to ensure stringent quota guarantees and reduce latency fairly for applications with varying GPU quotas, accepted by EuroSys'25
Heterogeneous scenarios or a single PC are becoming increasingly important.
Optimizing computation on the CPU or SSD requires different methods.
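A minimal PyTorch-style sketch of the simplest offloading pattern used by several works below: weights stay in CPU memory and each layer is streamed to the accelerator right before it runs (real systems prefetch asynchronously to hide the transfer and often pin memory); the toy `layers` model is hypothetical.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy "model": weights stay in CPU RAM between uses.
layers = [nn.Linear(2048, 2048) for _ in range(8)]

def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:
        layer.to(device)        # stream this layer's weights to the accelerator
        x = layer(x)
        layer.to("cpu")         # release accelerator memory before the next layer
    return x

with torch.no_grad():
    print(offloaded_forward(torch.randn(1, 2048)).shape)
```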
- Efficient LLM Inference on CPUs: LLMs with quantization on CPUs, by Intel, accepted by NIPS'23
- Inference Performance Optimization for Large Language Models on CPUs: xFasterTransformer, LLM inference optimization on CPUs, by Intel
- Distributed Inference Performance Optimization for LLMs on CPUs: similar work to the above, by Intel
- Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference: inference on CPU based on advanced hardware
- TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload: free to run operations such as GPU kernel calls in many different orders
- Improving Throughput-oriented Generative Inference with CPUs: cooperation of CPUs and GPU, accepted by APSys'23
- Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs: execute the operators on the CPU and GPU in parallel, by SJTU
- EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices: inference on edge devices, accepted by ICDE'23
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: by SJTU IPADS
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory: by Apple
- Efficient LLM inference solution on Intel GPU: Intel GPU is interesting
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines: efficient serving with a CPU-GPU system
- Efficient and Economic Large Language Model Inference with Attention Offloading: similar to FastDecode
- Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference: similar to FastDecode: CPU for attention and GPU for the rest
- Petals: Collaborative Inference and Fine-tuning of Large Models: looks like heterogeneous resources are being utilized
- Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures: analyzes performance on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems
- NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
- ⭐ A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors: use CPU for DL, accepted by ASPLOS'24
- LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control: based on offloading
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: computation on CPU with quantization
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading: how to use SSDs?
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference: offload the KV Cache to CSDs (Computational Storage Drives)
- TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference: some ideas for using the CPU
- Improving Throughput-oriented LLM Inference with CPU Computations: pipelining in CPU-GPU inference
- Understanding Performance Implications of LLM Inference on CPUs: analysis of using CPUs for inference
- GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines: the NIC can be important, especially in communication
- Pie: Pooling CPU Memory for LLM Inference: use CPU memory to enlarge the batch size and improve throughput, by Ion Stoica
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference: offload KV cache and attention to the CPU for larger batch sizes, similar to FastDecode, by Ion Stoica, accepted by MLSys'25
- Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems: more like inference on personal devices
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation: use recomputation and transfer to reproduce the KV cache; can use their runtime and split parallelism
Inspired by AI PCs, this opens up a new area.
It now also includes edge systems.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU: inference a 30B model with a 16GB GPU, accepted by ICML'23
- LLM as a System Service on Mobile Devices: an intro for LLM on private devices
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: based on sparsity in NN Layers
- ⭐ LLM for Mobile: An Initial Roadmap: a road map
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone: work on smartphone
- Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM: on edge devices, accepted by MICRO'24
- ⭐ HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators: features on mobile SoCs, tensor partition strategy, to do Heterogeneous AI inference
- PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks: cloud(LLM)-edge(SmallLM) collaboration
- FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference: offloading based framework, asynchronous prefetching, balanced memory locking, and flexible tensor preservation
- Fast On-device LLM Inference with NPUs: chunked prefill, offload outlier to CPU/GPU, schedule computation to NPU/CPU/GPU, accepted by ASPLOS'25
- FlexInfer: Flexible LLM Inference with CPU Computations: offload kvc and weights to CPU, accepted by MLSYS'25
- An Adaptive and Scalable Framework for Resource-Efficient Deployment of Mixture of Experts in LLM-Based Intelligent IoT Networks: deploy MoE on IoT, but the strategies are commonly used
- A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models: we can learn edge-cloud serving from this paper, based on speculative decoding
- HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents: assign sub-tasks of LLM agent to local SLM and cloud-side LLM
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs: decentralized system on consumer-level GPUs, though there will be some problems
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet: some techniques in this paper will be instructive
- ⭐ HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices: heterogeneous parallel computing using CPUs and GPUs
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs: accepted by ATC'24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: we can get a performance model for heterogeneous GPU clusters and learn from the algorithmic analysis
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity: making heterogeneity-aware GPU provisioning decisions for LLM serving
In this part, researchers provide some algorithm-based methods for optimizing LLM inference.
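A small sketch of the heavy-hitter KV cache eviction idea in the spirit of the H2O/Scissorhands/Keyformer line of work below: keep the most recent tokens plus the tokens with the largest accumulated attention scores; the attention scores here are random placeholders.

```python
import numpy as np

def heavy_hitter_keep(attn_scores: np.ndarray, budget: int, recent: int):
    """Return indices of KV entries to keep: the `recent` newest tokens plus
    the old tokens with the largest accumulated attention scores."""
    n = attn_scores.shape[-1]
    assert budget > recent
    if n <= budget:
        return np.arange(n)
    recent_idx = np.arange(n - recent, n)
    acc = attn_scores[:, : n - recent].sum(axis=0)    # accumulated score per old token
    heavy_idx = np.argsort(acc)[-(budget - recent):]  # the "heavy hitters"
    return np.sort(np.concatenate([heavy_idx, recent_idx]))

rng = np.random.default_rng(0)
scores = rng.random((32, 100))       # (queries seen so far, cached tokens) -- placeholder
keep = heavy_hitter_keep(scores, budget=20, recent=8)
print(len(keep), keep)
```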
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models: accepted by NIPS'23
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time: consider the different importance of tokens in KV Cache, similar to H2O
- ⭐ SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: skipping may be a useful method, like spec decoding
- Inference with Reference: Lossless Acceleration of Large Language Models: also a potential optimization
- Efficient Streaming Language Models with Attention Sinks: streaming LLM for infinite sequence lengths, by MIT and under guidance of Song HAN
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference: also important tokens, just like H2O, accepted by MLSys'24
- Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache: an optimization to H2O, accepted by MLSys'24
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval: use approximate nearest neighbor search to search the most relevant KV cache
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs: based on the observation that adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention: sparse attention
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation: algorithm optimization for less KV Cache
- Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU: use characterization results to optimize KV Cache management
- ⭐ DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: you must know DeepSpeed
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- DeepSpeed Model Implementations for Inference (MII)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs: developed by ByteDance, accepted by IPDPS'23
- TurboTransformers: an efficient GPU serving system for transformer models: by Tencent Inc, accepted by PPoPP'21
- Accelerating Generative AI with PyTorch II: GPT, Fast: a blog in PyTorch, use only PyTorch code, gpt-fast
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving: based on FlexFlow
- FlashInfer: Kernel Library for LLM Serving
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- Efficiently Programming Large Language Models using SGLang: we can get some optimization from here
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models: different parallelism strategies, by Tencent
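As background for the KV-cache entries above (H2O, Scissorhands, Keyformer, Q-Hitter): the shared idea is to keep a small window of recent tokens plus the tokens that have accumulated the most attention, and evict the rest. A minimal sketch follows, assuming we track one accumulated attention score per cached token; the budget sizes and scoring rule here are illustrative, not any single paper's exact policy.

```python
import numpy as np

def evict_kv_cache(attn_scores, recent_window=32, heavy_budget=96):
    """Pick which cached token positions to keep.

    attn_scores: 1-D array with the accumulated attention each cached token
                 has received so far (a proxy for its importance).
    Returns the sorted indices of tokens to keep in the KV cache.
    """
    n = len(attn_scores)
    keep = set(range(max(0, n - recent_window), n))        # always keep recent tokens
    older = np.argsort(attn_scores[: max(0, n - recent_window)])[::-1]
    keep.update(int(i) for i in older[:heavy_budget])       # plus the heavy hitters
    return sorted(keep)

# usage: after each decode step, accumulate the new attention row into
# `scores`, then drop K/V entries whose index is not returned above.
scores = np.random.rand(512)
kept = evict_kv_cache(scores)
assert len(kept) <= 32 + 96
```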
LLM service providers will focus on this part. Engineering practices are just as important as algorithm optimization.
- ⭐ AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: accepted by OSDI'23
- ⭐ STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining: Elastic will be important in the future, accepted by ASPLOS'23
- INFaaS: Automated Model-less Inference Serving: accepted by ATC'21
- Tabi: An Efficient Multi-Level Inference System for Large Language Models: under the guidance of Kai CHEN, accepted by EuroSys'23
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance: cost is what service providers care about most
- FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning: accepted by NSDI'23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud: model ensembling, accepted by NSDI'22
- SLA-Driven ML Inference Framework for Clouds with Heterogeneous Accelerators: accepted by MLSys'22
- FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference: accepted by ICPP'23
- Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
- BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching: accepted by SC'20
- MArk: exploiting cloud services for cost-effective, SLO-aware machine learning inference serving: accepted by ATC'19
- ⭐ MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters: challenges and solutions in real-world scenarios, accepted by NSDI'22
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads: under the guidance of Ion Stoica
- Learned Best-Effort LLM Serving: a best-effort serving system from UCB
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences: accepted by OSDI'22, enables microsecond-scale kernel preemption and controlled concurrent execution in GPU scheduling
- PipeSwitch: fast pipelined context switching for deep learning applications: PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications, accepted by OSDI'20
- ⭐ Paella: Low-latency Model Serving with Software-defined GPU Scheduling: how the tasks are scheduled to GPUs, accepted by SOSP'23
- OTAS: An Elastic Transformer Serving System via Token Adaptation: elastic in serving while considering SLO
- DeltaZip: Multi-Tenant Language Model Serving via Delta Compression: Multi-Tenant is interesting
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models: find different problems in serving LLMs
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access: accepted by EuroSys'23
- Towards Pareto Optimal Throughput in Small Language Model Serving: Small Language Model Serving
- MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services: idea of QoE
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving: similar to FlexLLM
- LLMServingSim: A Simulation Infrastructure for LLM Inference Serving Systems: provide some features about LLM serving
- Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving: improvements to ORCA (SLS) and FastServe (ILS)
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems: consider serving efficiency from the energy view
- Power-aware Deep Learning Model Serving with μ-Serve: consider energy
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming: a new token transmission scheme, useful in chatbots
- Responsive ML inference in multi-tenanted environments using AQUA: serves several LLMs by time-sharing GPU cycles and offloading context to other GPUs in multi-tenanted environments
- Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning: effect of hyper-parameters in the inference engine
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: request scheduling
- Efficient LLM Scheduling by Learning to Rank: rank requests based on predicted output length and schedule accordingly (see the scheduling sketch after this list)
- UELLM: A Unified and Efficient Approach for LLM Inference Serving: serving optimization in MaaS clouds
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving: scheduling the requests
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving: harvest stranded GPU resources for offline LLM inference tasks
- LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services: accepted by SC'24
- Revisiting SLO and Goodput Metrics in LLM Serving: check the metrics SLO and Goodput in LLM serving
- Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system: schedule tasks in a multi-tenant deep learning (DL) cluster, accepted by SoCC'24
- ⭐ Ensuring Fair LLM Serving Amid Diverse Applications: ensures fair LLM access across diverse applications, with a copilot trace analysis
- BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching: exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching: similar to BlendServe
- iServe: An Intent-based Serving System for LLMs: uses a cost model to dynamically set the deployment configuration
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms: seems a practical engineering work? takes temperature and power consumption into account
- ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments: a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments, and fluctuating online conditions
- ⭐ MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism: we can learn from the expert-attention disaggregation
- SkyServe: Serving AI Models across Regions and Clouds with Spot Instances: seems to be follow-up work to SpotServe, serves AI models over a mixture of spot and on-demand replicas, EuroSys'25
- Past-Future Scheduler for LLM Serving under SLA Guarantees: an efficient request scheduler that considers the historical distribution of request output lengths and calculates memory occupancy at each future time point, and the framework LightLLM
- Deferred prefill for throughput maximization in LLM inference: looks a bit counter-intuitive
- Performance Aware LLM Load Balancer for Mixed Workloads: a heuristic-guided, reinforcement learning-based router with a trainable response-length predictor and a novel formulation for estimating the impact of mixing different workloads
- Niyama: Breaking the Silos of LLM Inference Serving: request scheduling paper
- ⭐ Optimizing SLO-oriented LLM Serving with PD-Multiplexing: PD multiplexing, enabling in-place and phase-decoupled compute partition, seems different from simple multiplexing
- Ascendra: Dynamic Request Prioritization for Efficient LLM Serving: set high or low priority for requests
- PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications: targets prefill-only workloads, which output only one token
- ⭐ ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production: shows real production LLM workloads
- Efficient LLM Serving on Hybrid Real-time and Best-effort Requests: co-locates real-time and best-effort requests, proposes request scheduling and KV cache sharing
- SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling: a state-aware scheduling that optimizes the SLO attainment in LLM serving
- ⭐ A System for Microserving of LLMs: seems an idea and industrial practice that makes sense
- DeepFlow: Serverless Large Language Model Serving at Scale: provide fine-grained LLM service
- ⭐ Towards Swift Serverless LLM Cold Starts with ParaServe: pipeline parallelism with a dynamically adjusted parallelism strategy, and accelerates cold start
- λScale: Enabling Fast Scaling for Serverless Large Language Model Inference: a serverless inference system that achieves fast model scaling via fast model multicast, inference execution during model transmission, and dynamically constructed execution pipelines
- Medusa: Accelerating Serverless LLM Inference with Materialization: targets the cold start of serverless LLMs, solving the available-KV-cache-blocks profiling and CUDA graph capture problems, accepted by ASPLOS'25
- SMore: Enhancing GPU Utilization in Deep Learning Clusters by Serverless-based Co-location Scheduling: serverless computing reveals an opportunity to optimize GPU utilization with fine-grained resource allocation
- PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling: rapidly launch inference services in response to bursty requests without preemptively over-provisioning GPUs
- Enabling Elastic Model Serving with MultiWorld: optimizing collective communication lib for LLM inference
- Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks
- AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive: communication strategy adapted at runtime, ICDCS'24
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training: a communication scheduler that aims to maximize GPU computation utilization by mitigating the communication contention among DLT jobs, SIGCOMM'24
- TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections: by Luo MAI, similar to SpotServe?
- SpotServe: Serving Generative Large Language Models on Preemptible Instances: by Xupeng MIAO and under the guidance of Zhihao JIA
- Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances: by the SpotServe team
- FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms: a FaaS-oriented Performance-aware Elastic Scaling system to enable efficient resource allocation in serverless platforms for ML jobs, accepted by SoCC'24
- Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale: resource allocation at cluster and data center scale
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows: scheduler for latency-sensitive request
- Llumnix: Dynamic Scheduling for Large Language Model Serving: scheduling across multiple instances may be helpful for me now
- Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths: solve Dynamic Input Lengths by multi-instance and request scheduling
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: scheduling based on an output length predictor
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs: request scheduling in cluster and on instance
- Fast Inference for Augmented Large Language Models: schedule for Augmented LLM
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling: a hodgepodge of prediction-based scheduling, memory management, and quantization
- The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving: cost model in request scheduling
- Queue Management for SLO-Oriented Large Language Model Serving: schedules requests with different models and different SLO requirements
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving: fairness and request switch
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location: request co-location to maximize serving throughput and prevent starvation, without compromising online serving latency
- Locality-aware Fair Scheduling in LLM Serving
- Queueing, Predictions, and LLMs: Challenges and Open Problems: prediction-based queueing and serving
- Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents: throughput-optimal scheduling analysis
- Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving: a memory-efficient hidden-state cache? and scheduling to use a bigger batch
- LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications: an uncertainty-aware scheduling framework for emerging compound LLM applications
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: share prefix and optimize KV Cache
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters: the beginning of serving for LoRA, under the guidance of Ion Stoica, accepted by MLSys'24
- Dynamic LoRA Serving System for Offline Context Learning: successor of S-LoRA
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference: serving LoRA is becoming more and more important
- Punica: Multi-Tenant LoRA Serving: accepted by MLSys'24
- Petals: Collaborative Inference and Fine-tuning of Large Models
- LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design: maybe useful, kernel optimization
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving: accepted by OSDI'24
- Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU: optimize SGMV kernels
- V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM: LoRA for vision models, and optimize LoRA kernels, accepted by EuroSys'25
- Efficient Multi-task LLM Quantization and Serving for Multiple LoRA Adapters: facilitates the sharing of a single quantized model for multiple LoRA adapters, accepted by NIPS'24
- Comparative Analysis and Optimization of LoRA Adapter Co-serving for Large Language Models: more like a survey for LoRA serving
- DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs: compresses model deltas to serve multiple full-parameter fine-tuned models (maybe not LoRA fine-tuning?)
- ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs: in fact similar to S-LoRA, on the background of serverless LLM+LoRA model
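A recurring trick in the scheduling papers above ("Efficient LLM Scheduling by Learning to Rank", the Past-Future scheduler, ALISE) is to order requests by a predicted output length so short jobs are not stuck behind long ones. Below is a minimal sketch of such a queue; the `predict_len` callback is a hypothetical stand-in for the learned length predictor.

```python
import heapq
import itertools

class PredictedSJFQueue:
    """Admit requests in shortest-predicted-output-first order."""

    def __init__(self, predict_len):
        self.predict_len = predict_len      # callable: prompt -> estimated output tokens
        self._heap = []
        self._tie = itertools.count()       # FIFO tie-break for equal predictions

    def submit(self, request_id, prompt):
        est = self.predict_len(prompt)
        heapq.heappush(self._heap, (est, next(self._tie), request_id, prompt))

    def next_batch(self, max_batch=8):
        batch = []
        while self._heap and len(batch) < max_batch:
            _, _, rid, prompt = heapq.heappop(self._heap)
            batch.append((rid, prompt))
        return batch

# usage with a trivial stand-in predictor (real systems train a model for this)
q = PredictedSJFQueue(predict_len=lambda p: len(p.split()) * 4)
q.submit("r1", "summarize this very long document about serving systems")
q.submit("r2", "hi")
print(q.next_batch())   # "r2" is served first
```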
For LoRA but not serving (a minimal LoRA forward-pass sketch follows this list)
- ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin: potential new style of LoRA
- Higher Layers Need More LoRA Experts
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- LoRA Meets Dropout under a Unified Framework: Analyze LoRA algorithmically
- HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning: algorithm optimization for LoRA
- SBoRA: Low-Rank Adaptation with Regional Weight Updates: an algorithm optimization for LoRA
- A Survey on LoRA of Large Language Models: survey of LoRAs, including parallel LoRA computing and Multi-LoRA, github
- mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs: can study the LoRA-aware pipeline parallelism scheme, github
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts: LoRA based MoE, github
- GongBu: Easily Fine-tuning LLMs for Domain-specific Adaptation: LLM fine-tuning tools
- Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method: selects several LoRA modules for a given content
- SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network: split learning(?): trains LoRA weights in a wireless network environment, stores LoRA in edge servers?
- Revolutionizing Large Model Fine-Tuning: The Role of LoRA in Parameter-Efficient Adaptation: a survey, can provide some reference
- HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression: optimizes fine-tuning memory overhead by quantization, accepted by MLSYS'25
- ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory: fine-tune
- HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models: split learning + LoRA, fine-tunes on client devices, sets different ranks for different weights
- Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management: explore the dependencies between requests and LoRAs to reduce TTFT
- Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics: LoRA algorithm analysis
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving
- Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses: place training and inference together, control the inference latency to the desired SLO, while maximizing the throughput of the training jobs co-located on the same GPUs, accepted by ICDCS'24
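As background for the LoRA entries above: LoRA freezes the base weight W and learns a low-rank update B·A, so the adapted layer computes y = xWᵀ + (α/r)·(xAᵀ)Bᵀ and only A and B are trained. A minimal NumPy sketch of the forward pass; shapes follow the standard LoRA formulation and the random initialization is only for illustration.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    """Forward pass of a LoRA-adapted linear layer.

    x: (batch, d_in)              activations
    W: (d_out, d_in)              frozen base weight
    A: (r, d_in), B: (d_out, r)   trainable low-rank adapter (only these update)
    """
    r = A.shape[0]
    base = x @ W.T                              # frozen path
    update = (x @ A.T) @ B.T * (alpha / r)      # low-rank adapter path
    return base + update

# tiny example: d_in = d_out = 64, rank r = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64))
W = rng.normal(size=(64, 64))
A = rng.normal(size=(8, 64)) * 0.01
B = np.zeros((64, 8))        # B initialized to zero => starts identical to the base model
y = lora_linear(x, W, A, B)
print(y.shape)               # (2, 64)
```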
Long-context is a hot topic recently.
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: like an update to H2O or DejaVu, etc.; each attention head has a different memory budget
- Context Parallelism for Scalable Million-Token Inference
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection: select some important KV cache to take part in attention computation
Processing different ML workloads in a cluster.
- PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters: serve multiple different loads in GPU cluster, accepted by SC'24
- PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption: why Encryption in LLM inference? by IPADS, accepted by ASPLOS'25
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads: schedule different workloads
- ⭐ Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models: retrieval will be helpful, but how to use it?
- Generative Dense Retrieval: Memory Can Be a Burden: accepted by EACL'24
- ⭐ Accelerating Retrieval-Augmented Language Model Serving with Speculation: also a paper for RaLM
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation: improve RAG inference with cache, under the guidance of Xin JIN
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research
- NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: RAG with spec decoding, different draft models with different RAG
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation: trade-off between latency and quality
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation: combine RAG with prefix caching (see the chunk-cache sketch after this list)
- RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving: analyses the RAG algorithm, then optimizes the system
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion: reuses precomputed KV caches, prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache, accepted by EuroSys'25
- Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs: KV cache for RAG knowledge which is stored on disk
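The RAG-serving entries above (RAGCache, CacheBlend, Cache-Craft) amortize prefill by caching the KV tensors of retrieved chunks and reusing them across requests. A minimal sketch of a chunk-keyed LRU cache follows; `compute_kv` is a hypothetical stand-in for the engine's prefill call, and real systems additionally handle positional re-encoding and selective recomputation, which is skipped here.

```python
import hashlib
from collections import OrderedDict

class ChunkKVCache:
    """LRU cache mapping a retrieved text chunk to its precomputed KV tensors."""

    def __init__(self, compute_kv, capacity=256):
        self.compute_kv = compute_kv          # callable: chunk text -> KV tensors
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)      # LRU hit
            return self._store[key]
        kv = self.compute_kv(chunk_text)      # miss: run prefill for this chunk
        self._store[key] = kv
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least-recently-used chunk
        return kv

# usage: concatenate cached chunk KVs (plus the question's own KV) before decoding
cache = ChunkKVCache(compute_kv=lambda text: f"<kv for {len(text)} chars>")
kvs = [cache.get(c) for c in ["chunk A ...", "chunk B ...", "chunk A ..."]]
```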
Here are two repositories that collect papers for MoE: Papers: MoE/Ensemble, and MOE papers to read
- ⭐ DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: accepted by ICML'22
- Accelerating Distributed MoE Training and Inference with Lina: both training and inference, accepted by ATC'23
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts: accepted by MLSys'23
- Tutel: Adaptive Mixture-of-Experts at Scale: accepted by MLSys'23
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference: accepted by ISCA'24
- Optimizing Mixture of Experts using Dynamic Recompilations: under the guidance of Zhihao JIA
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping: expert swapping is interesting
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference: some hot optimizations for inference, accepted by NIPS'24
- Exploiting Transformer Activation Sparsity with Dynamic Inference
- SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production: accepted by ACL'22
- Fast Inference of Mixture-of-Experts Language Models with Offloading: combines MoE with offloading (see the expert-offloading sketch after this list)
- ⭐ MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving: under the guidance of Luo MAI, provides some features and designs in MoE inference
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement: train MoE with a new schedule plan, maybe works for inference
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models: quantized experts and expert management
- Toward Inference-optimal Mixture-of-Expert Large Language Models: some analysis for training MoE based on inference cost
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: communication optimization with dedicated schedules for MP+EP+ESP MoE training, maybe also works for inference, accepted by InfoCom'24
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models: based on offload, accepted by MLSys'24
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy: introduces some features of MoE, accepted by ICLR'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: introduces some features of MoE too
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an algorithmic change in MoE; a good introduction paper
- Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies: all-to-all comm, HPDC'24
- Scattered Mixture-of-Experts Implementation: ScatterMoE, an implementation of Sparse MoE
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts: the shortcut connection looks more like an algorithm optimization, and provides opportunities for overlapping
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: an open-source work whose inference is based on expert parallelism
- SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget: MoE experts offloading, at the cost of reduced accuracy
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching: optimization on Pre-gated MoE, by IPADS
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design: pre-gating router decoupled from the MoE backbone that facilitates system-friendly pre-computing and lookahead scheduling, NIPS'24
- MoEsaic: Shared Mixture of Experts: share experts among different MoE instances, "MoE's modular architecture lets users compose their model from popular off-the-shelf experts" is a new scenario
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference: uses quantization to decrease the overhead of loading uncached experts, on edge devices
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference: prediction- and offload-based optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: offloads MoE weights to CPU layer by layer and uses an offload pipeline to accelerate MoE inference on a single GPU, accepted by ASPLOS'25
- ⭐ MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems: benchmarking for MoE systems
- ⭐ Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection: damn! I had considered this before :( key insight is that expert importance varies significantly across tokens and inference phases; utilize this to solve the all-activate problem
- ⭐ EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference: GEMM implementation optimization and all-to-all communication overlap
- ⭐ Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling: optimize the all-to-all order, co-locate experts from different models
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing: utilizes expert dependency to optimize GPU load balance and all-to-all latency
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving: fine-grained expert offloading, prefetching and caching
- ⭐ Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts: fine-grained task scheduling and computation/all-to-all overlap
- eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference: predicts and preloads experts from CPU, uses the same experts for subsequent prompts, and skips routing for some tasks
- MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints: CPU-GPU based MoE inference
- Faster MoE LLM Inference for Extremely Large Models: fewer activated experts for faster inference
- ⭐ Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony: by dynamically queuing tokens at each layer (referred to as μ-queuing), GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling: scheduling computation and communication in MoE training, perhaps useful for MoE inference, accepted by EuroSys'24
- ST-MoE: Designing Stable and Transferable Sparse Expert Models: an early foundational work on MoE
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping: computation-communication overlapping, accepted by MLSys'24
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training: training with offload, ICML'24
- MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: predicts expert workload to optimize training; load stabilizes in the middle and late stages of training, but may not work as well for inference
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization: parallel strategy of MoE, accepted by ATC'23
- APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes: fine-tune MoE models with CPU and some algorithm insights, accepted by SC'24
- FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models: there isn't much of a novel technology(?), accepted by ASPLOS'25
- MOSEL: Inference Serving Using Dynamic Modality Selection: improving system throughput by 3.6x with an accuracy guarantee and shortening job completion times by 11x
- Generative AI Beyond LLMs: System Implications of Multi-Modal Generation: by META
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: by Google
- Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference: optimization for diffusion models by cache
- DISTMM: Accelerating distributed multimodal model training: helpful although it is made for training, accepted by NSDI'24
- Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training: distributed MM training
- Efficiently serving large multimedia models using EPD Disaggregation
- MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving: position-independent caching, with both reuse and recompute, may lead to performance loss
- Characterizing and Efficiently Accelerating Multimodal Generation Model Inference: some insights
- ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving: provide comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention
- HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving: a Hybrid Encode-Prefill-Decode (EPD) Disaggregation architecture
- DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models: disaggregation in MM training, under the guidance of Xin JIN
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management: efficient MM model training
- Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling: ASPLOS'25
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models: serving Diffusion models, accepted by NSDI'24
- DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines: accepted by MLSys'24
- SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules: more papers in diffusion models
- PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving: algorithm-based framework
- DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling: address the problem of serving text-to-image generation diffusion models in a query-aware resource-efficient manner by serving "easy" queries using a lightweight diffusion model without compromising image generation quality
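Many of the MoE-serving entries above (MoE-Infinity, fMoE, HOBBIT, SwapMoE) keep only a subset of experts resident on the GPU and fetch the rest from CPU/SSD when the router selects them. Below is a minimal sketch of top-k routing plus an LRU expert cache; `load_expert_to_gpu` is a hypothetical placeholder for the actual weight transfer, and real systems add prediction/prefetching on top.

```python
import numpy as np
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident on the GPU (LRU policy)."""

    def __init__(self, load_expert_to_gpu, capacity=8):
        self.load = load_expert_to_gpu       # callable: expert_id -> GPU-resident weights
        self.capacity = capacity
        self._resident = OrderedDict()

    def fetch(self, expert_id):
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)               # cache hit
        else:
            self._resident[expert_id] = self.load(expert_id)    # miss: copy from CPU/SSD
            if len(self._resident) > self.capacity:
                self._resident.popitem(last=False)               # evict coldest expert
        return self._resident[expert_id]

def route_and_fetch(router_logits, cache, k=2):
    """Pick the top-k experts for one token and make sure they are on the GPU."""
    topk = np.argsort(router_logits)[-k:][::-1]
    return [(int(e), cache.fetch(int(e))) for e in topk]

# usage with 64 experts and dummy weights
cache = ExpertCache(load_expert_to_gpu=lambda e: f"<weights of expert {e}>")
experts = route_and_fetch(np.random.randn(64), cache)
```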
What is this? Maybe multiple LLMs?
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems: a new scenario, by Stanford
- ALTO: An Efficient Network Orchestrator for Compound AI Systems: also new to me, by Stanford
- Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling: accuracy scaling is interesting, accepted by ASPLOS'24
- ⭐ MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving: multiple LLMs
- ROUTERBENCH: A Benchmark for Multi-LLM Routing System: but what is multi-LLM?
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference: prompt KV cache reuse, accepted by MLSys'24
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving: similar to BlockLLM?
- Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution: for LLM-based Applications
- RouteLLM: Learning to Route LLMs with Preference Data: use multiple LLMs for efficient serving
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference: inference several models simultaneously
- CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory: a new scenario, Collaboration-of-Experts instead of Mixture-of-Experts, provides some new opportunities, accepted by ASPLOS'25
- ⭐ SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference: enables service-aware and latency-optimized LLM sharing on same device
- HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing: uses multiple small models to approach the accuracy of serving solely with giant DNNs(?) (see the cascade-routing sketch after this list)
- Teola: Towards End-to-End Optimization of LLM-based Applications: end-to-end optimization
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable: accepted by OSDI'24
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications: many LLM apps share GPU, accepted by EuroSys'24
- Why Do Multi-Agent LLM Systems Fail?: learn algorithm from it
- ⭐ Autellix: An Efficient Serving Engine for LLM Agents as General Programs: multi-agent has something similar to LLM application, scheduling and preemption
- Fast Inference for Augmented Large Language Models: seems a subclass of multi-agent
- Towards End-to-End Optimization of LLM-based Applications with Ayo: utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph, enables optimizations in parallelization, pipelining across primitives of different modules, and enhances scheduling to improve application-level performance
- Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation: multi-LLM's end-to-end running
- Tempo: Application-aware LLM Serving with Mixed SLO Requirements: Meet SLO requirements for all services in the system and accelerate the overall process, can be used in multi-agent application
- Characterization of Large Language Model Development in the Datacenter: fault-tolerant serving in the future?
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement: Fault Tolerance in MoE training
- Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training: checkpointing in MoE
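Several of the multi-LLM entries above (RouteLLM, HybridServe) route or cascade queries between a cheap model and an expensive one. A minimal confidence-based cascade sketch follows; `small_model`, `large_model`, and the confidence signal are hypothetical placeholders (real systems derive confidence from token log-probs or a learned router).

```python
def cascade_answer(query, small_model, large_model, threshold=0.8):
    """Try the cheap model first; escalate if its confidence is too low.

    Each model is a hypothetical callable returning (answer, confidence in [0, 1]).
    """
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(query)          # escalate the hard query
    return answer, "large"

# usage with toy stand-ins
small = lambda q: ("short answer", 0.9 if len(q) < 40 else 0.3)
large = lambda q: ("carefully reasoned answer", 1.0)
print(cascade_answer("What is 2+2?", small, large))
print(cascade_answer("Explain the trade-offs of disaggregated prefill/decode.", small, large))
```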
It is usually related to CPU-GPU heterogeneity and GPU power consumption.
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving: early exits, accepted by SOSP'24 (see the early-exit sketch after this list)
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation: early exits and some system optimization, accepted by SOSP'24
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework: framework for RLHF
- HybridFlow: A Flexible and Efficient RLHF Framework: framework for RLHF, accepted by EuroSys'25
- RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
- Systems Opportunities for LLM Fine-Tuning using Reinforcement Learning: optimization for LLM Fine-Tuning using Reinforcement Learning
- ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation: accepted by MLSYS'25
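For the early-exit entries above (Apparate and the per-input compute-adaptation work): lightweight heads attached to intermediate layers let the forward pass stop once a prediction is confident enough. A minimal sketch with hypothetical `layers` and `exit_heads` callables; thresholds and head placement are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward_with_early_exit(x, layers, exit_heads, threshold=0.9):
    """Run layer by layer; return early when an exit head is confident.

    layers:     list of callables, hidden -> hidden
    exit_heads: list of callables, hidden -> class logits (one per layer)
    """
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = layer(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:              # confident enough: exit here
            return int(probs.argmax()), depth
    return int(probs.argmax()), len(layers) - 1   # fell through to the last layer

# usage with toy linear layers and heads
rng = np.random.default_rng(0)
layers = [lambda h, W=rng.normal(size=(16, 16)) * 0.1: np.tanh(h @ W) for _ in range(4)]
heads = [lambda h, W=rng.normal(size=(16, 3)): h @ W for _ in range(4)]
print(forward_with_early_exit(rng.normal(size=16), layers, heads))
```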
Wise men learn from others.
- Orca 2: Teaching Small Language Models How to Reason
- FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference: optimization for retrieval-augmented language model
- Optimizing Dynamic Neural Networks with Brainstorm: this idea has the potential to go further, accepted by OSDI'23
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Ring Attention?
- Reducing Activation Recomputation in Large Transformer Models: by NVIDIA
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models: an interesting performance metric, accepted by NIPS'23
- FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication: accepted by SIGMOD'23
- Efficient Multi-GPU Graph Processing with Remote Work Stealing: accepted by ICDE'23
- ARK: GPU-driven Code Execution for Distributed Deep Learning: accepted by NSDI'23
- Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs: accepted by MLSys'22
- Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing: Scheduling for Serverless Computing
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters: expand to other ML models instead of LLM
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
- Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM: efficient SpMM, accepted by ASPLOS'24
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching: GPU memory pool, accepted by ASPLOS'24
- QuickLLaMA: Query-aware Inference Acceleration for Large Language Models: an inference-friendly LLaMA architecture
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching for new model architectures that combine attention with SSMs
- Comprehensive Deadlock Prevention for GPU Collective Communication: communication library
I'd like to create a separate area for data flows. It's just my preference.
- ⭐ FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks: dataflow in inference
- Pathways: Asynchronous Distributed Dataflow for ML: accepted by MLSys'22
- VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware: accepted by MLSys'22
- NeuStream: Bridging Deep Learning Serving and Stream Processing: dataflow in DNN serving, accepted by EuroSys'25
How about data pre-processing overhead in training?
Just my preference.
- Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication
- GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
- PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks: accepted by IPDPS'24
- NPA: Improving Large-scale Graph Neural Networks with Non-parametric Attention: SIGMOD'24
- Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression: compress node features in graph, accepted by VLDB'24
- Mega: More Efficient Graph Attention for GNNs: optimize graph attention efficiency, ICDCS'24
- TORCHGT: A Holistic System for Large-Scale Graph Transformer Training: graph transformer model
Just my preference, too.