Summary of some awesome works for optimizing LLM inference
This summary includes three parts:
- some repositories that you can follow
- some representative researchers and labs that you can follow
- some important works in different research directions
For example, LLMSys-PaperList contains many excellent articles and is actively updated (which I believe is the most important quality for a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.
Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.
The blog "Large Transformer Model Inference Optimization" helped me a lot at the beginning.
The post OpenAI Keynote on Building Scalable AI Infrastructure seems to offer leading guidance.
Follow others' research, and find your own ideas.
It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people.
If you have a different opinion, please feel free to reach out to me through an issue.
In no particular order!!
Damn, I'm bad at remembering foreign names, so please forgive any omissions or misspellings.
Zhihao JIA: FlexFlow and other impressive works, an important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive works, an important role in machine learning systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML, including sparsity and quantization. Btw, the class TinyML and Efficient Deep Learning Computing is highly recommended, affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous for efficient MLSys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, etc., affiliated with UCB
SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh
IPADS: focuses more on PURE systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU
Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important MLSys works at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng MIAO: SpotServe, SpecInfer, HET, etc.
Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database research combined with MLSys, affiliated with HKUST
Lei CHEN: database research combined with MLSys; many papers, so I recommend focusing on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: works in systems and MLSys, affiliated with HKUST
I hope to summarize these impressive works based on their research directions.
But my summary is surely not comprehensive enough, and I am looking forward to your additions.
Perhaps someone should write a detailed survey.
Periodically checking the "cited by" lists of the papers marked with ⭐ will be helpful.
Paragraphs marked with 💡 are still imperfect.
- ⭐ Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models: evaluations help you find the bottleneck
- ⭐ Full Stack Optimization of Transformer Inference: a Survey: a survey by UCB
- ⭐ Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: worth a read
- ⭐ Deep Learning Workload Scheduling in GPU Datacenters: A Survey: survey for GPU Datacenters DL Workload Scheduling
- ⭐ Towards Efficient and Reliable LLM Serving: A Real-World Workload Study: a benchmark for LLM serving
- ⭐ LLM Inference Unveiled: Survey and Roofline Model Insights: both survey and analysis
- A Survey of Resource-Efficient LLM and Multimodal Foundation Models: worth reading
- Training and Serving System of Foundation Models: A Comprehensive Survey
- Model Compression and Efficient Inference for Large Language Models: A Survey
- ⭐ Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
- ⭐ A Survey on Efficient Inference for Large Language Models: worth reading
- Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
- ⭐ Navigating Challenges and Technical Debt in Large Language Models Deployment: important
- The CAP Principle for LLM Serving: another angle
- Demystifying Data Management for Large Language Models: talking about databases in LLMs, by Xupeng MIAO, accepted by SIGMOD'24
- Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI: with code
- A Survey on Mixture of Experts
- Analyzing LLM performance: The impact of high-bandwidth memory on model inference: analysis of inference
- Inference Optimization of Foundation Models on AI Accelerators
- LLM Inference Serving: Survey of Recent Advances and Opportunities: newest
- Contemporary Model Compression on Large Language Models Inference: survey in model compression
- ⭐ Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning: brings insights for MLSys
- Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- ⭐ A Survey on Inference Optimization Techniques for Mixture of Experts Models: a survey on MoE models
- Deploying Foundation Model Powered Agent Services: A Survey: survey for AI agent service
- A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey: survey on PEFT
Making useful benchmarks or evaluations is helpful.
- MLPerf Inference Benchmark: inference GitHub repo, a well-known benchmark
- llmperf: evaluate both performance and correctness, but based on Ray
- The Importance of Workload Choice in Evaluating LLM Inference Systems: important angles in LLM inference systems
- Vidur: A Large-Scale Simulation Framework For LLM Inference: test the performance of LLM inference
- Metron: Holistic Performance Evaluation Framework for LLM Inference Systems: an evaluation framework
- LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale: a simulator
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators: inference + hardware
- Towards Efficient Large Multimodal Model Serving: a survey on multimodal serving, and a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference: a performance evaluation framework, can be used to estimate the time cost
- Predicting LLM Inference Latency: A Roofline-Driven ML Method: predict inference performance based on the Roofline model
- GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments: a work for predicting LLMSys performance
- TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems: a simulator providing some performance analysis
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning: deepseek: mla + moe
- Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs: moe training with lower-specification hardware
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf
prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding
Both frameworks use parallel decoding, and deserve more detailed research.
There are some interesting papers about parallel decoding.
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding: how to make it auto-parallel?
In fact, I'm not so familiar with this topic. But perhaps OpenAI o1 used this...
Spend more time on inference than on pre-training.
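To make the idea concrete, here is a minimal best-of-N repeated-sampling sketch (the core pattern behind works like Large Language Monkeys below); `generate` and `score` are hypothetical stand-ins for a real LLM and a verifier/reward model.

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled completion from an LLM.
    return f"candidate-{random.randint(0, 9)}"

def score(prompt: str, answer: str) -> float:
    # Hypothetical stand-in for a verifier / reward model.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    # Spend extra inference-time compute: draw n samples, keep the best-scored one.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

print(best_of_n("What is 17 * 24?"))
```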
- ⭐ Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: Starter material, apply repeated sampling
- ⭐ Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters: Starter material, scaling LLM Test-Time to improve accuracy
- Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation: it seems few people have explored the efficiency of CoT; the two-stage method gives me some thoughts
- Fast Best-of-N Decoding via Speculative Rejection: optimize alignment in inference, accepted by NIPS'24
- S*: Test Time Scaling for Code Generation: perhaps can do some acceleration on Test Time Scaling
- Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately: manage the CoT
This topic is about GPT-o1, aka the strawberry.
- ⭐ Reverse engineering OpenAI’s o1: a leading blog for introduction in OpenAI’s o1
- ⭐ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: base work
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models: an improvement based on CoT
- Large Language Model Guided Tree-of-Thought: also a ToT
- Let's Verify Step by Step: verify by step can be helpful
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: what is Language Agent Tree Search (LATS)? accepted by ICML'24
- Critique-out-Loud Reward Models
- Generative Verifiers: Reward Modeling as Next-Token Prediction: a verifier, by DeepMind
Also known as speculative sampling, or model collaboration.
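Before the paper list, a toy sketch of the draft-then-verify loop with the standard accept/reject rule. It is a simplification: real systems verify all drafted positions in one batched forward pass and sample an extra token when everything is accepted; `draft_dist` and `target_dist` are hypothetical stand-ins for the two models.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_dist(prefix):
    # Hypothetical small draft model: returns a next-token distribution.
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def target_dist(prefix):
    # Hypothetical large target model: returns a next-token distribution.
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def speculative_step(prefix, k=4):
    """Draft k tokens with the cheap model, then verify them with the
    target model using the accept/reject rule of speculative sampling."""
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    accepted = []
    for tok, q in zip(drafted, q_dists):
        p = target_dist(prefix + accepted)           # target dist at this position
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                     # accept: output still follows the target dist
        else:
            residual = np.maximum(p - q, 0.0)        # reject: resample from the residual dist
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            break
    return accepted

print(speculative_step([1, 2, 3]))
```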
- ⭐ Accelerating Large Language Model Decoding with Speculative Sampling: the opening work of speculative decoding, by DeepMind
- ⭐ Fast inference from transformers via speculative decoding: work from the same period as the one above, by Google, accepted by ICML'23
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification: paper under guidance of Zhihao JIA, use Tree decoding and a set of draft models
- LLMCad: Fast and Scalable On-device Large Language Model Inference: paper under guidance of Xin JIN, speculative decoding for on-device LLM inference based on tree decoding and other optimizations
- Speculative Decoding with Big Little Decoder: similar to speculative decoding, accepted in NIPS'23
- Online Speculative Decoding: update draft model online
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding: the trade-off analysis deserves a read
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models: analysis of combining spec decoding with batching
- REST: Retrieval-Based Speculative Decoding: use retrieval for spec decoding, some familiar names in the authors list
- Cascade Speculative Drafting for Even Faster LLM Inference: by UIUC
- Multi-Candidate Speculative Decoding: multiple draft models
- ⭐ Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding: survey for Speculative Decoding
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding: a work with Yang YOU's name
- Decoding Speculative Decoding: provide some insight into the selection of draft models
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting: perhaps tree speculative decoding?
- ⭐ Speculative Streaming: Fast LLM Inference without Auxiliary Models: a promising method for speculative decoding
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding: accelerating spec decoding
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens: accelerate spec decoding with Fusing all tokens
- Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding: using several SSMs, adaptive SSM prediction length, pipelining SSM decode and LLM verify
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
- Accelerating LLM Inference with Staged Speculative Decoding: token tree and a second stage of speculative decoding
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding: combine KV cache with spec decoding
- EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models: algorithm optimization in spec decoding
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices: any difference with specinfer?
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput: model the speculative decoding length
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding: spec decoding for long-context
- QSpec: Speculative Decoding with Complementary Quantization Schemes: spec decoding with quantization, a novel A+B
- Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement: optimization on Medusa
- The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation: use learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context
- EdgeLLM: Fast On-device LLM Inference with Speculative Decoding: seems an extended work of LLMCad
- AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding: a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints
- SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding: dynamically adjusts speculative strategies according to real-time request loads and system configurations
- ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts: combining multi-level speculative decoding with MXFP4 quantized drafts, simple but it works
- SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models: using multiple heterogeneous SSMs with a learning-based algorithm for SSM selection, request decomposition method to minimize batching overhead during LLM verification, pipelining speculation and verification phases on GPU
- ⭐ SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning: targets reasoning models and CoT, under guidance of Zhihao JIA; maybe refer to multi-agent?
- SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models: multi-level speculative decoding, under guidance of Jidong ZHAI
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding: use both LLM and SLM
- Adaptive Skeleton Graph Decoding: successor of Skeleton-of-Thought
Some knowledge about data parallelism, tensor (model) parallelism, and pipeline parallelism will help in this track.
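A quick numpy sketch of why tensor (model) parallelism works for a single linear layer: column-parallel shards are gathered, row-parallel partial sums are all-reduced. This is illustrative only, with no real devices or communication.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))        # a batch of activations
W = rng.normal(size=(16, 32))       # full weight of one linear layer

# Column-parallel: each of 4 "devices" holds a slice of W's output columns.
shards = np.split(W, 4, axis=1)
partial_outputs = [x @ w for w in shards]       # computed independently per device
y_tp = np.concatenate(partial_outputs, axis=1)  # gather the results (all-gather)
assert np.allclose(y_tp, x @ W)

# Row-parallel: shard W's input rows and the activations; sum partial results (all-reduce).
w_rows = np.split(W, 4, axis=0)
x_cols = np.split(x, 4, axis=1)
y_rp = sum(xc @ wr for xc, wr in zip(x_cols, w_rows))
assert np.allclose(y_rp, x @ W)

print("both sharded results match the full matmul")
```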
- ⭐ Efficiently Scaling Transformer Inference: use model parallelism to accelerate inference, by Google, in MLSys'23
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment: a distributed inference engine that supports asymmetric partitioning of the inference computation
- InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding: Efficient Long-sequence training
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference: accepted by PPoPP'24
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: full-stack approach of LLM training
- DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers: sequence parallel by Yang YOU
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism: Elastic Sequence Parallelism?
- GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism: this could be potential in inference
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models: pipeline parallelism
- QUART: Latency-Aware FaaS System for Pipelining Large Model Inference: pipeline in serving and fast expanding
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: optimize sequence parallel
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts: optimize sequence parallel
- ⭐ PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation: pipeline parallelism and speculation, accepted by SC'24
- HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment: algorithmic analysis of resource allocation, parallel strategy, and KV transfer in a disaggregated LLM system
- ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput: explores design spaces to suggest architectures that meet the requirements of both vendors and users
- Seesaw: High-throughput LLM Inference via Model Re-sharding: dynamic model re-sharding, facilitates the dynamic reconfiguration of parallelization strategies across prefill-decode stages, accepted by MLSYS'25
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training: fill the bubbles with other GPU workload
- ⭐ gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling: fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens, to balance the pipeline stages in PP
- Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: Sequence Pipeline Parallelism (SPP) to reduce time-to-first-token by pipelining prefill chunks, and KV-Cache Parallelism (KVP) to lower time-per-output-token by distributing decoding across servers
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models: overlap comm with comp, similar to Liger
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning: accepted by ASPLOS'24
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives: many work about overlap in LLM, accepted by ASPLOS'24
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion: Fine-grained decomposition, perhaps provide some experiment result
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference: modify the model design for fast decoding, based on comm-comp overlapping
- NanoFlow: Towards Optimal Large Language Model Serving Throughput: overlapping based on nano-batches, with some interesting engineering implementation
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping: overlapping, provided by Deepspeed team
- PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving: overlap communication with model-weights/KV-cache prefetch
- Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning: use compilation to schedule overlap, accepted by ASPLOS'25
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives: TileLink to efficiently generate overlapped kernels for LLMs using tile-centric primitives and mappings, accepted by MLSYS'25
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation: a novel signaling mechanism to identify tile-wise data dependency without interrupting the computation process, and reorders data to contiguous addresses, enabling communication by simply calling NCCL APIs
Here I ignore some of the earliest papers and focus on the latest work optimizing this.
- ⭐ Seesaw: High-throughput LLM Inference via Model Re-sharding: dynamic model re-sharding to facilitate the dynamic reconfiguration of parallelization strategies across stages, reducing the overhead caused by frequent stage transitions (seems like elastic scheduling)
- DynamicAttention: Dynamic KV Cache for Disaggregate LLM Inference: DynamicAttention, it allocates a continuous virtual GPU memory space at startup, but does not actually allocate physical GPU memory?
- ⭐ semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage: disaggregated computation and unified storage, a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases
- ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads: requests grouping, disaggregation and resource scheduling
- Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture: dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics
- DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving: arbitrarily splits each request at any token boundary into at most two cooperating segments, then uses a two-level scheduling framework to balance micro-request load across unified GPU instances
An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computing.
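For reference, a small numpy sketch of 2:4 (N:M) semi-structured pruning, the pattern that Sparse Tensor Cores accelerate; the simple magnitude-based selection here is only a stand-in for real pruning pipelines.

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Apply 2:4 semi-structured sparsity: in every group of 4 consecutive
    weights along the last axis, keep the 2 with the largest magnitude."""
    groups = w.reshape(-1, 4)
    # Indices of the 2 smallest-magnitude weights in each group get zeroed.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.default_rng(0).normal(size=(4, 8))
w_sparse = prune_2_4(w)
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) <= 2).all()
print(w_sparse)
```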
- ⭐ Accelerating Sparse Deep Neural Networks: use N:M sparsity to fully utilize the hardware for acceleration, by Nvidia
- ⭐ Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time: interesting paper on using sparsity, under guidance of Tri DAO and Ce ZHANG, accepted in ICML'23
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism: accepted by PPoPP'23
- ⭐ PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation: a novel way to deal with dynamic sparsity, may be used for GNN and MoE, accepted by SOSP'23
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving: seems a follow-up work of Deja Vu, also focuses on KV-Cache
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference: sparsity in FFN
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models: a simple and effective sparsification method named "ProSparse"
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters: work for PowerInfer
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations: pruning for LLM
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention: inference framework based on sparse attention, by Microsoft
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models: use ReLU to improve sparsity, just like PowerInfer
- CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation: algorithm optimization that can utilize sparsity to accelerate inference
- Star Attention: Efficient LLM Inference over Long Sequences: a two-phase block-sparse approximation
- Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries: use sparse coding over universal dictionaries to compress the KV cache, which is novel
- SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters: algorithm to replace a layer with the previous adjacent layer and recovery parameters (based on fine-tuning), to decrease memory overhead
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking: accepted by MLSys'25
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs: Tensor-Core-Aware Bitmap Encoding (TCA-BME) and a sparse GEMM kernel, making unstructured pruning's theoretical advantages translate into practical performance gains, accepted by EuroSys'25
- Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores: accepted by EuroSys'25
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention: efficient long-context LLM serving with unified block sparse attention, up to 3.3x faster decoding than TensorRT-LLM, accepted by MLSys'25
Low-precision for memory and computing efficiency.
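As a reference point, a minimal sketch of symmetric per-group weight-only INT8 quantization; production kernels (e.g., in the AWQ/QServe line of work below) pack bits and fuse dequantization into the GEMM, which this sketch does not attempt.

```python
import numpy as np

def quantize_int8(w: np.ndarray, group_size: int = 64):
    """Symmetric per-group weight-only INT8 quantization (simplified sketch)."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0   # one scale per group
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(0).normal(size=(128, 128)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s, w.shape)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```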
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- ⭐ LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: by UW
- ⭐ SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models: paper under guidance of Song HAN
- ⭐ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration: paper under guidance of Song HAN
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving: paper under guidance of Tianqi CHEN, quantization is not important, designing how to quantify is important, in review of MLSys'24
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
- Understanding the Impact of Post-Training Quantization on Large Language Models: tech report will help
- ⭐ LLM-FP4: 4-Bit Floating-Point Quantized Transformers: by HKUST, accepted in EMNLP'23
- ⭐ Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization: by SJTU, accepted in DAC'24
- INT4 Weight + FP8 KV-Cache: optimization for LLM inference: INT4 weight + FP8 KV-Cache + continuous batching
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization: quant KV cache
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference: simple and crude optimization work
- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization: for heterogeneous clusters and adaptive quantization, under guidance of Chuan WU, accepted by PPoPP'24 (poster)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact: use pivot token
- QAQ: Quality Adaptive Quantization for LLM KV Cache
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving: quantization in inference, under guidance of Song HAN
- Does compressing activations help model parallel training?: analysis of compression (including pruning and quantization) in MP training, accepted by MLSys'24
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression: compress KV cache with quantization
- Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs: with targeted activate function
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design: FPx quantization, accepted by ATC'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: combine quantization with MoE
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference: apply quantization and Maximum Inner-Product Search for KV Cache compression
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs: provide efficient kernels for lookup quantization
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation: a computation optimization for Low-Precision
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs: a computation optimization for 6-bit LLM
- Mixture of Experts with Mixture of Precisions for Tuning Quality of Service: quantization on MoE models
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference: compress the KV Cache
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models: quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents
- Progressive Mixed-Precision Decoding for Efficient LLM Inference: gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers
- COMET: Towards Practical W4A4KV4 LLMs Serving: provides a quantization algorithm, quantization kernel, and SM schedule method
- MixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction: quantization with outliers, optimization on AWQ, accepted by SC'24
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference: low-bit compression to accelerate communication
- Unifying KV Cache Compression for Large Language Models with LeanKV: combine quantization and sparsity to compress the KV cache
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design: mix quantization, effectively assigning the larger bit-width to output features that need it most to achieve good accuracy with low memory consumption
- KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference: KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference
- HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference: quantization to decrease kvc transfer overhead in disaggregation and eliminate kv dequantization
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models: Mixed-precision Auto-Regressive LINear kernels, accepted by PPoPP'25
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators: augments highly quantized MoEs with a mixture of low-rank compensators, provide 3-bit tensorcore kernels, accepted by MLSYS'25
- PacQ: A SIMT Microarchitecture for Efficient Dataflow in Hyper-asymmetric GEMMs: accelerator design, but may be helpful
- Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference: applies mixed-precision quantization to the key-value (KV) cache at token/chunk granularity
- Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization: employs an online-offline hybrid approach, setting outlier thresholds offline, which are then used to determine the quantization scale online
- SQuat: Subspace-orthogonal KV Cache Quantization: a more efficient quantization algorithm(?)
- Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving: support Arbitrary Low-Precision computation with high performance
Perhaps the most important way to improve throughput in LLM inference.
The blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.
Update 2023/12/12: I'd like to use Continuous Batching in place of the Dynamic Batching I used before; the name Dynamic Batching is more commonly used in Triton.
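A toy sketch of the continuous (iteration-level) batching idea: admission and eviction happen at every decoding step instead of once per batch, so short requests never wait for the longest one; request lengths here are random placeholders.

```python
import random
from collections import deque

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(1, 6)} for i in range(8))
running, MAX_BATCH, step = [], 4, 0

# Continuous batching: refill freed slots and evict finished requests every step.
while waiting or running:
    while waiting and len(running) < MAX_BATCH:      # admit new requests immediately
        running.append(waiting.popleft())
    for req in running:                              # one decode step for the whole batch
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished requests {finished}")
```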
- ⭐ Orca: A Distributed Serving System for Transformer-Based Generative Models: continuous batch processing without redundant computing, accepted in OSDI'22
- Fast Distributed Inference Serving for Large Language Models: considering Job Completion Time(JCT) in LLM serving, paper under guidance of Xin JIN
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline: schedule based on response length prediction by LLM, paper under guidance of Yang YOU
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput: idea similar to above, by Harvard University
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills: chunking the prefill phase and reducing pipeline bubbles, by MSR India
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference: accepted by HiPC'23
- Handling heavy-tailed input of transformer inference on GPUs: accepted by ICS'22
- CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system: Some form of inference service
- TCB: Accelerating Transformer Inference Services with Request Concatenation: perhaps similar to ByteTransformer, accepted by ICPP'22
- Fairness in Serving Large Language Models: under guidance of Ion Stoica, accepted by OSDI'24
- Characterizing and understanding deep neural network batching systems on GPUs: benchmarking is important
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts: think about the memory access of KV cache
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve: follow-up work of sarathi
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction: predict length
- LiveMind: Low-latency Large Language Models with Simultaneous Inference: perform inferences with incomplete prompts, to take advantage of streaming prompt
- A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length: theoretical analysis of latency
- ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models: seems similar to ORCA or bytetransformer?
- BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching: optimization on ORCA, dynamic re-batching
- EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving: A fusion monster with a variety of optimization techniques
- ⭐ AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality: what's Redundancy
- Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching: formalize as an optimization problem and adjust the batch size based on this
- Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration: fine-grained chunked prefill with decode, but what is SM-masked stream?
This part includes some impressive works optimizing LLM computation by observing the underlying computing properties, such as FlashAttention and its successors.
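A small numpy sketch of the online-softmax trick that FlashAttention and Flash-Decoding build on: attention for one query is computed block by block over K/V with a running max and denominator, never materializing the full score vector. It is a single-query, CPU-side illustration, not a kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, block = 16, 128, 32
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Online-softmax attention over K/V blocks with running max m and denominator l.
m, l, o = -np.inf, 0.0, np.zeros(d)
for start in range(0, n, block):
    s = K[start:start + block] @ q                 # scores for this block only
    m_new = max(m, s.max())
    corr = np.exp(m - m_new)                       # rescale previously accumulated partials
    p = np.exp(s - m_new)
    l = l * corr + p.sum()
    o = o * corr + p @ V[start:start + block]
    m = m_new
out_blocked = o / l

# Reference: standard softmax attention computed in one shot.
s = K @ q
w = np.exp(s - s.max()); w /= w.sum()
assert np.allclose(out_blocked, w @ V)
print("blocked attention matches the reference")
```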
- ⭐ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: one of the most important work these years, both simple and easy to use, by Tri DAO
- ⭐ FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning: you'd better not ignore it
- ⭐ Flash-Decoding for long-context inference: you'd better not ignore it, too
- ⭐ Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: successor to FlashAttention in inference, accepted by VLDB'24
- ⭐ FlashDecoding++: Faster Large Language Model Inference on GPUs: worth reading, a FlashDecoding follow-up
- SubGen: Token Generation in Sublinear Time and Memory
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers: modification in self-attention
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Flex Attention: A Programming Model for Generating Optimized Attention Kernels: auto-generated attention kernel
- Splitwise: Efficient generative LLM inference using phase splitting: splitting prefill and decode in a map-reduce style, by UW and Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: also split the prefill and decode, accepted by OSDI'24
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads: seems a combination of SARATHI and Splitwise
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference: similar to splitwise, accepted by ASPLOS'24
- Splitwiser: Efficient LLM Inference with Constrained Resources
- ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit: Token-adaptive Early Exit
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures: compilation optimization on the computation graph
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference: optimize attention kernel in mix-batching
- Focus: High-Performant and Customizable Attention Engine for LLM Serving: flexible attention engine, advised by Chen Tianqi and accepted by MLSYS'25
- ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming: Multi-level Triton
This part is inspired by PagedAttention of vLLM. And there are many top-conference papers discussing memory management in DL computing on GPUs.
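A toy block-table allocator in the spirit of PagedAttention (greatly simplified: no actual tensors, swapping, or copy-on-write), just to show how logical token positions map to physical KV blocks and why allocation stays fragmentation-free.

```python
class PagedKVCache:
    """Toy paged KV cache: memory is carved into fixed-size blocks, and each
    sequence owns a block table mapping its logical positions to physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # seq_id -> list of physical block ids
        self.seq_lens = {}                # seq_id -> number of tokens stored

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks: preempt or swap")
            table.append(self.free_blocks.pop())     # allocate exactly one more block
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                                   # a 40-token sequence uses 3 blocks
    cache.append_token(seq_id=0)
print(cache.block_tables[0])
cache.free(0)
```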
- ⭐ Efficient Memory Management for Large Language Model Serving with PagedAttention: memory page management for the KV-Cache in Attention-type models, accepted by SOSP'23 (many papers cite the vLLM project instead of this paper, which makes it harder to track its "cited by")
- ⭐ AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs: cache management for inference, accepted by MLSys'23
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs: block-based data layout, accepted by TACO, October 2023
- AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems: a unique observation that there is rich similarity in attention computation across inference sequences
- BPIPE: memory-balanced pipeline parallelism for training large language models: memory balance perhaps can also work well in inference, by SNU, accepted by ICML'23
- Improving Large Language Model Throughput with Efficient Long-Term Memory Management: perhaps a new view
- CacheGen: Fast Context Loading for Language Model Applications
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models: considers the memory consumption in fine-tuning
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference: compress KV Cache
- LLM as a System Service on Mobile Devices: LLM as a service on mobile devices
- DistMind: Efficient Resource Disaggregation for Deep Learning Workloads: by Xin JIN, accepted by ToN, Jan 2024
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching: sparsity in KV Cache, accepted by ISCA'24
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving: a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention: improve PagedAttention
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models: only computes and caches the KVs of a small number of layers
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models: compress KV cache
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion: a very popular idea recently
- Block Transformer: Global-to-Local Language Modeling for Fast Inference: build KV Cache blocks from many tokens' KV Cache
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool: KV Cache management in a P/D disaggregation architecture
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention: multi-round chat and memory management, accepted by ATC'24
- Stateful Large Language Model Serving with Pensieve: similar to CachedAttention
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving: P/D disaggregation architecture and KV Cache management
- P/D-Serve: Serving Disaggregated Large Language Model at Scale: a P/D based system, with D2D access optimization
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management: offload KV Cache
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption: a survey on optimizing the KV Cache
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving: tensor management especially for LLM inference
- Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation: remove unimportant tokens from the KV Cache
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving: compression and streaming transfer of the KV Cache, accepted by SIGCOMM'24
- Compute Or Load KV Cache? Why Not Both?: recompute and load together for long context
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management: manage the KV Cache by layer
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching: compress KV cache and multi-level memory
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models: better prefix caching
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference: low-rank KV cache and dynamic KV cache rebuilding
- ⭐ VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration: the first work I have seen that optimizes the KV cache in vision models
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction: KV cache page eviction and recall, accepted by NIPS'24
- SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation: optimization on ZeRO? redesigns the data flow of heterogeneous hardware and sharded model training to minimize excessive communication overhead, accepted by NIPS'24
- ⭐ KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management: memory management for KV cache and parameters, seems a novel work considering weight migration
- SYMPHONY: Improving Memory Management for LLM Inference Workloads: dynamically migrates K,V caches to enable fine-grained scheduling of inference requests
- Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management: efficiently migrate requests and their KV cache among GPUs
- Efficient LLM Inference with Activation Checkpointing and Hybrid Caching: recompute + cache for KV cache management, only recomputes attention (no projection)
- Memory Offloading for Large Language Model Inference with Latency SLO Guarantees: offload KV cache to CPU memory
- Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving: sparse attention is hot recently; dynamic KV cache budget and efficient KV cache loading from the CPU
- Efficient and scalable huge embedding model training via distributed cache management: caching based on staleness and skewed popularity distributions
- BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference: different KV heads have different importance, then offload and compress
- Fast State Restoration in LLM Serving with HCache: cache for offloading the KV cache to the CPU, accepted by EuroSys'25
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference: use model replication to improve serving throughput and GPU utilization?
- Characterizing the Behavior and Impact of KV Caching on Transformer Inferences under Concurrency: instruments vLLM to measure and analyze fine-grained metrics (token throughput, KV cache memory access patterns, load balancing of the forward passes) during different inference stages (prefill, decode, batching, and KV cache eviction policies) in several scenarios
- Mitigating KV Cache Competition to Enhance User Experience in LLM Inference: mitigating KV Cache competition with several techniques
- Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache: KV cache reuse is able to save cloud cost across a range of workloads with long context
- KVSort: Drastically Improving LLM Inference Performance via KV Cache Compression: error-bounded lossy compression on sorted KV vectors
- FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework: dynamic batching and a KV cache pool for multimodal KV cache compression, guided by Jidong ZHAI
- Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management: multi-level KV cache management (an idea lacking innovation) and request reordering, accepted by ASPLOS'25
- Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains: memory management framework for a sudden increase in the number of inference requests to a cloud-hosted LLM, accepted by ASPLOS'25
- ⭐ Jenga: Effective Memory Management for Serving LLM with Heterogeneity: optimization on PagedAttention, targeted at heterogeneous embeddings in LLMs
- Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching: KV cache load/offload?
- Hardware-based Heterogeneous Memory Management for Large Language Model Inference: an asymmetric memory architecture consisting of capacity-centric and bandwidth-centric memory with computation units attached to each memory device, more like a hardware paper
- Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving: survey and analysis of KV cache compression techniques for LLM serving, accepted by MLSys'25
- FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference: FastTree introduces GPU kernels tailored for efficiently processing queries that share contexts through a radix tree
Note: some papers about prefix sharing are not in this section.
- LLM Query Scheduling with Prefix Reuse and Latency Constraints: balancing prefix reuse and fairness in query scheduling
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching target at State Space Models
- Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective: a helpful survey
- ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput: balances throughput and latency under different hardware
- Understanding and Optimizing Multi-Stage AI Inference Pipelines: a Heterogeneous Multi-stage LLM inference Execution Simulator
- Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library: implement some APIs to reduce the shared memory footprint, accepted in HPC Asia'23
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture: help us understand GPUs
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving: optimizing energy consumption by lowering GPU frequency
- Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference: similar to cutlass, optimization on intel GPU
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels: for Ascend GPU (perhaps also work for NVIDIA?)
- MEPipe: Democratizing LLM Training with Memory-Efficient Slice-Level Pipeline Scheduling on Cost-Effective Accelerators: maybe inference on RTX4090?
- PASK: Cold Start Mitigation for Inference with Proactive and Selective Kernel Loading on GPUs: DAC's paper for hardware
- ⭐ Hardware Compute Partitioning on NVIDIA GPUs: spatially partition the computing units of NVIDIA GPUs transparently, worth reading
- Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing: Bless leverages precise computing resource management and fine-grained kernel scheduling to ensure stringent quota guarantees and reduce latency fairly for applications with varying GPU quotas, accepted by EuroSys'25
Heterogeneous scenarios or a single PC are becoming increasingly important.
Optimizing computation on the CPU or SSD requires different methods.
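A minimal PyTorch-style sketch of the simplest offloading pattern used by several works below: weights stay in CPU memory and each layer is streamed to the accelerator right before it runs (real systems prefetch asynchronously to hide the transfer and often pin memory); the toy `layers` model is hypothetical.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy "model": weights stay in CPU RAM between uses.
layers = [nn.Linear(2048, 2048) for _ in range(8)]

def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:
        layer.to(device)        # stream this layer's weights to the accelerator
        x = layer(x)
        layer.to("cpu")         # release accelerator memory before the next layer
    return x

with torch.no_grad():
    print(offloaded_forward(torch.randn(1, 2048)).shape)
```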
- Efficient LLM Inference on CPUs: LLMs with quantization on CPUs, by Intel, accepted by NIPS'23
- Inference Performance Optimization for Large Language Models on CPUs: xFasterTransformer, LLM inference optimization on CPUs, by Intel
- Distributed Inference Performance Optimization for LLMs on CPUs: similar work to the above, by Intel
- Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference: inference on CPU based on advanced hardware
- TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload: free to run operations such as GPU kernel calls in many different orders
- Improving Throughput-oriented Generative Inference with CPUs: cooperation of CPUs and GPU, accepted by APSys'23
- Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs: execute the operators on the CPU and GPU in parallel, by SJTU
- EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices: inference on edge devices, accepted by ICDE'23
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: by SJTU IPADS
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory: by Apple
- Efficient LLM inference solution on Intel GPU: Intel GPU is interesting
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines: efficient serving with a CPU-GPU system
- Efficient and Economic Large Language Model Inference with Attention Offloading: similar to FastDecode
- Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference: similar to FastDecode: CPU for attention and GPU for the rest
- Petals: Collaborative Inference and Fine-tuning of Large Models: looks like heterogeneous resources are being utilized
- Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures: analyzes performance on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems
- NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
- ⭐ A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors: use CPU for DL, accepted by ASPLOS'24
- LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control: based on offloading
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: computation on CPU with quantization
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading: how to use SSDs?
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference: offload the KV Cache to CSDs (Computational Storage Drives)
- TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference: some ideas for using the CPU
- Improving Throughput-oriented LLM Inference with CPU Computations: pipelining in CPU-GPU inference
- Understanding Performance Implications of LLM Inference on CPUs: analysis of using CPUs for inference
- GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines: the NIC can be important, especially in communication
- Pie: Pooling CPU Memory for LLM Inference: use CPU memory to enlarge the batch size and improve throughput, by Ion Stoica
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference: offload KV cache and attention to the CPU for larger batch sizes, similar to FastDecode, by Ion Stoica, accepted by MLSys'25
- Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems: more like inference on personal devices
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation: use recomputation and transfer to reproduce the KV cache; can use their runtime and split parallelism
Inspired by AI PCs, this opens up a new area.
It now also includes edge systems.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU: inference a 30B model with a 16GB GPU, accepted by ICML'23
- LLM as a System Service on Mobile Devices: an intro for LLM on private devices
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: based on sparsity in NN Layers
- ⭐ LLM for Mobile: An Initial Roadmap: a road map
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone: work on smartphone
- Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM: on edge devices, accepted by MICRO'24
- ⭐ HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators: features on mobile SoCs, tensor partition strategy, to do Heterogeneous AI inference
- PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks: cloud(LLM)-edge(SmallLM) collaboration
- FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference: offloading based framework, asynchronous prefetching, balanced memory locking, and flexible tensor preservation
- Fast On-device LLM Inference with NPUs: chunked prefill, offload outlier to CPU/GPU, schedule computation to NPU/CPU/GPU, accepted by ASPLOS'25
- FlexInfer: Flexible LLM Inference with CPU Computations: offload kvc and weights to CPU, accepted by MLSYS'25
- An Adaptive and Scalable Framework for Resource-Efficient Deployment of Mixture of Experts in LLM-Based Intelligent IoT Networks: deploy MoE on IoT, but the strategies are commonly used
- A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models: we can learn edge-cloud serving from this paper, based on speculative decoding
- HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents: assign sub-tasks of LLM agent to local SLM and cloud-side LLM
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs: decentralized system on consumer-level GPUs, though there will be some problems
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet: some techniques in this paper will be instructive
- ⭐ HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices: heterogeneous parallel computing using CPUs and GPUs
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs: accepted by ATC'24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: we can get a performance model for heterogeneous GPU clusters and learn from the algorithmic analysis
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity: making heterogeneity-aware GPU provisioning decisions for LLM serving
In this part, researchers provide some algorithm-based methods for optimizing LLM inference.
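A small sketch of the heavy-hitter KV cache eviction idea in the spirit of the H2O/Scissorhands/Keyformer line of work below: keep the most recent tokens plus the tokens with the largest accumulated attention scores; the attention scores here are random placeholders.

```python
import numpy as np

def heavy_hitter_keep(attn_scores: np.ndarray, budget: int, recent: int):
    """Return indices of KV entries to keep: the `recent` newest tokens plus
    the old tokens with the largest accumulated attention scores."""
    n = attn_scores.shape[-1]
    assert budget > recent
    if n <= budget:
        return np.arange(n)
    recent_idx = np.arange(n - recent, n)
    acc = attn_scores[:, : n - recent].sum(axis=0)    # accumulated score per old token
    heavy_idx = np.argsort(acc)[-(budget - recent):]  # the "heavy hitters"
    return np.sort(np.concatenate([heavy_idx, recent_idx]))

rng = np.random.default_rng(0)
scores = rng.random((32, 100))       # (queries seen so far, cached tokens) -- placeholder
keep = heavy_hitter_keep(scores, budget=20, recent=8)
print(len(keep), keep)
```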
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models: accepted by NIPS'23
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time: consider the different importance of tokens in KV Cache, similar to H2O
- ⭐ SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: skipping may be a useful method, like spec decoding
- Inference with Reference: Lossless Acceleration of Large Language Models: also a potential optimization
- Efficient Streaming Language Models with Attention Sinks: streaming LLM for infinite sequence lengths, by MIT and under guidance of Song HAN
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference: also important tokens, just like H2O, accepted by MLSys'24
- Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache: an optimization to H2O, accepted by MLSys'24
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval: use approximate nearest neighbor search to search the most relevant KV cache
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs: based on the observation that adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention: sparse attention
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation: algorithm optimization for less KV Cache
- Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU: use characterization results to optimize KV Cache management
- ⭐ DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: you must know DeepSpeed
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- DeepSpeed Model Implementations for Inference (MII)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs: developed by ByteDance, accepted by IPDPS'23
- TurboTransformers: an efficient GPU serving system for transformer models: by Tencent Inc, accepted by PPoPP'21
- Accelerating Generative AI with PyTorch II: GPT, Fast: a blog in PyTorch, use only PyTorch code, gpt-fast
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving: based on FlexFlow
- FlashInfer: Kernel Library for LLM Serving
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- Efficiently Programming Large Language Models using SGLang: we can get some optimization from here
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models: different parallelism strategies, by Tencent
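As background for the KV-cache entries above (H2O, Scissorhands, Keyformer, Q-Hitter): the shared idea is to keep a small window of recent tokens plus the tokens that have accumulated the most attention, and evict the rest. A minimal sketch follows, assuming we track one accumulated attention score per cached token; the budget sizes and scoring rule here are illustrative, not any single paper's exact policy.

```python
import numpy as np

def evict_kv_cache(attn_scores, recent_window=32, heavy_budget=96):
    """Pick which cached token positions to keep.

    attn_scores: 1-D array with the accumulated attention each cached token
                 has received so far (a proxy for its importance).
    Returns the sorted indices of tokens to keep in the KV cache.
    """
    n = len(attn_scores)
    keep = set(range(max(0, n - recent_window), n))        # always keep recent tokens
    older = np.argsort(attn_scores[: max(0, n - recent_window)])[::-1]
    keep.update(int(i) for i in older[:heavy_budget])       # plus the heavy hitters
    return sorted(keep)

# usage: after each decode step, accumulate the new attention row into
# `scores`, then drop K/V entries whose index is not returned above.
scores = np.random.rand(512)
kept = evict_kv_cache(scores)
assert len(kept) <= 32 + 96
```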
LLM service providers will focus on this part. Engineering practices are just as important as algorithm optimization.
- ⭐ AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: accepted by OSDI'23
- ⭐ STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining: Elastic will be important in the future, accepted by ASPLOS'23
- INFaaS: Automated Model-less Inference Serving: accepted by ATC'21
- Tabi: An Efficient Multi-Level Inference System for Large Language Models: under the guidance of Kai CHEN, accepted by EuroSys'23
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance: cost is what service providers care about most
- FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning: accepted by NSDI'23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud: model ensembling, accepted by NSDI'22
- SLA-Driven ML Inference Framework for Clouds with Heterogeneous Accelerators: accepted by MLSys'22
- FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference: accepted by ICPP'23
- Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
- BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching: accepted by SC'20
- MArk: exploiting cloud services for cost-effective, SLO-aware machine learning inference serving: accepted by ATC'19
- ⭐ MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters: challenges and solutions in real-world scenarios, accepted by NSDI'22
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads: under the guidance of Ion Stoica
- Learned Best-Effort LLM Serving: a best-effort serving system from UCB
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences: accepted by OSDI'22, enables microsecond-scale kernel preemption and controlled concurrent execution in GPU scheduling
- PipeSwitch: fast pipelined context switching for deep learning applications: PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications, accepted by OSDI'20
- ⭐ Paella: Low-latency Model Serving with Software-defined GPU Scheduling: how the tasks are scheduled to GPUs, accepted by SOSP'23
- OTAS: An Elastic Transformer Serving System via Token Adaptation: elastic in serving while considering SLO
- DeltaZip: Multi-Tenant Language Model Serving via Delta Compression: Multi-Tenant is interesting
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models: find different problems in serving LLMs
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access: accepted by EuroSys'23
- Towards Pareto Optimal Throughput in Small Language Model Serving: Small Language Model Serving
- MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services: idea of QoE
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving: similar to FlexLLM
- LLMServingSim: A Simulation Infrastructure for LLM Inference Serving Systems: provide some features about LLM serving
- Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving: improvements to ORCA (SLS) and FastServe (ILS)
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems: consider serving efficiency from the energy view
- Power-aware Deep Learning Model Serving with μ-Serve: consider energy
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming: a new token transmission scheme, useful in chatbots
- Responsive ML inference in multi-tenanted environments using AQUA: serves several LLMs by time-sharing GPU cycles and offloading context to other GPUs in multi-tenanted environments
- Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning: effect of hyper-parameters in the inference engine
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: request scheduling
- Efficient LLM Scheduling by Learning to Rank: rank requests based on predicted output length and schedule accordingly (see the scheduling sketch after this list)
- UELLM: A Unified and Efficient Approach for LLM Inference Serving: serving optimization in MaaS clouds
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving: scheduling the requests
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving: harvest stranded GPU resources for offline LLM inference tasks
- LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services: accepted by SC'24
- Revisiting SLO and Goodput Metrics in LLM Serving: check the metrics SLO and Goodput in LLM serving
- Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system: schedule tasks in a multi-tenant deep learning (DL) cluster, accepted by SoCC'24
- ⭐ Ensuring Fair LLM Serving Amid Diverse Applications: ensures fair LLM access across diverse applications, with a copilot trace analysis
- BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching: exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching: similar to BlendServe
- iServe: An Intent-based Serving System for LLMs: uses a cost model to dynamically set the deployment configuration
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms: seems a practical engineering work? takes temperature and power consumption into account
- ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments: a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments, and fluctuating online conditions
- ⭐ MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism: we can learn from the expert-attention disaggregation
- SkyServe: Serving AI Models across Regions and Clouds with Spot Instances: seems to be follow-up work to SpotServe, serves AI models over a mixture of spot and on-demand replicas, EuroSys'25
- Past-Future Scheduler for LLM Serving under SLA Guarantees: an efficient request scheduler that considers the historical distribution of request output lengths and calculates memory occupancy at each future time point, and the framework LightLLM
- Deferred prefill for throughput maximization in LLM inference: looks a bit counter-intuitive
- Performance Aware LLM Load Balancer for Mixed Workloads: a heuristic-guided, reinforcement learning-based router with a trainable response-length predictor and a novel formulation for estimating the impact of mixing different workloads
- Niyama: Breaking the Silos of LLM Inference Serving: request scheduling paper
- ⭐ Optimizing SLO-oriented LLM Serving with PD-Multiplexing: PD multiplexing, enabling in-place and phase-decoupled compute partition, seems different from simple multiplexing
- Ascendra: Dynamic Request Prioritization for Efficient LLM Serving: set high or low priority for requests
- PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications: targets prefill-only workloads, which output only one token
- ⭐ ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production: shows real production LLM workloads
- Efficient LLM Serving on Hybrid Real-time and Best-effort Requests: co-locates real-time and best-effort requests, proposes request scheduling and KV cache sharing
- SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling: a state-aware scheduling that optimizes the SLO attainment in LLM serving
- ⭐ A System for Microserving of LLMs: seems an idea and industrial practice that makes sense
- DeepFlow: Serverless Large Language Model Serving at Scale: provide fine-grained LLM service
- ⭐ Towards Swift Serverless LLM Cold Starts with ParaServe: pipeline parallelism with a dynamically adjusted parallelism strategy, and accelerates cold start
- λScale: Enabling Fast Scaling for Serverless Large Language Model Inference: a serverless inference system that achieves fast model scaling via fast model multicast, inference execution during model transmission, and dynamically constructed execution pipelines
- Medusa: Accelerating Serverless LLM Inference with Materialization: targets the cold start of serverless LLMs, solving the available-KV-cache-blocks profiling and CUDA graph capture problems, accepted by ASPLOS'25
- SMore: Enhancing GPU Utilization in Deep Learning Clusters by Serverless-based Co-location Scheduling: serverless computing reveals an opportunity to optimize GPU utilization with fine-grained resource allocation
- PipeBoost: Resilient Pipelined Architecture for Fast Serverless LLM Scaling: rapidly launch inference services in response to bursty requests without preemptively over-provisioning GPUs
- Enabling Elastic Model Serving with MultiWorld: optimizing collective communication lib for LLM inference
- Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks
- AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive: communication strategy adapted at runtime, ICDCS'24
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training: a communication scheduler that aims to maximize GPU computation utilization by mitigating the communication contention among DLT jobs, SIGCOMM'24
- TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections: by Luo MAI, similar to SpotServe?
- SpotServe: Serving Generative Large Language Models on Preemptible Instances: by Xupeng MIAO and under the guidance of Zhihao JIA
- Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances: by the SpotServe team
- FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms: a FaaS-oriented Performance-aware Elastic Scaling system to enable efficient resource allocation in serverless platforms for ML jobs, accepted by SoCC'24
- Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale: resource allocation at cluster and data center scale
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows: scheduler for latency-sensitive request
- Llumnix: Dynamic Scheduling for Large Language Model Serving: scheduling across multiple instances may be helpful for me now
- Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths: solve Dynamic Input Lengths by multi-instance and request scheduling
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: scheduling based on an output length predictor
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs: request scheduling in cluster and on instance
- Fast Inference for Augmented Large Language Models: schedule for Augmented LLM
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling: a hodgepodge of prediction-based scheduling, memory management, and quantization
- The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving: cost model in request scheduling
- Queue Management for SLO-Oriented Large Language Model Serving: schedules requests with different models and different SLO requirements
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving: fairness and request switch
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location: request co-location to maximize serving throughput and prevent starvation, without compromising online serving latency
- Locality-aware Fair Scheduling in LLM Serving
- Queueing, Predictions, and LLMs: Challenges and Open Problems: prediction-based queueing and serving
- Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents: throughput-optimal scheduling analysis
- Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving: a memory-efficient hidden-state cache? and scheduling to use a bigger batch
- LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications: an uncertainty-aware scheduling framework for emerging compound LLM applications
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: share prefix and optimize KV Cache
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters: the beginning of serving for LoRA, under the guidance of Ion Stoica, accepted by MLSys'24
- Dynamic LoRA Serving System for Offline Context Learning: successor of S-LoRA
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference: serving LoRA is becoming more and more important
- Punica: Multi-Tenant LoRA Serving: accepted by MLSys'24
- Petals: Collaborative Inference and Fine-tuning of Large Models
- LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design: maybe useful, kernel optimization
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving: accepted by OSDI'24
- Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU: optimize SGMV kernels
- V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM: LoRA for vision models, and optimize LoRA kernels, accepted by EuroSys'25
- Efficient Multi-task LLM Quantization and Serving for Multiple LoRA Adapters: facilitates the sharing of a single quantized model for multiple LoRA adapters, accepted by NIPS'24
- Comparative Analysis and Optimization of LoRA Adapter Co-serving for Large Language Models: more like a survey for LoRA serving
- DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs: compresses model deltas to serve multiple full-parameter fine-tuned models (maybe not LoRA fine-tuning?)
- ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs: in fact similar to S-LoRA, on the background of serverless LLM+LoRA model
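A recurring trick in the scheduling papers above ("Efficient LLM Scheduling by Learning to Rank", the Past-Future scheduler, ALISE) is to order requests by a predicted output length so short jobs are not stuck behind long ones. Below is a minimal sketch of such a queue; the `predict_len` callback is a hypothetical stand-in for the learned length predictor.

```python
import heapq
import itertools

class PredictedSJFQueue:
    """Admit requests in shortest-predicted-output-first order."""

    def __init__(self, predict_len):
        self.predict_len = predict_len      # callable: prompt -> estimated output tokens
        self._heap = []
        self._tie = itertools.count()       # FIFO tie-break for equal predictions

    def submit(self, request_id, prompt):
        est = self.predict_len(prompt)
        heapq.heappush(self._heap, (est, next(self._tie), request_id, prompt))

    def next_batch(self, max_batch=8):
        batch = []
        while self._heap and len(batch) < max_batch:
            _, _, rid, prompt = heapq.heappop(self._heap)
            batch.append((rid, prompt))
        return batch

# usage with a trivial stand-in predictor (real systems train a model for this)
q = PredictedSJFQueue(predict_len=lambda p: len(p.split()) * 4)
q.submit("r1", "summarize this very long document about serving systems")
q.submit("r2", "hi")
print(q.next_batch())   # "r2" is served first
```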
For LoRA but not serving (a minimal LoRA forward-pass sketch follows this list)
- ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin: potential new style of LoRA
- Higher Layers Need More LoRA Experts
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- LoRA Meets Dropout under a Unified Framework: Analyze LoRA algorithmically
- HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning: algorithm optimization for LoRA
- SBoRA: Low-Rank Adaptation with Regional Weight Updates: an algorithm optimization for LoRA
- A Survey on LoRA of Large Language Models: survey of LoRAs, including parallel LoRA computing and Multi-LoRA, github
- mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs: can study the LoRA-aware pipeline parallelism scheme, github
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts: LoRA based MoE, github
- GongBu: Easily Fine-tuning LLMs for Domain-specific Adaptation: LLM fine-tuning tools
- Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method: selects several LoRA modules for a given content
- SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network: split learning(?): trains LoRA weights in a wireless network environment, stores LoRA in edge servers?
- Revolutionizing Large Model Fine-Tuning: The Role of LoRA in Parameter-Efficient Adaptation: a survey, can provide some reference
- HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression: optimizes fine-tuning memory overhead by quantization, accepted by MLSYS'25
- ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory: fine-tune
- HSplitLoRA: A Heterogeneous Split Parameter-Efficient Fine-Tuning Framework for Large Language Models: split learning + LoRA, fine-tunes on client devices, sets different ranks for different weights
- Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management: explore the dependencies between requests and LoRAs to reduce TTFT
- Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics: LoRA algorithm analysis
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving
- Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses: place training and inference together, control the inference latency to the desired SLO, while maximizing the throughput of the training jobs co-located on the same GPUs, accepted by ICDCS'24
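As background for the LoRA entries above: LoRA freezes the base weight W and learns a low-rank update B·A, so the adapted layer computes y = xWᵀ + (α/r)·(xAᵀ)Bᵀ and only A and B are trained. A minimal NumPy sketch of the forward pass; shapes follow the standard LoRA formulation and the random initialization is only for illustration.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    """Forward pass of a LoRA-adapted linear layer.

    x: (batch, d_in)              activations
    W: (d_out, d_in)              frozen base weight
    A: (r, d_in), B: (d_out, r)   trainable low-rank adapter (only these update)
    """
    r = A.shape[0]
    base = x @ W.T                              # frozen path
    update = (x @ A.T) @ B.T * (alpha / r)      # low-rank adapter path
    return base + update

# tiny example: d_in = d_out = 64, rank r = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64))
W = rng.normal(size=(64, 64))
A = rng.normal(size=(8, 64)) * 0.01
B = np.zeros((64, 8))        # B initialized to zero => starts identical to the base model
y = lora_linear(x, W, A, B)
print(y.shape)               # (2, 64)
```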
Long-context is a hot topic recently.
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: like an update to H2O or DejaVu, etc.; each attention head has a different memory budget
- Context Parallelism for Scalable Million-Token Inference
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection: select some important KV cache to take part in attention computation
Processing different ML workloads in a cluster.
- PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters: serve multiple different loads in GPU cluster, accepted by SC'24
- PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption: why Encryption in LLM inference? by IPADS, accepted by ASPLOS'25
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads: schedule different workloads
- ⭐ Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models: retrieval will be helpful, but how to use it?
- Generative Dense Retrieval: Memory Can Be a Burden: accepted by EACL'24
- ⭐ Accelerating Retrieval-Augmented Language Model Serving with Speculation: also a paper for RaLM
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation: improve RAG inference with cache, under the guidance of Xin JIN
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research
- NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: RAG with spec decoding, different draft models with different RAG
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation: trade-off between latency and quality
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation: combine RAG with prefix caching (see the chunk-cache sketch after this list)
- RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving: analyses the RAG algorithm, then optimizes the system
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion: reuses precomputed KV caches, prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache, accepted by EuroSys'25
- Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs: KV cache for RAG knowledge which is stored on disk
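The RAG-serving entries above (RAGCache, CacheBlend, Cache-Craft) amortize prefill by caching the KV tensors of retrieved chunks and reusing them across requests. A minimal sketch of a chunk-keyed LRU cache follows; `compute_kv` is a hypothetical stand-in for the engine's prefill call, and real systems additionally handle positional re-encoding and selective recomputation, which is skipped here.

```python
import hashlib
from collections import OrderedDict

class ChunkKVCache:
    """LRU cache mapping a retrieved text chunk to its precomputed KV tensors."""

    def __init__(self, compute_kv, capacity=256):
        self.compute_kv = compute_kv          # callable: chunk text -> KV tensors
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)      # LRU hit
            return self._store[key]
        kv = self.compute_kv(chunk_text)      # miss: run prefill for this chunk
        self._store[key] = kv
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least-recently-used chunk
        return kv

# usage: concatenate cached chunk KVs (plus the question's own KV) before decoding
cache = ChunkKVCache(compute_kv=lambda text: f"<kv for {len(text)} chars>")
kvs = [cache.get(c) for c in ["chunk A ...", "chunk B ...", "chunk A ..."]]
```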
Here are two repositories that collect papers for MoE: Papers: MoE/Ensemble, and MOE papers to read
- ⭐ DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: accepted by ICML'22
- Accelerating Distributed MoE Training and Inference with Lina: both training and inference, accepted by ATC'23
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts: accepted by MLSys'23
- Tutel: Adaptive Mixture-of-Experts at Scale: accepted by MLSys'23
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference: accepted by ISCA'24
- Optimizing Mixture of Experts using Dynamic Recompilations: under the guidance of Zhihao JIA
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping: expert swapping is interesting
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference: some hot optimizations for inference, accepted by NIPS'24
- Exploiting Transformer Activation Sparsity with Dynamic Inference
- SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production: accepted by ACL'22
- Fast Inference of Mixture-of-Experts Language Models with Offloading: combines MoE with offloading (see the expert-offloading sketch after this list)
- ⭐ MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving: under the guidance of Luo MAI, provides some features and designs in MoE inference
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement: train MoE with a new schedule plan, maybe works for inference
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models: quantized experts and expert management
- Toward Inference-optimal Mixture-of-Expert Large Language Models: some analysis for training MoE based on inference cost
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: communication optimization with dedicated schedules for MP+EP+ESP MoE training, maybe also works for inference, accepted by InfoCom'24
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models: based on offload, accepted by MLSys'24
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy: introduces some features of MoE, accepted by ICLR'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: introduces some features of MoE too
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an algorithmic change in MoE; a good introduction paper
- Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies: all-to-all comm, HPDC'24
- Scattered Mixture-of-Experts Implementation: ScatterMoE, an implementation of Sparse MoE
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts: the shortcut connection looks more like an algorithm optimization, and provides opportunities for overlapping
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: an open-source work whose inference is based on expert parallelism
- SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget: MoE experts offloading, at the cost of reduced accuracy
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching: optimization on Pre-gated MoE, by IPADS
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design: pre-gating router decoupled from the MoE backbone that facilitates system-friendly pre-computing and lookahead scheduling, NIPS'24
- MoEsaic: Shared Mixture of Experts: share experts among different MoE instances, "MoE's modular architecture lets users compose their model from popular off-the-shelf experts" is a new scenario
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference: uses quantization to decrease the overhead of loading uncached experts, on edge devices
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference: prediction- and offload-based optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: offloads MoE weights to CPU layer by layer and uses an offload pipeline to accelerate MoE inference on a single GPU, accepted by ASPLOS'25
- ⭐ MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems: benchmarking for MoE systems
- ⭐ Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection: damn! I had considered this before :( key insight is that expert importance varies significantly across tokens and inference phases; utilize this to solve the all-activate problem
- ⭐ EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference: GEMM implementation optimization and all-to-all communication overlap
- ⭐ Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling: optimize the all-to-all order, co-locate experts from different models
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing: utilizes expert dependency to optimize GPU load balance and all-to-all latency
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving: fine-grained expert offloading, prefetching and caching
- ⭐ Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts: fine-grained task scheduling and computation/all-to-all overlap
- eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference: predicts and preloads experts from CPU, uses the same experts for subsequent prompts, and skips routing for some tasks
- MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints: CPU-GPU based MoE inference
- Faster MoE LLM Inference for Extremely Large Models: fewer activated experts for faster inference
- ⭐ Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony: by dynamically queuing tokens at each layer (referred to as μ-queuing), GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling: scheduling computation and communication in MoE training, perhaps useful for MoE inference, accepted by EuroSys'24
- ST-MoE: Designing Stable and Transferable Sparse Expert Models: an early foundational work on MoE
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping: computation-communication overlapping, accepted by MLSys'24
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training: training with offload, ICML'24
- MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: predicts expert workload to optimize training; load stabilizes in the middle and late stages of training, but may not work as well for inference
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization: parallel strategy of MoE, accepted by ATC'23
- APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes: fine-tune MoE models with CPU and some algorithm insights, accepted by SC'24
- FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models: there isn't much of a novel technology(?), accepted by ASPLOS'25
- MOSEL: Inference Serving Using Dynamic Modality Selection: improving system throughput by 3.6x with an accuracy guarantee and shortening job completion times by 11x
- Generative AI Beyond LLMs: System Implications of Multi-Modal Generation: by META
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: by Google
- Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference: optimization for diffusion models by cache
- DISTMM: Accelerating distributed multimodal model training: helpful although it is made for training, accepted by NSDI'24
- Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training: distributed MM training
- Efficiently serving large multimedia models using EPD Disaggregation
- MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving: position-independent caching, with both reuse and recompute, may lead to performance loss
- Characterizing and Efficiently Accelerating Multimodal Generation Model Inference: some insights
- ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving: provide comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention
- HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving: a Hybrid Encode-Prefill-Decode (EPD) Disaggregation architecture
- DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models: disaggregation in MM training, under the guidance of Xin JIN
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management: efficient MM model training
- Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling: ASPLOS'25
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models: serving Diffusion models, accepted by NSDI'24
- DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines: accepted by MLSys'24
- SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules: more papers in diffusion models
- PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving: algorithm-based framework
- DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling: address the problem of serving text-to-image generation diffusion models in a query-aware resource-efficient manner by serving "easy" queries using a lightweight diffusion model without compromising image generation quality
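Many of the MoE-serving entries above (MoE-Infinity, fMoE, HOBBIT, SwapMoE) keep only a subset of experts resident on the GPU and fetch the rest from CPU/SSD when the router selects them. Below is a minimal sketch of top-k routing plus an LRU expert cache; `load_expert_to_gpu` is a hypothetical placeholder for the actual weight transfer, and real systems add prediction/prefetching on top.

```python
import numpy as np
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident on the GPU (LRU policy)."""

    def __init__(self, load_expert_to_gpu, capacity=8):
        self.load = load_expert_to_gpu       # callable: expert_id -> GPU-resident weights
        self.capacity = capacity
        self._resident = OrderedDict()

    def fetch(self, expert_id):
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)               # cache hit
        else:
            self._resident[expert_id] = self.load(expert_id)    # miss: copy from CPU/SSD
            if len(self._resident) > self.capacity:
                self._resident.popitem(last=False)               # evict coldest expert
        return self._resident[expert_id]

def route_and_fetch(router_logits, cache, k=2):
    """Pick the top-k experts for one token and make sure they are on the GPU."""
    topk = np.argsort(router_logits)[-k:][::-1]
    return [(int(e), cache.fetch(int(e))) for e in topk]

# usage with 64 experts and dummy weights
cache = ExpertCache(load_expert_to_gpu=lambda e: f"<weights of expert {e}>")
experts = route_and_fetch(np.random.randn(64), cache)
```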
What is this? Maybe multiple LLMs?
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems: a new scenario, by Stanford
- ALTO: An Efficient Network Orchestrator for Compound AI Systems: also new to me, by Stanford
- Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling: accuracy scaling is interesting, accepted by ASPLOS'24
- ⭐ MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving: multiple LLMs
- ROUTERBENCH: A Benchmark for Multi-LLM Routing System: but what is multi-LLM?
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference: prompt KV cache reuse, accepted by MLSys'24
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving: similar to BlockLLM?
- Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution: for LLM-based Applications
- RouteLLM: Learning to Route LLMs with Preference Data: use multiple LLMs for efficient serving
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference: inference several models simultaneously
- CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory: a new scenario, Collaboration-of-Experts instead of Mixture-of-Experts, provides some new opportunities, accepted by ASPLOS'25
- ⭐ SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference: enables service-aware and latency-optimized LLM sharing on same device
- HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing: uses multiple small models to approach the accuracy of serving solely with giant DNNs(?) (see the cascade-routing sketch after this list)
- Teola: Towards End-to-End Optimization of LLM-based Applications: end-to-end optimization
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable: accepted by OSDI'24
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications: many LLM apps share GPU, accepted by EuroSys'24
- Why Do Multi-Agent LLM Systems Fail?: learn algorithm from it
- ⭐ Autellix: An Efficient Serving Engine for LLM Agents as General Programs: multi-agent has something similar to LLM application, scheduling and preemption
- Fast Inference for Augmented Large Language Models: seems a subclass of multi-agent
- Towards End-to-End Optimization of LLM-based Applications with Ayo: utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph, enables optimizations in parallelization, pipelining across primitives of different modules, and enhances scheduling to improve application-level performance
- Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation: multi-LLM's end-to-end running
- Tempo: Application-aware LLM Serving with Mixed SLO Requirements: Meet SLO requirements for all services in the system and accelerate the overall process, can be used in multi-agent application
- Characterization of Large Language Model Development in the Datacenter: fault-tolerant serving in the future?
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement: Fault Tolerance in MoE training
- Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training: checkpointing in MoE
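Several of the multi-LLM entries above (RouteLLM, HybridServe) route or cascade queries between a cheap model and an expensive one. A minimal confidence-based cascade sketch follows; `small_model`, `large_model`, and the confidence signal are hypothetical placeholders (real systems derive confidence from token log-probs or a learned router).

```python
def cascade_answer(query, small_model, large_model, threshold=0.8):
    """Try the cheap model first; escalate if its confidence is too low.

    Each model is a hypothetical callable returning (answer, confidence in [0, 1]).
    """
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(query)          # escalate the hard query
    return answer, "large"

# usage with toy stand-ins
small = lambda q: ("short answer", 0.9 if len(q) < 40 else 0.3)
large = lambda q: ("carefully reasoned answer", 1.0)
print(cascade_answer("What is 2+2?", small, large))
print(cascade_answer("Explain the trade-offs of disaggregated prefill/decode.", small, large))
```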
It is usually related to CPU-GPU heterogeneity and GPU power consumption.
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving: early exits, accepted by SOSP'24 (see the early-exit sketch after this list)
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation: early exits and some system optimization, accepted by SOSP'24
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework: framework for RLHF
- HybridFlow: A Flexible and Efficient RLHF Framework: framework for RLHF, accepted by EuroSys'25
- RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
- Systems Opportunities for LLM Fine-Tuning using Reinforcement Learning: optimization for LLM Fine-Tuning using Reinforcement Learning
- ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation: accepted by MLSYS'25
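For the early-exit entries above (Apparate and the per-input compute-adaptation work): lightweight heads attached to intermediate layers let the forward pass stop once a prediction is confident enough. A minimal sketch with hypothetical `layers` and `exit_heads` callables; thresholds and head placement are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward_with_early_exit(x, layers, exit_heads, threshold=0.9):
    """Run layer by layer; return early when an exit head is confident.

    layers:     list of callables, hidden -> hidden
    exit_heads: list of callables, hidden -> class logits (one per layer)
    """
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = layer(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:              # confident enough: exit here
            return int(probs.argmax()), depth
    return int(probs.argmax()), len(layers) - 1   # fell through to the last layer

# usage with toy linear layers and heads
rng = np.random.default_rng(0)
layers = [lambda h, W=rng.normal(size=(16, 16)) * 0.1: np.tanh(h @ W) for _ in range(4)]
heads = [lambda h, W=rng.normal(size=(16, 3)): h @ W for _ in range(4)]
print(forward_with_early_exit(rng.normal(size=16), layers, heads))
```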
Wise men learn from others.
- Orca 2: Teaching Small Language Models How to Reason
- FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference: optimization for retrieval-augmented language model
- Optimizing Dynamic Neural Networks with Brainstorm: this idea has the potential to go further, accepted by OSDI'23
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Ring Attention?
- Reducing Activation Recomputation in Large Transformer Models: by NVIDIA
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models: an interesting performance metric, accepted by NIPS'23
- FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication: accepted by SIGMOD'23
- Efficient Multi-GPU Graph Processing with Remote Work Stealing: accepted by ICDE'23
- ARK: GPU-driven Code Execution for Distributed Deep Learning: accepted by NSDI'23
- Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs: accepted by MLSys'22
- Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing: Scheduling for Serverless Computing
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters: expand to other ML models instead of LLM
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
- Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM: efficient SpMM, accepted by ASPLOS'24
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching: GPU memory pool, accepted by ASPLOS'24
- QuickLLaMA: Query-aware Inference Acceleration for Large Language Models: an inference-friendly LLaMA architecture
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching for new model architectures that combine attention with SSMs
- Comprehensive Deadlock Prevention for GPU Collective Communication: communication library
I'd like to create a separate area for data flows. It's just my preference.
- ⭐ FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks: dataflow in inference
- Pathways: Asynchronous Distributed Dataflow for ML: accepted by MLSys'22
- VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware: accepted by MLSys'22
- NeuStream: Bridging Deep Learning Serving and Stream Processing: dataflow in DNN serving, accepted by EuroSys'25
How about data pre-processing overhead in training?
Just my preference.
- Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication
- GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
- PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks: accepted by IPDPS'24
- NPA: Improving Large-scale Graph Neural Networks with Non-parametric Attention: SIGMOD'24
- Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression: compress node features in graph, accepted by VLDB'24
- Mega: More Efficient Graph Attention for GNNs: optimize graph attention efficiency, ICDCS'24
- TORCHGT: A Holistic System for Large-Scale Graph Transformer Training: graph transformer model
Just my preference, too.