A curated list of awesome papers on optimizing inference for MoE-based LLMs.
Entry format: [Conference'year] Paper Title [Code]
[Preprints'24.8] The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs
[Arxiv'24.8] A Survey on Mixture of Experts [Code]
[Arxiv'22] A Review of Sparse Expert Models in Deep Learning
Reference | Params | Experts (activated/total/shared) | #L | #H | d_model | d_ffn | d_expert | Affiliation | Time
---|---|---|---|---|---|---|---|---|---
NLLB | 54B | 2/64/0 | 24 | 16 | 1024 | 8192 | 8192 | Meta AI | 2022.07
Qwen2-57B-A14B | 57.4B | 8/64/0 | 28 | 28 | 3584 | 18944 | 2560 | Alibaba | 2024.06
Mixtral-8x7B | 46.7B | 2/8/0 | 32 | 32 | 4096 | 14336 | 14336 | Mistral AI | 2023.12 |
OpenMoE | 34B | 2/16/0 | 12 | 12 | 768 | 2048 | 2048 | NUS et al. | 2023.12 |
DeepSeekMoE | 16.4B | 6/64/2 | 28 | 16 | 2048 | 10944 | 1408 | DeepSeek-AI | 2024.01 |
Qwen1.5-MoE | 14.3B | 4/60/0 | 24 | 16 | 2048 | 5632 | 1408 | Alibaba | 2024.02 |
JetMoE | 8.52B | 2/8/0 | 24 | 32 | 2048 | 5632 | 5632 | MIT et al. | 2024.03 |
Jamba | 51.6B | 2/16/0 | 32 | 32 | 4096 | 14336 | 14336 | ai21labs | 2024.03 |
DBRX | 132B | 4/16/0 | 40 | 48 | 6144 | 10752 | 10752 | Databricks | 2024.03 |
Grok-1 | 314B | 2/8/0 | 64 | 48 | 6144 | UNK | UNK | xAI | 2024.03 |
Arctic | 482B | 2/128/0 | 35 | 56 | 7168 | 4864 | 4864 | Snowflake | 2024.04 |
Mixtral-8x22B | 141B | 2/8/0 | 56 | 48 | 6144 | 16384 | 16384 | Mistral AI | 2024.04 |
DeepSeek-V2 | 236B | 6/160/2 | 60 | 128 | 5120 | 12288 | 1536 | DeepSeek-AI | 2024.04 |
Skywork-MoE | 146B | 2/16/0 | 52 | 36 | 4608 | 12288 | 12288 | Kunlun Tech | 2024.05
Yuan2 | 40B | 2/32/0 | 24 | 16 | 2048 | 8192 | 8192 | IEIT-Yuan | 2024.05 |
LLaMA-MoE | 6.7B | 2/8/0 | 32 | 32 | 4096 | 11008 | 11008 | Zhu et al. | 2024.06 |
OLMoE | 6.92B | 8/64/0 | 16 | 16 | 2048 | 1024 | 1024 | AllenAI | 2024.07 |
Phi-3 | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.08
GRIN-MoE | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.09
Hunyuan-Large | 389B | 1/16/1 | 64 | 80 | 6400 | 18304 | 18304 | Tencent | 2024.11 |
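In the Experts column, a/b/c denotes activated routed experts / total routed experts / shared experts per MoE layer. Below is a minimal PyTorch-style sketch of such top-k routing with optional shared experts; class and argument names are illustrative assumptions, not any listed model's actual API.

```python
# Minimal sketch of top-k MoE routing with optional shared experts.
# Illustrative only: names and shapes are assumptions, not a real model's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model, d_expert):
    return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))

class SparseMoE(nn.Module):
    def __init__(self, d_model, d_expert, n_experts, top_k, n_shared=0):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # token-to-expert gate
        self.experts = nn.ModuleList(make_ffn(d_model, d_expert) for _ in range(n_experts))
        self.shared = nn.ModuleList(make_ffn(d_model, d_expert) for _ in range(n_shared))

    def forward(self, x):                                   # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)            # [tokens, n_experts]
        weights, idx = probs.topk(self.top_k, dim=-1)        # each token picks its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize over the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):            # an expert only runs on tokens routed to it
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        for expert in self.shared:                           # shared experts process every token
            out = out + expert(x)
        return out

# Roughly a DeepSeekMoE-16B-style layer from the table: 6 of 64 routed experts plus 2 shared.
layer = SparseMoE(d_model=2048, d_expert=1408, n_experts=64, top_k=6, n_shared=2)
y = layer(torch.randn(4, 2048))                              # [4, 2048]
```

Real implementations differ in gate normalization, expert capacity, and load-balancing losses; this sketch only illustrates the activated/total/shared notation used in the table.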
[Arxiv'24.11] Hunyuan-Large [Code]
[Arxiv'24.1] Mixtral-8x7B [Code]
[Arxiv'24.1] Mixtral-8x22B [Code]
[Arxiv'24.1] DeepSeekMoE [Code]
[Arxiv'24.6] DeepSeek-V2 [Code]
[Arxiv'24.9] GRadient-INformed MoE [Code]
[Arxiv'24.9] Qwen2-57B-A14B [Code]
[QwenBlog'24.3] Qwen1.5-MoE [Code]
[Arxiv'24.9] OLMoE: Open Mixture-of-Experts Language Models [Code]
[Arxiv'24.3] OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Code]
[Arxiv'24.6] Skywork-MoE [Code]
[Arxiv'24.4] JetMoE: Reaching Llama2 Performance with 0.1M Dollars [Code]
[Arxiv'24.5] Yuan 2.0-M32 [Code]
[MosaicResearchBlog'24.3] DBRX [Code]
[SnowflakeBlog'24.4] Arctic [Code]
[Arxiv'24.8] BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
[Arxiv'24.10] MoH: Multi-Head Attention as Mixture-of-Head Attention [Code]
[Arxiv'24.4] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
[Arxiv'24.4] JetMoE: Reaching Llama2 Performance with 0.1M Dollars [Code]
[NeurIPS'24.10] MoEUT: Mixture-of-Experts Universal Transformers [Code]
[NeurIPS'24.9] SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention [Code]
[Arxiv'23] ModuleFormer: Modularity Emerges from Mixture-of-Experts [Code]
[Arxiv'23] Sparse Universal Transformer
[EMNLP'22] Mixture of Attention Heads: Selecting Attention Heads Per Token [Code]
[ACL'20] A Mixture of h - 1 Heads is Better than h Heads
[Arxiv'24.10] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [Code]
[Arxiv'24.2] MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models
[Arxiv'23] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [Code]
[ICLR'23] SCoMoE: Efficient Mixtures of Experts with Structured Communication
[KDD'23] COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search
[Arxiv'24.10] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
[Arxiv'24.4] SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
[Arxiv'24.10] Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
[Arxiv'24.7] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs [Code]
[ACL'24.5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [Code]
[Arxiv'24.9] Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
[Arxiv'24.9] STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
[Arxiv'24.6] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework [Code]
[Arxiv'24.5] A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts
[Arxiv'24.11] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [Code]
[ICLR'24.3] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [Code]
[Arxiv'23] ModuleFormer: Modularity Emerges from Mixture-of-Experts [Code]
[Arxiv'22] Task-Specific Expert Pruning for Sparse Mixture-of-Experts
[Arxiv'24.10] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [Code]
[Arxiv'23] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness
[Arxiv'23] QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [Code]
[Arxiv'24.11] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
[Arxiv'24.9] Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
[Arxiv'24.6] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [Code]
[INTERSPEECH'23] Compressed MoE ASR Model Based on Knowledge Distillation and Quantization
[Arxiv'23] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [Quantization]
[EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
[Arxiv'24.10] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
[Arxiv'24.8] LaDiMo: Layer-wise Distillation Inspired MoEfier
[INTERSPEECH'23] Compressed MoE ASR Model Based on Knowledge Distillation and Quantization
[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [Code]
[Microsoft'22] Knowledge Distillation for Mixture of Experts Models in Speech Recognition
[Arxiv'22] One Student Knows All Experts Know: From Sparse to Dense
[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[Arxiv'21] Efficient Large Scale Language Modeling with Mixtures of Experts
[Arxiv'24.11] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [Code]
[ICLR'24.3] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [Code]
[Arxiv'22] Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [Code]
[Arxiv'24.8] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [Code]
[ACL'24.8] XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection
[Arxiv'23] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [Code]
[Arxiv'23] Adaptive Gating in Mixture-of-Experts based Language Models
[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
[Arxiv'24.10] Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering
[EMNLP'23] Merging Experts into One: Improving Computational Efficiency of Mixture of Experts
[Arxiv'24.3] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
[Arxiv'22] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
[ICLR'24.5] Fusing Models with Complementary Expertise
[Arxiv'24.5] Learning More Generalized Experts by Merging Experts in Mixture-of-Experts
[ACL'24.6] XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
[Arxiv'23] ModuleFormer: Learning Modular Large Language Models from Uncurated Data
[Arxiv'23] Experts Weights Averaging: A New General Training Scheme for Vision Transformers
[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[Arxiv'22] One Student Knows All Experts Know: From Sparse to Dense
[Arxiv'22] Task-Specific Expert Pruning for Sparse Mixture-of-Experts
[Arxiv'21] Efficient Large Scale Language Modeling with Mixtures of Experts
[Arxiv'25.1] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
[ASPLOS'25] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
[OpenReview'24.11] Toward Efficient Inference for Mixture of Experts
[Arxiv'24.10] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
[IPDPS'24.1] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
[Arxiv'24.10] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
[IEEE'24.5] WDMoE: Wireless Distributed Large Language Models with Mixture of Experts
[Arxiv'24.11] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
[Arxiv'24.4] Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
[Arxiv'24.10] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [Code] [MoE Module Design]
[Arxiv'24.11] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
[Arxiv'24.11] HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy [Code]
[Arxiv'24.5] LocMoE: A Low-Overhead MoE for Large Language Model Training
[Arxiv'24.7] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
[Arxiv'24.10] Scattered Mixture-of-Experts Implementation [Code]
[TPDS'24.4] MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
[INFOCOM'24.5] Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules
[EuroSys'24.4] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
[SIGCOMM'23] Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models
[INFOCOM'23] PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining
[ATC'23] Accelerating Distributed MoE Training and Inference with Lina
[ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization [Code]
[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
[SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement [Code]
[MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale [Code]
[OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm [Code]
[ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
[CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
[OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [Code]
[NeurIPS'22] TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [Code]
[NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
[PPoPP'22] FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-trained Models [Code]
[PPoPP'22] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores
[SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
[PMLR'22] Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers
[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [Code]
[Arxiv'22] HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System [Code]
[Arxiv'21] FastMoE: A Fast Mixture-of-Expert Training System [Code]
[PMLR'21] BASE Layers: Simplifying Training of Large, Sparse Models [Code]
[Arxiv'20] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
[Arxiv'24.10] ProMoE: Fast MoE-based LLM Serving using Proactive Caching
[NeurIPS'24.10] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [Code]
[Arxiv'24.11] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
[Arxiv'24.11] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
[Arxiv'24.11] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [Quantization, Skip Expert]
[Arxiv'24.10] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
[Arxiv'24.8] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [Code] [Adaptive Gating]
[Arxiv'24.9] Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
[MLSys'24.5] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [Code]
[Arxiv'24.8] MoE-Infinity: Offloading-Efficient MoE Model Serving [Code]
[Arxiv'24.2] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models [Code]
[Electronics'24.5] Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things
[ISCA'24.4] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [Code] [MoE Module]
[HPCA'24.3] Enabling Large Dynamic Neural Network Training with Learning-based Memory Management
[SC'24.11] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
[Arxiv'23] Fast Inference of Mixture-of-Experts Language Models with Offloading [Code]
[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [Adaptive Gating]
[Arxiv'23] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [Quantization]
[ACL'24.5] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
[SoCC'24.11] MoEsaic: Shared Mixture of Experts
[MICRO'24.9] Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
[DAC'24.5] MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
[DAC'24.11] FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA
[ISSCC'24.2] Space-Mate: A 303.5mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing
[ICCAD'23] Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-Level Sparsity via Mixture-of-Experts [Code]
[NeurIPS'22] M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [Code]