A curated list of awesome papers on optimizing inference for MoE-based LLMs.
Entry format: [Conference'year] Paper Title [Code]
[Preprints'24.8] The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs
[Arxiv'24.8] A Survey on Mixture of Experts [Code]
[Arxiv'22] A Review of Sparse Expert Models in Deep Learning
Reference | Params | Experts (activated/total/shared) | #L | #H | d_model | d_ffn | d_expert | Affiliation | Time
---|---|---|---|---|---|---|---|---|---
NLLB | 54B | 2/64/0 | 24 | 16 | 1024 | 8192 | 8192 | Meta AI | 2022.07
Qwen2-57B-A14B | 57.4B | 8/64/0 | 28 | 28 | 3584 | 18944 | 2560 | Alibaba | 2024.06
Mixtral-8x7B | 46.7B | 2/8/0 | 32 | 32 | 4096 | 14336 | 14336 | Mistral AI | 2023.12 |
OpenMoE | 34B | 2/16/0 | 12 | 12 | 768 | 2048 | 2048 | NUS et al. | 2023.12 |
DeepSeekMoE | 16.4B | 6/64/2 | 28 | 16 | 2048 | 10944 | 1408 | DeepSeek-AI | 2024.01 |
Qwen1.5-MoE | 14.3B | 4/60/0 | 24 | 16 | 2048 | 5632 | 1408 | Alibaba | 2024.02 |
JetMoE | 8.52B | 2/8/0 | 24 | 32 | 2048 | 5632 | 5632 | MIT et al. | 2024.03 |
Jamba | 51.6B | 2/16/0 | 32 | 32 | 4096 | 14336 | 14336 | ai21labs | 2024.03 |
DBRX | 132B | 4/16/0 | 40 | 48 | 6144 | 10752 | 10752 | Databricks | 2024.03 |
Grok-1 | 314B | 2/8/0 | 64 | 48 | 6144 | UNK | UNK | xAI | 2024.03 |
Arctic | 482B | 2/128/0 | 35 | 56 | 7168 | 4864 | 4864 | Snowflake | 2024.04 |
Mixtral-8x22B | 141B | 2/8/0 | 56 | 48 | 6144 | 16384 | 16384 | Mistral AI | 2024.04 |
DeepSeek-V2 | 236B | 6/160/2 | 60 | 128 | 5120 | 12288 | 1536 | DeepSeek-AI | 2024.04 |
Skywork-MoE | 146B | 2/16/0 | 52 | 36 | 4608 | 12288 | 12288 | Kunlun Tech | 2024.05
Yuan2 | 40B | 2/32/0 | 24 | 16 | 2048 | 8192 | 8192 | IEIT-Yuan | 2024.05 |
LLaMA-MoE | 6.7B | 2/8/0 | 32 | 32 | 4096 | 11008 | 11008 | Zhu et al. | 2024.06 |
OLMoE | 6.92B | 8/64/0 | 16 | 16 | 2048 | 1024 | 1024 | AllenAI | 2024.07 |
Phi-3 | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.08
GRIN-MoE | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.09
Hunyuan-Large | 389B | 1/16/1 | 64 | 80 | 6400 | 18304 | 18304 | Tencent | 2024.11 |
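In the Experts column, a/b/c denotes activated routed experts / total routed experts / shared experts per MoE layer. Below is a minimal PyTorch-style sketch of such top-k routing with optional shared experts; class and argument names are illustrative assumptions, not any listed model's actual API.

```python
# Minimal sketch of top-k MoE routing with optional shared experts.
# Illustrative only: names and shapes are assumptions, not a real model's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model, d_expert):
    return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))

class SparseMoE(nn.Module):
    def __init__(self, d_model, d_expert, n_experts, top_k, n_shared=0):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # token-to-expert gate
        self.experts = nn.ModuleList(make_ffn(d_model, d_expert) for _ in range(n_experts))
        self.shared = nn.ModuleList(make_ffn(d_model, d_expert) for _ in range(n_shared))

    def forward(self, x):                                   # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)            # [tokens, n_experts]
        weights, idx = probs.topk(self.top_k, dim=-1)        # each token picks its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalize over the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):            # an expert only runs on tokens routed to it
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        for expert in self.shared:                           # shared experts process every token
            out = out + expert(x)
        return out

# Roughly a DeepSeekMoE-16B-style layer from the table: 6 of 64 routed experts plus 2 shared.
layer = SparseMoE(d_model=2048, d_expert=1408, n_experts=64, top_k=6, n_shared=2)
y = layer(torch.randn(4, 2048))                              # [4, 2048]
```

Real implementations differ in gate normalization, expert capacity, and load-balancing losses; this sketch only illustrates the activated/total/shared notation used in the table.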
[Arxiv'24.11] Hunyuan-Large [Code]
[Arxiv'24.1] Mixtral-8x7B [Code]
[Arxiv'24.1] Mixtral-8x22B [Code]
[Arxiv'24.1] DeepSeekMoE [Code]
[Arxiv'24.6] DeepSeek-V2 [Code]
[Arxiv'24.9] GRadient-INformed MoE [Code]
[Arxiv'24.9] Qwen2-57B-A14B [Code]
[QwenBlog'24.3] Qwen1.5-MoE [Code]
[Arxiv'24.9] OLMoE: Open Mixture-of-Experts Language Models [Code]
[Arxiv'24.3] OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Code]
[Arxiv'24.6] Skywork-MoE [Code]
[Arxiv'24.4] JetMoE: Reaching Llama2 Performance with 0.1M Dollars [Code]
[Arxiv'24.5] Yuan 2.0-M32 [Code]
[MosaicResearchBlog'24.3] DBRX [Code]
[SnowflakeBlog'24.4] Arctic [Code]
[Arxiv'24.8] BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
[Arxiv'24.10] MoH: Multi-Head Attention as Mixture-of-Head Attention [Code]
[Arxiv'24.4] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
[Arxiv'24.4] JetMoE: Reaching Llama2 Performance with 0.1M Dollars [Code]
[NeurIPS'24.10] MoEUT: Mixture-of-Experts Universal Transformers [Code]
[NeurIPS'24.9] SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention [Code]
[Arxiv'23] ModuleFormer: Modularity Emerges from Mixture-of-Experts [Code]
[Arxiv'23] Sparse Universal Transformer
[EMNLP'22] Mixture of Attention Heads: Selecting Attention Heads Per Token [Code]
[ACL'20] A Mixture of h - 1 Heads is Better than h Heads
[Arxiv'24.10] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [Code]
[Arxiv'24.2] MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models
[Arxiv'23] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [Code]
[ICLR'23] SCoMoE: Efficient Mixtures of Experts with Structured Communication
[KDD'23] COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search
[Arxiv'24.10] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
[Arxiv'24.4] SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
[Arxiv'24.10] Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
[Arxiv'24.7] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs [Code]
[ACL'24.5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [Code]
[Arxiv'24.9] Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
[Arxiv'24.9] STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
[Arxiv'24.6] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework [Code]
[Arxiv'24.5] A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts
[Arxiv'24.11] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [Code]
[ICLR'24.3] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [Code]
[Arxiv'23] ModuleFormer: Modularity Emerges from Mixture-of-Experts [Code]
[Arxiv'22] Task-Specific Expert Pruning for Sparse Mixture-of-Experts
[Arxiv'24.10] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [Code]
[Arxiv'23] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness
[Arxiv'23] QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [Code]
[Arxiv'24.11] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
[Arxiv'24.9] Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
[Arxiv'24.6] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [Code]
[INTERSPEECH'23] Compressed MoE ASR Model Based on Knowledge Distillation and Quantization
[Arxiv'23] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [Quantization]
[EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
[Arxiv'24.10] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
[Arxiv'24.8] LaDiMo: Layer-wise Distillation Inspired MoEfier
[INTERSPEECH'23] Compressed MoE ASR Model Based on Knowledge Distillation and Quantization
[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [Code]
[Microsoft'22] Knowledge Distillation for Mixture of Experts Models in Speech Recognition
[Arxiv'22] One Student Knows All Experts Know: From Sparse to Dense
[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[Arxiv'21] Efficient Large Scale Language Modeling with Mixtures of Experts
[Arxiv'24.11] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [Code]
[ICLR'24.3] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [Code]
[Arxiv'22] Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [Code]
[Arxiv'24.8] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [Code]
[ACL'24.8] XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection
[Arxiv'23] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [Code]
[Arxiv'23] Adaptive Gating in Mixture-of-Experts based Language Models
[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
[Arxiv'24.10] Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering
[EMNLP'23] Merging Experts into One: Improving Computational Efficiency of Mixture of Experts
[Arxiv'24.3] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
[Arxiv'22] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
[ICLR'24.5] Fusing Models with Complementary Expertise
[Arxiv'24.5] Learning More Generalized Experts by Merging Experts in Mixture-of-Experts
[ACL'24.6] XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
[Arxiv'23] ModuleFormer: Learning Modular Large Language Models from Uncurated Data
[Arxiv'23] Experts Weights Averaging: A New General Training Scheme for Vision Transformers
[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
[Arxiv'22] One Student Knows All Experts Know: From Sparse to Dense
[Arxiv'22] Task-Specific Expert Pruning for Sparse Mixture-of-Experts
[Arxiv'21] Efficient Large Scale Language Modeling with Mixtures of Experts
[Arxiv'25.1] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
[ASPLOS'25] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
[OpenReview'24.11] Toward Efficient Inference for Mixture of Experts
[Arxiv'24.10] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
[IPDPS'24.1] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
[Arxiv'24.10] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
[IEEE'24.5] WDMoE: Wireless Distributed Large Language Models with Mixture of Experts
[Arxiv'24.11] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
[Arxiv'24.4] Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
[Arxiv'24.10] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [Code] [MoE Module Design]
[Arxiv'24.11] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
[Arxiv'24.11] HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy [Code]
[Arxiv'24.5] LocMoE: A Low-Overhead MoE for Large Language Model Training
[Arxiv'24.7] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
[Arxiv'24.10] Scattered Mixture-of-Experts Implementation [Code]
[TPDS'24.4] MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
[INFOCOM'24.5] Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules
[EuroSys'24.4] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
[SIGCOMM'23] Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models
[INFOCOM'23] PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining
[ATC'23] Accelerating Distributed MoE Training and Inference with Lina
[ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization [Code]
[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
[SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement [Code]
[MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale [Code]
[OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm [Code]
[ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
[CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models
[OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [Code]
[NeurIPS'22] TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [Code]
[NeurIPS'22] Mixture-of-Experts with Expert Choice Routing
[PPoPP'22] FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-trained Models [Code]
[PPoPP'22] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores
[SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism
[PMLR'22] Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers
[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [Code]
[Arxiv'22] HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System [Code]
[Arxiv'21] FastMoE: A Fast Mixture-of-Expert Training System [Code]
[PMLR'21] BASE Layers: Simplifying Training of Large, Sparse Models [Code]
[Arxiv'20] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
[Arxiv'24.10] ProMoE: Fast MoE-based LLM Serving using Proactive Caching
[NeurIPS'24.10] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [Code]
[Arxiv'24.11] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
[Arxiv'24.11] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
[Arxiv'24.11] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [Quantization, Skip Expert]
[Arxiv'24.10] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
[Arxiv'24.8] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [Code] [Adaptive Gating]
[Arxiv'24.9] Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
[MLSys'24.5] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [Code]
[Arxiv'24.8] MoE-Infinity: Offloading-Efficient MoE Model Serving [Code]
[Arxiv'24.2] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models [Code]
[Electronics'24.5] Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things
[ISCA'24.4] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [Code] [MoE Module]
[HPCA'24.3] Enabling Large Dynamic Neural Network Training with Learning-based Memory Management
[SC'24.11] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes
[Arxiv'23] Fast Inference of Mixture-of-Experts Language Models with Offloading [Code]
[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [Adaptive Gating]
[Arxiv'23] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [Quantization]
[ACL'24.5] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
[SoCC'24.11] MoEsaic: Shared Mixture of Experts
[MICRO'24.9] Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
[DAC'24.5] MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
[DAC'24.11] FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA
[ISSCC'24.2] Space-Mate: A 303.5mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing
[ICCAD'23] Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-Level Sparsity via Mixture-of-Experts [Code]
[NeurIPS'22] M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [Code]