Awesome MoE LLM Inference System and Algorithm

A curated list of awesome papers about optimizing the inference of MoE-based LLMs.

Example: [Conference'year] Paper Title [Code]

Contents

Survey

[Preprints'24.8] The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs

[Arxiv'24.8] A Survey on Mixture of Experts [Code]

[Arxiv'22] A Review of Sparse Expert Models in Deep Learning

SOTA Open Source MoE LLMs

| Model | Params | Experts (act./total/shared) | #Layers | #Heads | $d_{model}$ | $d_{ffn}$ | $d_{expert}$ | Affiliation | Release |
|---|---|---|---|---|---|---|---|---|---|
| NLLB | 54B | 2/64/0 | 24 | 16 | 1024 | 8192 | 8192 | Facebook | 2022.07 |
| Qwen2-57B-A14B | 57.4B | 8/64/0 | 28 | 28 | 3584 | 18944 | 2560 | Alibaba | 2023.05 |
| Mixtral-8x7B | 46.7B | 2/8/0 | 32 | 32 | 4096 | 14336 | 14336 | Mistral AI | 2023.12 |
| OpenMoE | 34B | 2/16/0 | 12 | 12 | 768 | 2048 | 2048 | NUS et al. | 2023.12 |
| DeepSeekMoE | 16.4B | 6/64/2 | 28 | 16 | 2048 | 10944 | 1408 | DeepSeek-AI | 2024.01 |
| Qwen1.5-MoE | 14.3B | 4/60/0 | 24 | 16 | 2048 | 5632 | 1408 | Alibaba | 2024.02 |
| JetMoE | 8.52B | 2/8/0 | 24 | 32 | 2048 | 5632 | 5632 | MIT et al. | 2024.03 |
| Jamba | 51.6B | 2/16/0 | 32 | 32 | 4096 | 14336 | 14336 | ai21labs | 2024.03 |
| DBRX | 132B | 4/16/0 | 40 | 48 | 6144 | 10752 | 10752 | Databricks | 2024.03 |
| Grok-1 | 314B | 2/8/0 | 64 | 48 | 6144 | UNK | UNK | xAI | 2024.03 |
| Arctic | 482B | 2/128/0 | 35 | 56 | 7168 | 4864 | 4864 | Snowflake | 2024.04 |
| Mixtral-8x22B | 141B | 2/8/0 | 56 | 48 | 6144 | 16384 | 16384 | Mistral AI | 2024.04 |
| DeepSeek-V2 | 236B | 6/160/2 | 60 | 128 | 5120 | 12288 | 1536 | DeepSeek-AI | 2024.04 |
| Skywork-MoE | 13B | 2/16/0 | 52 | 36 | 4608 | 12288 | 12288 | Kunlun Tech | 2024.05 |
| Yuan2 | 40B | 2/32/0 | 24 | 16 | 2048 | 8192 | 8192 | IEIT-Yuan | 2024.05 |
| LLaMA-MoE | 6.7B | 2/8/0 | 32 | 32 | 4096 | 11008 | 11008 | Zhu et al. | 2024.06 |
| OLMoE | 6.92B | 8/64/0 | 16 | 16 | 2048 | 1024 | 1024 | AllenAI | 2024.07 |
| Phi-3 | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.08 |
| GRIN-MoE | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.09 |
| Hunyuan-Large | 389B | 1/16/1 | 64 | 80 | 6400 | 18304 | 18304 | Tencent | 2024.11 |
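
In the Experts column, a/b/c denotes activated/total/shared experts per token. As a rough sanity check on how these columns relate, the sketch below estimates expert parameters assuming a gated (SwiGLU-style) FFN with three $d_{model} \times d_{expert}$ matrices per expert; this is an illustrative approximation only, since the models above differ in FFN type and ignore attention and embedding parameters.

```python
# Illustrative back-of-the-envelope estimate of MoE expert parameters.
# Assumes a gated (SwiGLU-style) FFN with three d_model x d_expert matrices per
# expert; actual models differ in FFN type, shared-expert handling, and so on.

def moe_ffn_params(n_layers, d_model, d_expert, n_total, n_active, n_shared=0):
    per_expert = 3 * d_model * d_expert                    # gate, up, down projections
    total = n_layers * (n_total + n_shared) * per_expert
    active = n_layers * (n_active + n_shared) * per_expert
    return total, active

# Example: a Mixtral-8x7B-like configuration from the table above.
total, active = moe_ffn_params(n_layers=32, d_model=4096, d_expert=14336,
                               n_total=8, n_active=2)
print(f"expert params ~{total/1e9:.1f}B total, ~{active/1e9:.1f}B activated per token")
```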

[Arxiv'24.11] Hunyuan-Large [Code]

[Arxiv'24.1] Mixtral-8x7B [Code]

[Arxiv'24.1] Mixtral-8x22B [Code]

[Arxiv'24.1] DeepseekMoE [Code]

[Arxiv'24.6] DeepSeek-V2 [Code]

[Arxiv'24.8] PhiMoE [Code]

[Arxiv'24.9] GRadient-INformed MoE [Code]

[Arxiv'24.9] Qwen2-57B-A14B [Code]

[QwenBlog'24.3] Qwen1.5-MoE [Code]

[Arxiv'24.9] OLMoE: Open Mixture-of-Experts Language Models [Code]

[Arxiv'24.3] OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Code]

[Arxiv'24.6] Skywork-MoE [Code]

[Arxiv'24.4] JetMoE: Reaching Llama2 Performance with 0.1M Dollars [Code]

[Arxiv'24.5] Yuan 2.0-M32 [Code]

[MosaicResearchBlog'24.3] DBRX [Code]

[SnowflakeBlog'24.4] Arctic [Code]

[XAIBlog'24.3] Grok-1 [Code]

[Arxiv'24.7] Jamba [Code]

[Arxiv'24.6] LLaMA-MoE [Code]

[Arxiv'22] NLLB-MOE [Code]

[ICCV'21] Swin-MoE [Code]

Model-Level Optimizations

Efficient Architecture Design

Attention Module

[Arxiv'24.8] BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

[Arxiv'24.10] MoH: Multi-Head Attention as Mixture-of-Head Attention [Code]

[Arxiv'24.4] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

[Arxiv'24.4] JetMoE: Reaching Llama2 Performance with 0.1M Dollars [Code]

[NeurIPS'24.10] MoEUT: Mixture-of-Experts Universal Transformers [Code]

[NeurIPS'24.9] SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention [Code]

[Arxiv'23] ModuleFormer: Modularity Emerges from Mixture-of-Experts [Code]

[Arxiv'23] Sparse Universal Transformer

[EMNLP'22] Mixture of Attention Heads: Selecting Attention Heads Per Token [Code]

[ACL'20] A Mixture of h - 1 Heads is Better than h Heads

MoE Module

[Arxiv'24.10] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [Code]

[Arxiv'24.2] MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models

[Arxiv'23] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [Code]

[ICLR'23] SCoMoE: Efficient Mixtures of Experts with Structured Communication

[KDD'23] COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search
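
Most of the MoE-module papers above modify or replace the standard top-k routed feed-forward layer. As a reference point, here is a minimal PyTorch sketch of that vanilla baseline; the dimensions, expert count, and k are illustrative, and no specific paper's design is implied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed MoE FFN (illustrative baseline, not a specific paper's method)."""
    def __init__(self, d_model=256, d_expert=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: [tokens, d_model]
        scores = self.router(x)                            # [tokens, n_experts]
        weights, idx = torch.topk(scores, self.k, dim=-1)  # per-token top-k experts
        weights = F.softmax(weights, dim=-1)               # renormalize over selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = torch.where(idx == e)        # tokens routed to expert e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

moe = TopKMoE()
print(moe(torch.randn(16, 256)).shape)                     # torch.Size([16, 256])
```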

Model Compression

Pruning

[Arxiv'24.10] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

[Arxiv'24.4] SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts

[Arxiv'24.10] Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

[Arxiv'24.7] Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs [Code]

[ACL'24.5] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [Code]

[Arxiv'24.9] Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning

[Arxiv'24.9] STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning

[Arxiv'24.6] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework [Code]

[Arxiv'24.5] A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

[Arxiv'24.11] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [Code]

[ICLR'24.3] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [Code]

[Arxiv'23] ModuleFormer: Modularity Emerges from Mixture-of-Experts [Code]

[Arxiv'22] Task-Specific Expert Pruning for Sparse Mixture-of-Experts
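
A recurring ingredient in the expert-pruning papers above is scoring experts by how often the router selects them on calibration data and dropping the least-used ones. The sketch below illustrates only that generic frequency criterion; the actual papers differ in scoring signals, pruning granularity, and recovery steps.

```python
import torch

def expert_usage(router_logits, k=2):
    """Count how often each expert appears in the top-k over calibration tokens.
    router_logits: [tokens, n_experts] collected from a calibration forward pass."""
    top = router_logits.topk(k, dim=-1).indices                     # [tokens, k]
    return torch.bincount(top.flatten(), minlength=router_logits.shape[-1])

def prune_least_used(counts, n_keep):
    """Return the indices of the n_keep most frequently routed experts."""
    return counts.topk(n_keep).indices.sort().values

# Toy calibration run: 1000 tokens routed over 8 experts.
logits = torch.randn(1000, 8)
keep = prune_least_used(expert_usage(logits), n_keep=6)
print("experts kept:", keep.tolist())    # the router output would then be restricted to these
```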

Quantization

[Arxiv'24.10] MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [Code]

[Arxiv'23] Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness

[Arxiv'23] QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models [Code]

[Arxiv'24.11] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

[Arxiv'24.9] Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

[Arxiv'24.6] Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [Code]

[INTERSPEECH'23] Compressed MoE ASR Model Based on Knowledge Distillation and Quantization

[Arxiv'23] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [Quantization]

[EMNLP'22] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production
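
Most of the quantization work above is weight-only, since expert weights dominate MoE memory while each expert processes only a fraction of tokens. Below is a minimal symmetric per-channel int8 round-to-nearest sketch of that idea; real methods add group-wise scales, calibration, outlier handling, and lower bit-widths.

```python
import torch

def quantize_expert_int8(w):
    """Symmetric per-output-channel int8 quantization of one expert weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0   # [out, 1]
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(512, 256)                 # one expert's projection matrix (toy size)
q, scale = quantize_expert_int8(w)
err = (dequantize(q, scale) - w).abs().mean()
print(f"int8 storage: {q.numel()} bytes (+ scales), mean abs error {err:.4f}")
```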

Knowledge Distillation

[Arxiv'24.10] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

[Arxiv'24.8] LaDiMo: Layer-wise Distillation Inspired MoEfier

[INTERSPEECH'23] Compressed MoE ASR Model Based on Knowledge Distillation and Quantization

[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [Code]

[MICROSOFT'22] Knowledge Distillation for Mixture of Experts Models in Speech Recognition

[Arxiv'22] One Student Knows All Experts Know: From Sparse to Dense

[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

[Arxiv'21] Efficient Large Scale Language Modeling with Mixtures of Experts

Low Rank Decomposition

[Arxiv'24.11] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [Code]

[ICLR'24.3] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [Code]

[Arxiv'22] Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [Code]

Expert Skip/Adaptive Gating

[Arxiv'24.8] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [Code]

[ACL'24.8] XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

[Arxiv'23] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [Code]

[Arxiv'23] Adaptive Gating in Mixture-of-Experts based Language Models

[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
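
Several of the papers above let the number of experts per token vary instead of fixing top-k, for example by keeping just enough experts to cover a probability-mass threshold. The sketch below shows one generic thresholding rule; the threshold and selection criterion are placeholders, not any single paper's recipe.

```python
import torch
import torch.nn.functional as F

def adaptive_expert_selection(router_logits, threshold=0.7, k_max=4):
    """Per token, keep the smallest set of top experts whose gate mass exceeds `threshold`."""
    probs = F.softmax(router_logits, dim=-1)
    top_p, top_i = probs.topk(k_max, dim=-1)         # [tokens, k_max]
    cum = top_p.cumsum(dim=-1)
    # keep expert j if the mass accumulated *before* it is still below the threshold,
    # so the top-1 expert is always kept
    keep = F.pad(cum[:, :-1], (1, 0)) < threshold
    return top_i, top_p.masked_fill(~keep, 0.0)      # expert ids and masked gate weights

logits = torch.randn(6, 8)                           # 6 tokens, 8 experts
idx, w = adaptive_expert_selection(logits)
print((w > 0).sum(dim=-1))                           # experts used per token, between 1 and k_max
```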

Expert Merging

[Arxiv'24.10] Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering

[EMNLP'23] Merging Experts into One: Improving Computational Efficiency of Mixture of Experts

[Arxiv'24.3] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

[Arxiv'22] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

[ICLR'24.5] Fusing Models with Complementary Expertise

[Arxiv'24.5] Learning More Generalized Experts by Merging Experts in Mixture-of-Experts
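
The simplest form of expert merging collapses a group of similar or rarely used experts into one via a (usage-weighted) average of their weights; the papers above refine how groups are chosen and how routing is adjusted afterwards. An illustrative sketch of that basic step:

```python
import torch

def merge_experts(expert_weights, group, usage=None):
    """Collapse the experts listed in `group` into one by (usage-weighted) averaging.
    expert_weights: list of [out, in] tensors; usage: optional weights aligned with `group`."""
    stacked = torch.stack([expert_weights[i] for i in group])          # [g, out, in]
    w = torch.ones(len(group)) if usage is None else torch.tensor(usage, dtype=torch.float)
    w = w / w.sum()
    return (w[:, None, None] * stacked).sum(dim=0)

experts = [torch.randn(512, 256) for _ in range(8)]                    # toy expert matrices
merged = merge_experts(experts, group=[2, 5, 7], usage=[0.1, 0.3, 0.6])
print(merged.shape)                                                    # torch.Size([512, 256])
```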

Sparse to Dense

[ACL'24.6] XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts

[Arxiv'23] ModuleFormer: Learning Modular Large Language Models from Uncurated Data

[Arxiv'23] Experts Weights Averaging: A New General Training Scheme for Vision Transformers

[JMLR'22] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

[Arxiv'22] One Student Knows All Experts Know: From Sparse to Dense

[Arxiv'22] Task-Specific Expert Pruning for Sparse Mixture-of-Experts

[Arxiv'21] Efficient Large Scale Language Modeling with Mixtures of Experts

System-Level Optimization

Expert Parallelism

[Arxiv'25.1] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing

[ASPLOS'25] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

[OpenReview'24.11] Toward Efficient Inference for Mixture of Experts

[Arxiv'24.10] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

[IPDPS'24.1] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

[Arxiv'24.10] Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

[IEEE'24.5] WDMoE: Wireless Distributed Large Language Models with Mixture of Experts

[Arxiv'24.11] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

[Arxiv'24.4] Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing

[Arxiv'24.10] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [Code] [MoE Module Design]

[Arxiv'24.11] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

[TSC'24.5] MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services

[Arxiv'24.11] HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy [Code]

[Arxiv'24.5] LocMoE: A Low-Overhead MoE for Large Language Model Training

[Arxiv'24.7] Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

[Arxiv'24.10] Scattered Mixture-of-Experts Implementation [Code]

[TPDS'24.4] MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

[INFOCOM'24.5] Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules

[EuroSys'24.4] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

[SIGCOMM'23] Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

[INFOCOM'23] PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining

[ATC'23] Accelerating Distributed MoE Training and Inference with Lina

[ATC'23] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization [Code]

[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

[SIGMOD'23] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement [Code]

[MLSys'23] Tutel: Adaptive Mixture-of-Experts at Scale [Code]

[OSDI'23] Optimizing Dynamic Neural Networks with Brainstorm [Code]

[ICS'23] A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

[CLUSTER'23] Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

[OSDI'22] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [Code]

[NeurIPS'22] TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training [Code]

[NeurIPS'22] Mixture-of-Experts with Expert Choice Routing

[PPoPP'22] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models [Code]

[PPoPP'22] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores

[SoCC'22] Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism

[PMLR'22] Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

[ICML'22] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [Code]

[Arxiv'22] HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System [Code]

[Arxiv'21] FastMoE: A Fast Mixture-of-Expert Training System [Code]

[PMLR'21] BASE Layers: Simplifying Training of Large, Sparse Models [Code]

[Arxiv'20] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
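
Most expert-parallel systems above share the same dispatch pattern: group each device's tokens by destination expert, exchange them with an all-to-all, run the local experts, then undo the permutation. The communication itself needs a multi-process launch, but the permutation and split-size bookkeeping can be sketched on a single process (illustrative only; expert-to-device placement is assumed contiguous, which not every system uses).

```python
import torch

def dispatch_metadata(expert_ids, n_experts, n_devices):
    """Sort tokens by destination expert and compute per-device send counts,
    i.e. the split sizes an expert-parallel all-to-all would use.
    expert_ids: [tokens] top-1 expert assignment for the local tokens."""
    order = torch.argsort(expert_ids)                       # permutation grouping tokens by expert
    counts = torch.bincount(expert_ids, minlength=n_experts)
    experts_per_device = n_experts // n_devices             # assumes contiguous placement
    send_counts = counts.view(n_devices, experts_per_device).sum(dim=1)
    return order, send_counts

expert_ids = torch.randint(0, 8, (32,))                     # 32 local tokens, 8 experts, top-1
order, send_counts = dispatch_metadata(expert_ids, n_experts=8, n_devices=4)
print(send_counts.tolist(), send_counts.sum().item())       # tokens sent to each of 4 devices
# In a real system, tokens[order] is exchanged with an all-to-all using these split sizes,
# and `order` is inverted to restore the original token order after expert computation.
```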

Expert Offloading

[Arxiv'24.10] ProMoE: Fast MoE-based LLM Serving using Proactive Caching

[NeurIPS'24.10] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [Code]

[Arxiv'24.11] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

[Arxiv'24.11] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

[Arxiv'24.11] HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [Quantization, Skip Expert]

[Arxiv'24.10] ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

[Arxiv'24.8] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [Code] [Adaptive Gating]

[Arxiv'24.9] Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

[MLSys'24.5] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [Code]

[Arxiv'24.8] MoE-Infinity: Offloading-Efficient MoE Model Serving [Code]

[Arxiv'24.2] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models [Code]

[Electronics'24.5] Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things

[ISCA'24.4] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [Code] [MoE Module]

[HPCA'24.3] Enabling Large Dynamic Neural Network Training with Learning-based Memory Management

[SC'24.11] APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

[Arxiv'23] Fast Inference of Mixture-of-Experts Language Models with Offloading [Code]

[Arxiv'23] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [Adaptive Gating]

[Arxiv'23] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [Quantization]

[ACL'24.5] SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget
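
The offloading systems above keep only a subset of expert weights resident on the GPU and fetch the rest from CPU memory or SSD on demand, typically combining caching with prediction-based prefetching. A toy LRU cache captures the core bookkeeping; real systems additionally overlap transfers with compute, prefetch predicted experts, and mix in quantization.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache for expert weights under a fixed GPU-resident budget."""
    def __init__(self, load_fn, capacity=4):
        self.load_fn = load_fn          # loads one expert's weights from CPU/disk
        self.capacity = capacity
        self.cache = OrderedDict()      # expert_id -> weights ("GPU-resident")
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)              # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)             # evict least recently used expert
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

cache = ExpertCache(load_fn=lambda e: f"weights_of_expert_{e}", capacity=4)
for e in [0, 1, 2, 0, 3, 4, 0, 1]:                         # routed expert ids over a few tokens
    cache.get(e)
print(f"hits={cache.hits} misses={cache.misses}")
```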

Others

[SoCC'24.11] MoEsaic: Shared Mixture of Experts

Hardware-Level Optimization

[MICRO'24.9] Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

[DAC'24.5] MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

[DAC'24.11] FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA

[ISSCC'24.2] Space-Mate: A 303.5mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing

[ICCAD'23] Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-Level Sparsity via Mixture-of-Experts [Code]

[NeurIPS'22] M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [Code]

Contribute
