LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE! | 📃 Technical Report

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows this list).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared-LLaMA and filtered datasets from SlimPajama.
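
To make the construction step concrete, here is a minimal PyTorch sketch of the idea: slice a dense SwiGLU FFN's intermediate neurons into disjoint experts and route each token through a top-K softmax gate. All class, argument, and variable names below are illustrative assumptions, not the repository's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch (not the repository's implementation): build an MoE FFN
# by partitioning a dense SwiGLU FFN's intermediate neurons into experts.
class MoEFFNFromDense(nn.Module):
    def __init__(self, dense_gate, dense_up, dense_down, num_experts=8, top_k=2):
        super().__init__()
        hidden = dense_gate.in_features
        inter = dense_gate.out_features            # intermediate (neuron) dimension
        assert inter % num_experts == 0
        size = inter // num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)   # top-K gate
        self.experts = nn.ModuleList()
        for e in range(num_experts):
            idx = torch.arange(e * size, (e + 1) * size)
            expert = nn.ModuleDict({
                "gate": nn.Linear(hidden, size, bias=False),
                "up": nn.Linear(hidden, size, bias=False),
                "down": nn.Linear(size, hidden, bias=False),
            })
            # copy the corresponding neuron slices from the dense FFN weights
            expert["gate"].weight.data.copy_(dense_gate.weight.data[idx])
            expert["up"].weight.data.copy_(dense_up.weight.data[idx])
            expert["down"].weight.data.copy_(dense_down.weight.data[:, idx])
            self.experts.append(expert)

    def forward(self, x):                          # x: (num_tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)
        top_probs, top_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_mask = (top_idx == e).any(dim=-1)
            if token_mask.any():
                h = F.silu(expert["gate"](x[token_mask])) * expert["up"](x[token_mask])
                w = top_probs[token_mask][top_idx[token_mask] == e].unsqueeze(-1)
                out[token_mask] += w * expert["down"](h)
        return out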

(Figure: MoE routing)

🔥 Features

  1. Lightweight Models: Only 3.0–3.5B parameters are activated, which makes the models friendly for deployment and research use.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitor Items (a computation sketch follows this list):
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling:
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
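
As a concrete illustration of the monitored quantities, the sketch below computes gate load, gate importance, and a Switch-style balance loss from per-token routing probabilities. The function name and tensor layout are assumptions for illustration, not the repository's logging code.

import torch

# Illustrative only: gate statistics from routing probabilities.
# probs:    (num_tokens, num_experts) softmax output of the router
# topk_idx: (num_tokens, top_k) indices of the experts selected per token
def gate_statistics(probs, topk_idx):
    num_experts = probs.size(-1)
    selected = torch.zeros_like(probs).scatter_(1, topk_idx, 1.0)
    load = selected.sum(dim=0)          # gate load: tokens routed to each expert
    importance = probs.sum(dim=0)       # gate importance: probability mass per expert
    # balance loss encourages uniform token load and probability mass
    frac_tokens = load / load.sum()
    frac_probs = importance / importance.sum()
    balance_loss = num_experts * torch.sum(frac_tokens * frac_probs)
    return load, importance, balance_loss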

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare a conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change it in the launching scripts).
  2. Add the proper environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone via SSH; see the GitHub docs on SSH keys).
  9. Change current directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Set up pre-commit hooks: pre-commit install (a quick sanity-check snippet follows this list).
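
After the last step, a quick sanity check can confirm that the key packages are importable. This is just a convenience sketch and assumes only the installs listed above.

# sanity_check.py -- minimal environment check (illustrative)
import torch
import flash_attn
import smoe  # installed in editable mode in step 10

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)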

📊 Model Performance

| Model                 | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| :-------------------- | :----------------: | :------: | :---------------: | :--------------: | :-------: |
| LLaMA-MoE-3.0B        | 2                  | 16       | 3.0B              | 🤗 base          | 🤗 SFT    |
| LLaMA-MoE-3.5B (4/16) | 4                  | 16       | 3.5B              | 🤗 base          | 🤗 SFT    |
| LLaMA-MoE-3.5B (2/8)  | 2                  | 8        | 3.5B              | 🤗 base          | 🤗 SFT    |
  • Foundation models

| Model                 | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :-------------------- | :-----: | :--: | :--: | :--------: | :---: | :--------: | :------------: | :----: | :--------: | :-----: | :-----: | :------: |
| OPT-2.7B              | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B           | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7  | 26.8 |
| INCITE-BASE-3B        | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2      | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B    | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B        | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8)  | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

Numbers in parentheses denote the number of few-shot examples.
  • SFT models

| Model                                  | MMLU  | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| :------------------------------------- | :---: | :---: | :-------: | :--------: | :------: |
| Sheared LLaMA-2.7B ShareGPT            | 28.41 | 41.04 | 71.21     | 47.65      | 3.79     |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70     | 49.00      | 4.06     |
| LLaMA-MoE-v1-3.0B (2/16)               | 23.61 | 43.43 | 72.28     | 44.24      | 4.15     |
| LLaMA-MoE-v1-3.5B (4/16)               | 26.49 | 48.29 | 75.10     | 45.91      | 4.60     |
| LLaMA-MoE-v1-3.5B (2/8)                | 25.53 | 45.99 | 74.95     | 44.39      | 4.72     |

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to the Expert Construction docs. A small sketch of the Neuron-Independent random split follows.
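
For intuition, the Neuron-Independent random split roughly amounts to shuffling the FFN's intermediate neuron indices and cutting them into equal groups. The sketch below is an illustration of that idea; the function name and arguments are assumptions, not the repository's API.

import torch

# Illustrative only: randomly partition FFN intermediate neurons into experts.
def random_neuron_split(intermediate_size, num_experts, seed=0):
    assert intermediate_size % num_experts == 0
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=generator)
    expert_size = intermediate_size // num_experts
    # each chunk holds the neuron indices assigned to one expert
    return [perm[i * expert_size:(i + 1) * expert_size] for i in range(num_experts)]

# e.g. LLaMA-2-7B's FFN intermediate size is 11008; 8 experts of 1376 neurons each
splits = random_neuron_split(11008, 8)
print([len(s) for s in splits])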

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file should end with *.jsonl, and each line looks like:

{"id": "id-info", "content": "raw text to be tokenized"}
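
For example, a tiny file in this format could be produced with a few lines of Python (the file path and record contents here are placeholders):

import json

# Illustrative only: write a small *.jsonl file in the expected format.
samples = [
    {"id": "arxiv-0001", "content": "raw text to be tokenized"},
    {"id": "arxiv-0002", "content": "another document's raw text"},
]
with open("/path_to_data/en_arxiv/part-000.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")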

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
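
To process every domain in one pass, the same command can be wrapped in a small loop. The sketch below simply replays the command above for each folder; the paths remain placeholders.

import subprocess
from pathlib import Path

# Illustrative only: tokenize each domain folder with the command shown above.
data_dir = Path("/path_to_data")
out_dir = Path("/path_to_data_tokenized")
tokenizer_path = "/path_to_tokenizer"

domains = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]
for domain in domains:
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", tokenizer_path,
            "-i", str(data_dir / domain),
            "-o", str(out_dir / domain),
        ],
        check=True,
    )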

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation

💬 Supervised Fine-Tuning (SFT)

We provide simple examples of SFT to build chatbots. Please refer to the SFT docs and the SFT scripts under scripts/ for more details.

📑 Citation

@article{llama-moe,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
  journal={arXiv preprint arXiv:2406.16554},
  year={2024},
  url={https://arxiv.org/abs/2406.16554},
}

LLaMA-MoE Team w/ ❤️