LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE! | 📃 Technical Report

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows this list).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared-LLaMA and filtered datasets from SlimPajama.
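
To make the construction step concrete, here is a minimal PyTorch sketch of the idea: slice a dense SwiGLU FFN's intermediate neurons into disjoint experts and route each token through a top-K softmax gate. All class, argument, and variable names below are illustrative assumptions, not the repository's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch (not the repository's implementation): build an MoE FFN
# by partitioning a dense SwiGLU FFN's intermediate neurons into experts.
class MoEFFNFromDense(nn.Module):
    def __init__(self, dense_gate, dense_up, dense_down, num_experts=8, top_k=2):
        super().__init__()
        hidden = dense_gate.in_features
        inter = dense_gate.out_features            # intermediate (neuron) dimension
        assert inter % num_experts == 0
        size = inter // num_experts
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)   # top-K gate
        self.experts = nn.ModuleList()
        for e in range(num_experts):
            idx = torch.arange(e * size, (e + 1) * size)
            expert = nn.ModuleDict({
                "gate": nn.Linear(hidden, size, bias=False),
                "up": nn.Linear(hidden, size, bias=False),
                "down": nn.Linear(size, hidden, bias=False),
            })
            # copy the corresponding neuron slices from the dense FFN weights
            expert["gate"].weight.data.copy_(dense_gate.weight.data[idx])
            expert["up"].weight.data.copy_(dense_up.weight.data[idx])
            expert["down"].weight.data.copy_(dense_down.weight.data[:, idx])
            self.experts.append(expert)

    def forward(self, x):                          # x: (num_tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)
        top_probs, top_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_mask = (top_idx == e).any(dim=-1)
            if token_mask.any():
                h = F.silu(expert["gate"](x[token_mask])) * expert["up"](x[token_mask])
                w = top_probs[token_mask][top_idx[token_mask] == e].unsqueeze(-1)
                out[token_mask] += w * expert["down"](h)
        return out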

(Figure: MoE routing)

🔥 Features

  1. Lightweight Models: Only 3.0–3.5B parameters are activated, which makes the models friendly for deployment and research use.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitor Items (a computation sketch follows this list):
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling:
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
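
As a concrete illustration of the monitored quantities, the sketch below computes gate load, gate importance, and a Switch-style balance loss from per-token routing probabilities. The function name and tensor layout are assumptions for illustration, not the repository's logging code.

import torch

# Illustrative only: gate statistics from routing probabilities.
# probs:    (num_tokens, num_experts) softmax output of the router
# topk_idx: (num_tokens, top_k) indices of the experts selected per token
def gate_statistics(probs, topk_idx):
    num_experts = probs.size(-1)
    selected = torch.zeros_like(probs).scatter_(1, topk_idx, 1.0)
    load = selected.sum(dim=0)          # gate load: tokens routed to each expert
    importance = probs.sum(dim=0)       # gate importance: probability mass per expert
    # balance loss encourages uniform token load and probability mass
    frac_tokens = load / load.sum()
    frac_probs = importance / importance.sum()
    balance_loss = num_experts * torch.sum(frac_tokens * frac_probs)
    return load, importance, balance_loss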

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare a conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change it in the launching scripts).
  2. Add the proper environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Apply the variables: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone via SSH; see the GitHub docs on SSH keys).
  9. Change current directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Set up pre-commit hooks: pre-commit install (a quick sanity-check snippet follows this list).
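
After the last step, a quick sanity check can confirm that the key packages are importable. This is just a convenience sketch and assumes only the installs listed above.

# sanity_check.py -- minimal environment check (illustrative)
import torch
import flash_attn
import smoe  # installed in editable mode in step 10

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)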

📊 Model Performance

| Model                 | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| :-------------------- | :----------------: | :------: | :---------------: | :--------------: | :-------: |
| LLaMA-MoE-3.0B        | 2                  | 16       | 3.0B              | 🤗 base          | 🤗 SFT    |
| LLaMA-MoE-3.5B (4/16) | 4                  | 16       | 3.5B              | 🤗 base          | 🤗 SFT    |
| LLaMA-MoE-3.5B (2/8)  | 2                  | 8        | 3.5B              | 🤗 base          | 🤗 SFT    |
  • Foundation models

| Model                 | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :-------------------- | :-----: | :--: | :--: | :--------: | :---: | :--------: | :------------: | :----: | :--------: | :-----: | :-----: | :------: |
| OPT-2.7B              | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B           | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7  | 26.8 |
| INCITE-BASE-3B        | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2      | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B    | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B        | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8)  | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

Numbers in parentheses denote the number of few-shot examples.
  • SFT models

| Model                                  | MMLU  | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| :------------------------------------- | :---: | :---: | :-------: | :--------: | :------: |
| Sheared LLaMA-2.7B ShareGPT            | 28.41 | 41.04 | 71.21     | 47.65      | 3.79     |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70     | 49.00      | 4.06     |
| LLaMA-MoE-v1-3.0B (2/16)               | 23.61 | 43.43 | 72.28     | 44.24      | 4.15     |
| LLaMA-MoE-v1-3.5B (4/16)               | 26.49 | 48.29 | 75.10     | 45.91      | 4.60     |
| LLaMA-MoE-v1-3.5B (2/8)                | 25.53 | 45.99 | 74.95     | 44.39      | 4.72     |

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to the Expert Construction docs. A small sketch of the Neuron-Independent random split follows.
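
For intuition, the Neuron-Independent random split roughly amounts to shuffling the FFN's intermediate neuron indices and cutting them into equal groups. The sketch below is an illustration of that idea; the function name and arguments are assumptions, not the repository's API.

import torch

# Illustrative only: randomly partition FFN intermediate neurons into experts.
def random_neuron_split(intermediate_size, num_experts, seed=0):
    assert intermediate_size % num_experts == 0
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=generator)
    expert_size = intermediate_size // num_experts
    # each chunk holds the neuron indices assigned to one expert
    return [perm[i * expert_size:(i + 1) * expert_size] for i in range(num_experts)]

# e.g. LLaMA-2-7B's FFN intermediate size is 11008; 8 experts of 1376 neurons each
splits = random_neuron_split(11008, 8)
print([len(s) for s in splits])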

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file should end with *.jsonl, and each line looks like:

{"id": "id-info", "content": "raw text to be tokenized"}
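
For example, a tiny file in this format could be produced with a few lines of Python (the file path and record contents here are placeholders):

import json

# Illustrative only: write a small *.jsonl file in the expected format.
samples = [
    {"id": "arxiv-0001", "content": "raw text to be tokenized"},
    {"id": "arxiv-0002", "content": "another document's raw text"},
]
with open("/path_to_data/en_arxiv/part-000.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")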

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
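
To process every domain in one pass, the same command can be wrapped in a small loop. The sketch below simply replays the command above for each folder; the paths remain placeholders.

import subprocess
from pathlib import Path

# Illustrative only: tokenize each domain folder with the command shown above.
data_dir = Path("/path_to_data")
out_dir = Path("/path_to_data_tokenized")
tokenizer_path = "/path_to_tokenizer"

domains = ["en_arxiv", "en_book", "en_c4", "en_cc", "en_stack", "en_wikipedia", "github"]
for domain in domains:
    subprocess.run(
        [
            "python", "-m", "smoe.utils.tokenize",
            "-f", "jsonl",
            "-t", tokenizer_path,
            "-i", str(data_dir / domain),
            "-o", str(out_dir / domain),
        ],
        check=True,
    )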

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation

💬 Supervised Fine-Tuning (SFT)

We provide simple examples of SFT to build chatbots. Please refer to the SFT docs and the SFT scripts under scripts/ for more details.

📑 Citation

@article{llama-moe,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
  journal={arXiv preprint arXiv:2406.16554},
  year={2024},
  url={https://arxiv.org/abs/2406.16554},
}

LLaMA-MoE Team w/ ❤️