Use a Python script to perform LLM model evaluation.
- Introduction from Papers with Code: Paper-with-code
- Introduction: Medium Article
- Hugging Face dataset: Huggingface Dataset
- Step 1: Download the model from Hugging Face. The following commands are an example for the Mistral-7B-v0.1 model:

```shell
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
```
- Step 2: Arrange the dataset from the tmmluplus data folder into the data_arrange folder.
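The arrangement step can be sketched as a small script. The layout assumed below (per-subject CSV files somewhere under the source folder, copied flat into data_arrange) is an assumption, since the repository does not document the exact structure:

```python
import shutil
from pathlib import Path

def arrange_dataset(src_dir: str, dst_dir: str) -> int:
    """Copy every CSV file found under src_dir into dst_dir (flat layout).

    NOTE: the flat destination layout is an assumption; adjust it if
    evaluation_hf_testing.py expects per-subject subfolders instead.
    Returns the number of files copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for csv_file in src.rglob("*.csv"):
        shutil.copy2(csv_file, dst / csv_file.name)
        copied += 1
    return copied
```

For example, `arrange_dataset("./tmmluplus/data", "./llm_evaluation_tmmluplus/data_arrange")` would mirror the folder names used in the commands below.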
- Step 3: Run the following command to generate predictions:

```shell
python3 evaluation_hf_testing.py \
    --model ./models/llama2-7b-hf \
    --data_dir ./llm_evaluation_tmmluplus/data_arrange/ \
    --save_dir ./llm_evaluation_tmmluplus/results/
```
- Step 4: Run the evaluation script to produce the output JSON file:
```shell
!python /content/llm_model_evaluation/catogories_result_eval.py \
    --catogory "mmlu" \
    --model ./models/llama2-7b-hf \
    --save_dir "./results/results_llama2-7b-hf"
```
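The prediction script itself is not shown here, but an MMLU-style harness typically renders each question as a lettered multiple-choice prompt and picks the option whose answer token the model scores highest. A minimal sketch of those two pieces (the function names and the score source are assumptions, not the repository's actual API):

```python
def format_mmlu_prompt(question: str, choices: list[str]) -> str:
    """Render one multiple-choice question in the standard MMLU layout:
    question text, lettered options, then an 'Answer:' cue for the model."""
    letters = "ABCD"
    lines = [question.strip()]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def pick_answer(option_scores: dict[str, float]) -> str:
    """Choose the option letter with the highest score, e.g. the model's
    next-token log-probability for each of 'A'/'B'/'C'/'D'."""
    return max(option_scores, key=option_scores.get)
```

Scoring the four single-letter continuations (rather than free-form generation) keeps the evaluation deterministic and cheap: one forward pass per question.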
- mmlu dataset:
  - Google Colab - mmlu
  - Google Colab - mmlu in the phi-2 model (the Colab free tier can run this example)
- tmmluplus dataset:

Evaluation results:

- mmlu dataset:
| Model | Weighted Accuracy | STEM | Humanities | Social Sciences | Other | Inference Time (s) |
|---|---|---|---|---|---|---|
| Mistral-7B-v0.1 | 0.6254 | 0.5252 | 0.5637 | 0.7358 | 0.7036 | 15624.0 |
- tmmluplus dataset:
| Model | Weighted Accuracy | STEM | Humanities | Social Sciences | Other | Inference Time (s) |
|---|---|---|---|---|---|---|
| Mistral-7B-v0.1 | - | - | - | - | - | - |
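The Weighted Accuracy column can be reproduced from per-subject results. A minimal sketch, assuming each subject reports a (correct, total) pair; the helper names are illustrative, not the repository's actual functions:

```python
def weighted_accuracy(per_subject: dict[str, tuple[int, int]]) -> float:
    """Overall accuracy weighted by subject size: total correct answers
    divided by total questions, rather than the mean of per-subject scores."""
    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    return correct / total if total else 0.0

def category_accuracy(per_subject: dict[str, tuple[int, int]],
                      category_subjects: list[str]) -> float:
    """Accuracy for one category (e.g. STEM), restricted to its subjects."""
    subset = {s: per_subject[s] for s in category_subjects if s in per_subject}
    return weighted_accuracy(subset)
```

Weighting by question count explains why the overall figure can sit between the STEM and social-science columns: larger subjects pull the average toward their own scores.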