Preferred Generation Benchmark

pfgen-benchmark is a benchmark designed to evaluate Japanese text generation, specifically for pretrained models. Unlike conventional benchmarks that use templates containing instructions, this benchmark relies solely on numerous examples. By conveying expectations such as the question-answering nature of the task, responses of approximately 100 characters, and outputs resembling formal public documents purely through examples, it minimizes the influence of differences in instructions or templates. Additionally, outputs are evaluated with an n-gram-based method, which, unlike the LLM-as-a-Judge approach, makes evaluation fast, inexpensive, and deterministic.
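
As a rough illustration of why n-gram-based evaluation is fast, cheap, and deterministic, the sketch below computes a simple character n-gram overlap between a candidate answer and reference answers. This is a toy stand-in, not the actual pfgen-bench metric; see the paper for the real method.

from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap_score(candidate: str, references: list[str], n: int = 3) -> float:
    """Fraction of candidate n-grams that also occur in the references."""
    cand = char_ngrams(candidate, n)
    ref = Counter()
    for r in references:
        ref |= char_ngrams(r, n)  # Counter union keeps element-wise maxima
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# No model calls are involved, so the score is immediate and reproducible.
print(overlap_score("富士山は日本一高い山です。", ["富士山は日本で最も高い山です。"]))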

To enable comparisons across as many models as possible, the leaderboard actively includes a wide range of models. These include openly accessible models, models cited in academic papers, and those announced by companies through press releases. Contributions of model outputs are encouraged, and results can be submitted via pull requests. For detailed instructions on how to contribute, please refer to the "How to Contribute" section.

See more details: arXiv:2502.09316 (in English) and the Jxiv preprint, doi:10.51094/jxiv.1008 (in Japanese).

License of LLM Output

All parts of this repository except the LLM-generated outputs are licensed under the Apache License, Version 2.0. Each LLM-generated output is subject to the license of the model that produced it.

How to Evaluate a Model

You can evaluate a model using either run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM). For the full list of parameters, see --help. The --num-trials parameter sets the number of prompt patterns for which the model generates answers; choose it by weighing execution time against the required accuracy.

For pretrained models:

# Run a model with the Hugging Face transformers library.
python ./run-hf.py --model=llm-jp/llm-jp-3-150m --num-trials=5
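
# Alternatively, run the same model with vLLM; run-vllm.py accepts the
# same arguments (see --help).
python ./run-vllm.py --model=llm-jp/llm-jp-3-150m --num-trials=5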

# Evaluate output and update leaderboard.
make

For instruction models:

# Run an instruction model with each of the three templates, using the
# Hugging Face library (run-vllm.py works the same way).
python ./run-hf.py --model=llm-jp/llm-jp-3-150m-instruct3 --num-trials=5
python ./run-hf.py --model=llm-jp/llm-jp-3-150m-instruct3 --num-trials=5 --mode=qa
python ./run-hf.py --model=llm-jp/llm-jp-3-150m-instruct3 --num-trials=5 --mode=chat

# Evaluate output and update leaderboard.
make

Command-line Arguments

  • --model={{model name}} ... The model name. (Required)
  • --path={{path to model directory}} ... The path to a local model directory. (Default: None)
  • --num-trials={{number of trials}} ... The number of trials. (Default: 10)
  • --mode={{mode}} ... One of completion, qa, or chat. (Default: completion)
    • qa and chat can be used only when the model has a chat template.
    • The instruction message is included in the user message for qa and in a system message for chat, as illustrated in the sketch below.
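
For illustration, the following minimal sketch shows how the two chat-template modes would place the instruction. The instruction and question strings are placeholders, not the benchmark's actual prompts:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-150m-instruct3")

instruction = "..."  # placeholder; pfgen-bench conveys expectations via examples
question = "..."     # placeholder benchmark question

# qa: the instruction is prepended to the user message.
qa_messages = [{"role": "user", "content": instruction + "\n" + question}]

# chat: the instruction is carried in a separate system message.
chat_messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": question},
]

# Either message list is rendered through the model's chat template.
prompt = tokenizer.apply_chat_template(
    chat_messages, tokenize=False, add_generation_prompt=True
)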

How to Contribute

Follow the instructions in the "How to Evaluate a Model" section to run the evaluation. This process will generate config.json and trials.jsonl.xz files under the result directory. Please create a pull request containing only these two files.

To ensure more accurate ranking among models, the number of trials (--num-trials) should be as large as possible, up to the limit of 100.

Leaderboard

🟢 ... completion mode, 💬 ... qa/chat mode. 👑 and 🎯 mark the system reference entries (ground truth and evaluation criteria). In the Score column, the ± value is an error estimate and the number after √ is the number of trials.

Rank Score                    Model                                       Length           Fluency Truthfulness Helpfulness
N/A 1.0501 (±0.0000/√1) 👑 system/ground-truth 100.0 (±0.0) 1.155 0.996 1.000
1 0.9338 (±0.0145/√10) 🟢 DeepSeek-V3 100.8 (±6.2) 1.009 0.969 0.822
2 0.9307 (±0.0083/√18) 💬 chatgpt-4o-latest 99.1 (±14.8) 0.954 0.968 0.870
3 0.9303 (±0.0083/√10) 💬 anthropic/claude-3-5-sonnet-20240620 102.2 (±10.4) 0.949 0.959 0.883
4 0.8615 (±0.0092/√10) 💬 openai/gpt-4o 84.5 (±18.6) 0.919 0.980 0.686
5 0.8584 (±0.0163/√10) 💬 deepseek-ai/DeepSeek-R1 106.1 (±13.5) 0.839 0.929 0.807
N/A 0.8494 (±0.0253/√1000) 🎯 system/criteria 100.0 (±3.4) 0.936 0.978 0.505
6 0.8359 (±0.0216/√10) 💬 Qwen/Qwen-Max-2025-01-25 89.6 (±18.7) 0.864 0.968 0.676
7 0.8352 (±0.0107/√10) 💬 Qwen/Qwen-Max 88.8 (±18.7) 0.862 0.964 0.679
8 0.8279 (±0.0131/√10) 💬 MiniMax-Text-01 77.8 (±22.2) 0.858 0.988 0.638
9 0.8270 (±0.0229/√10) 💬 anthropic/claude-3-opus-20240229 102.3 (±9.5) 0.911 0.944 0.627
10 0.8192 (±0.0207/√10) 💬 google/gemini-1.5-pro-002 76.3 (±17.4) 0.826 0.976 0.656
11 0.8157 (±0.0119/√10) 💬 MiniMax-Text-01 78.9 (±25.5) 0.850 0.986 0.611
12 0.8036 (±0.0133/√10) 💬 openai/gpt-4-turbo 86.5 (±17.4) 0.820 0.959 0.632
13 0.7916 (±0.0146/√10) 💬 openai/gpt-4 107.2 (±11.6) 0.888 0.951 0.536
14 0.7827 (±0.0129/√100) 💬 Qwen/Qwen2.5-72B-Instruct 98.7 (±14.8) 0.871 0.936 0.540
15 0.7789 (±0.0213/√100) 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 109.1 (±36.8) 0.890 0.941 0.506
16 0.7782 (±0.0154/√100) 💬 Qwen/Qwen2.5-72B-Instruct 96.5 (±17.8) 0.847 0.939 0.549
17 0.7773 (±0.0168/√100) 💬 pfnet/plamo-1.0-prime 178.2 (±114.5) 0.874 0.942 0.516
18 0.7768 (±0.0113/√5) 💬 mlx-community/Qwen2.5-72B-Instruct-4bit 100.8 (±17.7) 0.860 0.933 0.538
19 0.7766 (±0.0276/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-hf 104.1 (±17.9) 0.884 0.938 0.507
20 0.7756 (±0.0264/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... 104.1 (±18.5) 0.878 0.938 0.510
21 0.7748 (±0.0000/√1) 💬 openai/chatgpt-o1 76.3 (±17.7) 0.755 0.960 0.610
22 0.7748 (±0.0299/√100) 🟢 sbintuitions/sarashina2-8x70b 105.7 (±21.5) 0.867 0.937 0.520
23 0.7735 (±0.0254/√50) 🟢 abeja/ABEJA-Qwen2.5-32b-Japanese-v0.1 154.6 (±121.1) 0.845 0.923 0.553
24 0.7650 (±0.0263/√100) 🟢 tokyotech-llm/Swallow-70b-instruct-hf 102.5 (±14.4) 0.872 0.929 0.494
25 0.7643 (±0.0000/√1) 💬 openai/chatgpt-o1-pro 79.5 (±17.3) 0.748 0.955 0.590
26 0.7628 (±0.0275/√100) 🟢 tokyotech-llm/Swallow-70b-hf 103.5 (±16.1) 0.876 0.930 0.483
27 0.7601 (±0.0289/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 106.3 (±21.0) 0.864 0.925 0.492
28 0.7538 (±0.0251/√100) 🟢 turing-motors/Llama-3-heron-brain-70B... 101.1 (±16.9) 0.857 0.925 0.479
29 0.7526 (±0.0243/√100) 🟢 pfnet/plamo-2-8b 103.7 (±17.3) 0.863 0.939 0.456
30 0.7509 (±0.0253/√100) 🟢 sbintuitions/sarashina2.2-3b-instruct... 119.0 (±25.1) 0.844 0.893 0.515
31 0.7501 (±0.0237/√100) 💬 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 181.0 (±87.4) 0.847 0.923 0.480
32 0.7469 (±0.0270/√100) 🟢 pfnet/plamo-100b-base 115.2 (±64.0) 0.861 0.920 0.460
33 0.7458 (±0.0244/√100) 🟢 llm-jp/llm-jp-3-172b-instruct2 105.8 (±21.8) 0.850 0.929 0.458
34 0.7444 (±0.0260/√100) 🟢 sbintuitions/sarashina2-70b 120.0 (±49.4) 0.825 0.923 0.485
35 0.7423 (±0.0302/√100) 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... 199.2 (±110.3) 0.817 0.905 0.505
36 0.7407 (±0.0170/√10) 💬 google/gemini-1.5-flash-002 68.4 (±20.2) 0.742 0.960 0.519
37 0.7392 (±0.0232/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 93.6 (±23.5) 0.847 0.941 0.429
38 0.7370 (±0.0217/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 97.5 (±19.8) 0.846 0.932 0.433
39 0.7365 (±0.0218/√100) 🟢 CohereForAI/c4ai-command-r-plus 107.5 (±42.3) 0.818 0.913 0.478
40 0.7336 (±0.0254/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 108.2 (±24.7) 0.837 0.908 0.456
41 0.7329 (±0.0191/√100) 💬 mistralai/Mistral-Large-Instruct-2411 124.5 (±28.2) 0.828 0.902 0.469
42 0.7325 (±0.0229/√100) 🟢 llm-jp/llm-jp-3-13b-instruct3 110.0 (±21.9) 0.823 0.905 0.469
43 0.7320 (±0.0201/√10) 💬 anthropic/claude-3-sonnet-20240229 114.3 (±18.9) 0.810 0.910 0.476
44 0.7297 (±0.0225/√100) 🟢 sbintuitions/sarashina2.2-3b 108.3 (±19.5) 0.817 0.905 0.467
45 0.7294 (±0.0229/√100) 🟢 llm-jp/llm-jp-3-172b 101.8 (±17.4) 0.826 0.921 0.441
46 0.7273 (±0.0233/√10) 💬 google/gemini-2.0-flash-exp 60.7 (±16.3) 0.727 0.978 0.476
47 0.7262 (±0.0215/√100) 💬 mistralai/Mistral-Large-Instruct-2411 120.8 (±25.8) 0.822 0.899 0.458
48 0.7250 (±0.0261/√100) 🟢 llm-jp/llm-jp-3-13b-instruct2 108.8 (±21.4) 0.827 0.906 0.442
49 0.7249 (±0.0247/√100) 💬 cyberagent/calm3-22b-chat 136.8 (±46.7) 0.813 0.907 0.455
50 0.7246 (±0.0250/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 89.8 (±33.9) 0.812 0.940 0.422
51 0.7217 (±0.0219/√100) 🟢 cyberagent/calm3-22b-chat 105.0 (±13.1) 0.824 0.916 0.425
52 0.7194 (±0.0321/√10) 💬 google/text-bison 77.6 (±31.9) 0.790 0.968 0.401
53 0.7191 (±0.0194/√100) 💬 sbintuitions/sarashina2.2-3b-instruct... 171.7 (±62.0) 0.814 0.879 0.464
54 0.7185 (±0.0000/√1) 💬 elyza/Llama-3-ELYZA-JP-70B 98.6 (±33.8) 0.837 0.931 0.388
55 0.7175 (±0.0257/√100) 🟢 nvidia/nemotron-4-340b-instruct 107.3 (±28.4) 0.816 0.908 0.429
56 0.7174 (±0.0243/√100) 🟢 llm-jp/llm-jp-3-13b-instruct 108.3 (±21.1) 0.807 0.906 0.439
57 0.7166 (±0.0305/√100) 🟢 llm-jp/llm-jp-3-172b-beta2 101.6 (±20.5) 0.814 0.918 0.417
58 0.7086 (±0.0192/√100) 🟢 mistralai/Mistral-Large-Instruct-2411 104.5 (±16.2) 0.810 0.900 0.415
59 0.7084 (±0.0207/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 95.9 (±19.7) 0.835 0.930 0.360
60 0.7073 (±0.0239/√100) 🟢 llm-jp/llm-jp-3-172b-instruct3 108.6 (±23.1) 0.799 0.908 0.414
61 0.7061 (±0.0205/√100) 🟢 AXCXEPT/EZO-Qwen2.5-72B-Instruct 140.5 (±62.0) 0.796 0.894 0.428
62 0.7046 (±0.0248/√100) 💬 nvidia/nemotron-4-340b-instruct 94.5 (±39.1) 0.768 0.910 0.435
63 0.7029 (±0.0258/√100) 🟢 mlx-community/plamo-2-8b-4bit 105.1 (±36.1) 0.821 0.909 0.379
64 0.7024 (±0.0238/√100) 🟢 rinna/nekomata-14b 104.3 (±18.0) 0.812 0.912 0.383
65 0.7023 (±0.0271/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 112.6 (±33.2) 0.818 0.901 0.388
66 0.7016 (±0.0212/√100) 🟢 llm-jp/llm-jp-3-7.2b-instruct2 106.5 (±20.0) 0.810 0.902 0.393
67 0.7008 (±0.0318/√100) 🟢 tokyotech-llm/Swallow-13b-instruct-hf 104.5 (±13.0) 0.812 0.898 0.392
68 0.7000 (±0.0271/√100) 💬 llm-jp/llm-jp-3-13b-instruct 192.0 (±114.0) 0.780 0.890 0.430
69 0.6990 (±0.0288/√100) 🟢 tokyotech-llm/Swallow-13b-NVE-hf 106.2 (±19.2) 0.820 0.906 0.371
70 0.6980 (±0.0252/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 98.7 (±50.0) 0.798 0.927 0.369
71 0.6969 (±0.0219/√100) 🟢 llm-jp/llm-jp-3-7.2b-instruct3 107.3 (±18.4) 0.798 0.896 0.396
72 0.6958 (±0.0236/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 92.9 (±20.0) 0.814 0.931 0.343
73 0.6945 (±0.0300/√100) 🟢 sbintuitions/sarashina2-13b 107.8 (±28.3) 0.794 0.900 0.390
74 0.6938 (±0.0217/√100) 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 111.5 (±22.8) 0.800 0.893 0.389
75 0.6924 (±0.0232/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 74.1 (±31.4) 0.755 0.948 0.373
76 0.6891 (±0.0255/√100) 🟢 tokyotech-llm/Swallow-13b-hf 104.8 (±17.7) 0.811 0.901 0.355
77 0.6853 (±0.0201/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 96.6 (±18.8) 0.815 0.919 0.322
78 0.6844 (±0.0239/√100) 🟢 llm-jp/llm-jp-3-172b-beta1 103.0 (±16.0) 0.785 0.900 0.369
79 0.6820 (±0.0232/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct 182.5 (±105.7) 0.781 0.883 0.381
80 0.6808 (±0.0228/√100) 💬 llm-jp/llm-jp-3-172b-instruct2 254.5 (±138.6) 0.780 0.887 0.376
81 0.6794 (±0.0243/√100) 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... 128.8 (±72.2) 0.764 0.883 0.391
82 0.6787 (±0.0267/√100) 💬 llm-jp/llm-jp-3-13b-instruct3 245.0 (±129.9) 0.770 0.875 0.391
83 0.6764 (±0.0217/√100) 🟢 llm-jp/llm-jp-3-7.2b-instruct 104.7 (±19.4) 0.775 0.890 0.364
84 0.6759 (±0.0232/√10) 🟢 meta-llama/Meta-Llama-3.1-405B 101.2 (±15.1) 0.767 0.892 0.368
85 0.6746 (±0.0215/√100) 💬 llm-jp/llm-jp-3-172b-instruct3 216.1 (±98.9) 0.756 0.875 0.393
86 0.6737 (±0.0276/√100) 🟢 sbintuitions/sarashina1-13b 105.4 (±23.4) 0.775 0.882 0.364
87 0.6715 (±0.0284/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 107.5 (±22.2) 0.787 0.881 0.347
88 0.6697 (±0.0277/√100) 🟢 nvidia/nemotron-4-340b-base 106.9 (±26.5) 0.768 0.884 0.357
89 0.6677 (±0.0250/√100) 🟢 llm-jp/llm-jp-3-13b 101.1 (±9.7) 0.770 0.884 0.349
90 0.6673 (±0.0221/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct3 234.2 (±116.7) 0.768 0.872 0.363
91 0.6673 (±0.0225/√100) 🟢 sbintuitions/sarashina1-65b 104.2 (±20.0) 0.776 0.894 0.332
92 0.6663 (±0.0262/√100) 🟢 tokyotech-llm/Swallow-7b-plus-hf 106.1 (±18.1) 0.780 0.880 0.339
93 0.6640 (±0.0292/√100) 💬 llm-jp/llm-jp-3-13b-instruct2 256.5 (±153.0) 0.755 0.870 0.368
94 0.6634 (±0.0252/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct2 249.5 (±141.8) 0.768 0.872 0.351
95 0.6625 (±0.0140/√10) 💬 anthropic/claude-3-haiku-20240307 81.9 (±31.0) 0.747 0.943 0.298
96 0.6624 (±0.0000/√1) 💬 openai/chatgpt-o3-mini-high 68.1 (±14.5) 0.632 0.925 0.430
97 0.6616 (±0.0378/√10) 💬 google/gemini-1.0-pro-002 118.7 (±90.9) 0.689 0.894 0.402
98 0.6590 (±0.0133/√10) 💬 google/gemini-2.0-flash-thinking-exp-... 49.8 (±11.0) 0.639 0.984 0.354
99 0.6572 (±0.0518/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 108.9 (±63.7) 0.764 0.895 0.313
100 0.6494 (±0.0260/√100) 🟢 Qwen/Qwen2.5-72b 106.8 (±48.2) 0.749 0.863 0.337
101 0.6473 (±0.0182/√100) 💬 Qwen/Qwen2-72B-Instruct 108.7 (±24.8) 0.703 0.853 0.386
102 0.6456 (±0.0255/√100) 🟢 sbintuitions/sarashina2-7b 105.6 (±22.8) 0.746 0.874 0.316
103 0.6447 (±0.0251/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 74.3 (±31.3) 0.706 0.934 0.294
104 0.6445 (±0.0241/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 110.3 (±28.4) 0.748 0.867 0.319
105 0.6420 (±0.0259/√100) 🟢 microsoft/phi-4 104.2 (±15.2) 0.754 0.864 0.309
106 0.6407 (±0.0242/√100) 🟢 AXCXEPT/Llama-3.1-70B-EZO-1.1-it 147.8 (±92.9) 0.721 0.844 0.357
107 0.6406 (±0.0139/√100) 💬 Qwen/QwQ-32B-Preview 119.1 (±72.2) 0.730 0.897 0.294
108 0.6399 (±0.1763/√100) 💬 turing-motors/Llama-3-heron-brain-70B... 155.4 (±101.8) 0.718 0.805 0.397
109 0.6379 (±0.0263/√100) 🟢 llm-jp/llm-jp-3-3.7b-instruct2 106.8 (±22.2) 0.743 0.867 0.304
110 0.6368 (±0.0207/√100) 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 105.5 (±21.0) 0.753 0.870 0.287
111 0.6350 (±0.0260/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... 104.0 (±16.9) 0.755 0.863 0.287
112 0.6337 (±0.0265/√100) 🟢 tokyotech-llm/Swallow-7b-hf 106.5 (±18.7) 0.746 0.866 0.289
113 0.6335 (±0.0252/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 103.2 (±16.6) 0.766 0.872 0.263
114 0.6318 (±0.0264/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... 119.2 (±74.3) 0.724 0.861 0.311
115 0.6311 (±0.0226/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct 193.2 (±119.8) 0.732 0.847 0.314
116 0.6310 (±0.0127/√100) 💬 Qwen/Qwen2.5-32B-Instruct 75.4 (±19.3) 0.634 0.898 0.360
117 0.6303 (±0.0252/√100) 🟢 cyberagent/calm2-7b-chat-dpo-experime... 110.0 (±24.3) 0.735 0.863 0.293
118 0.6302 (±0.0233/√100) 🟢 llm-jp/llm-jp-3-3.7b-instruct 102.9 (±18.0) 0.738 0.863 0.289
119 0.6297 (±0.0150/√100) 💬 Qwen/Qwen2.5-32B-Instruct 71.1 (±18.7) 0.634 0.906 0.349
120 0.6295 (±0.0226/√100) 💬 microsoft/phi-4 117.8 (±34.9) 0.706 0.843 0.340
121 0.6294 (±0.0267/√100) 💬 microsoft/phi-4 117.8 (±37.7) 0.705 0.846 0.337
122 0.6291 (±0.0207/√100) 💬 Qwen/QwQ-32B-Preview 229.6 (±135.9) 0.719 0.867 0.301
123 0.6285 (±0.0239/√100) 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge 124.7 (±47.2) 0.725 0.866 0.295
124 0.6279 (±0.0252/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-hf 108.1 (±24.5) 0.747 0.870 0.267
125 0.6274 (±0.0772/√100) 🟢 rinna/nekomata-14b-instruction 98.3 (±24.2) 0.732 0.855 0.295
126 0.6267 (±0.0263/√100) 🟢 sbintuitions/sarashina1-7b 106.7 (±25.1) 0.737 0.866 0.276
127 0.6252 (±0.0246/√100) 🟢 karakuri-ai/karakuri-lm-70b-v0.1 106.0 (±27.0) 0.713 0.852 0.310
128 0.6202 (±0.0251/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.3 (±19.2) 0.733 0.848 0.280
129 0.6197 (±0.0258/√100) 🟢 stockmark/stockmark-13b 108.9 (±49.3) 0.727 0.860 0.272
130 0.6191 (±0.0284/√100) 🟢 stockmark/stockmark-13b-instruct 108.0 (±46.8) 0.720 0.859 0.278
131 0.6178 (±0.0230/√100) 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 104.7 (±27.5) 0.706 0.842 0.306
132 0.6176 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-7b-instruct-hf 106.3 (±17.8) 0.716 0.851 0.285
133 0.6167 (±0.0213/√100) 💬 sbintuitions/sarashina2.2-3b-instruct... 491.1 (±121.0) 0.718 0.829 0.302
134 0.6160 (±0.0195/√100) 🟢 AXCXEPT/EZO-Qwen2.5-32B-Instruct 196.8 (±119.0) 0.690 0.848 0.310
135 0.6149 (±0.0153/√100) 💬 Qwen/Qwen2.5-14B-Instruct 76.5 (±18.4) 0.644 0.893 0.308
136 0.6136 (±0.0143/√10) 💬 openai/gpt-35-turbo 64.0 (±22.2) 0.658 0.944 0.239
137 0.6105 (±0.0288/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct3 189.9 (±101.5) 0.697 0.834 0.301
138 0.6095 (±0.0225/√100) 💬 rinna/llama-3-youko-70b-instruct 135.3 (±46.8) 0.683 0.817 0.328
139 0.6091 (±0.0277/√100) 🟢 pfnet/nekomata-14b-pfn-qfin 85.1 (±28.4) 0.672 0.893 0.262
140 0.6087 (±0.1545/√100) 💬 tokyotech-llm/Swallow-70b-NVE-instruc... 135.7 (±74.0) 0.678 0.804 0.344
141 0.6085 (±0.0387/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct2 207.7 (±130.6) 0.692 0.832 0.301
142 0.6085 (±0.0264/√100) 🟢 llm-jp/llm-jp-3-7.2b 104.0 (±14.7) 0.713 0.851 0.262
143 0.6063 (±0.0213/√100) 💬 Qwen/Qwen2.5-14B-Instruct 80.0 (±21.8) 0.639 0.889 0.290
144 0.6060 (±0.0238/√100) 🟢 Qwen/Qwen2-72B 105.5 (±23.5) 0.703 0.836 0.279
145 0.6037 (±0.0239/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf 105.7 (±16.4) 0.719 0.847 0.245
146 0.6030 (±0.0287/√100) 💬 karakuri-ai/karakuri-lm-8x7b-instruct... 197.4 (±72.1) 0.703 0.832 0.274
147 0.6029 (±0.0223/√100) 🟢 Qwen/Qwen2-72B-Instruct 106.0 (±26.7) 0.684 0.825 0.299
148 0.5987 (±0.0264/√100) 🟢 cyberagent/calm2-7b-chat 107.5 (±20.8) 0.701 0.843 0.253
149 0.5971 (±0.0235/√100) 🟢 stockmark/stockmark-100b 107.2 (±24.7) 0.709 0.842 0.240
150 0.5945 (±0.1370/√100) 💬 tokyotech-llm/Swallow-13b-instruct-hf 167.3 (±116.4) 0.670 0.790 0.323
151 0.5921 (±0.0211/√100) 🟢 elyza/Llama-3-ELYZA-JP-8B 115.6 (±44.8) 0.685 0.831 0.260
152 0.5866 (±0.0202/√100) 🟢 Qwen/Qwen2.5-32b 104.7 (±26.9) 0.690 0.820 0.250
153 0.5852 (±0.0208/√100) 💬 llm-jp/llm-jp-3-13b-instruct3 347.6 (±147.8) 0.672 0.806 0.277
154 0.5832 (±0.0220/√100) 🟢 augmxnt/shisa-gamma-7b-v1 106.7 (±21.8) 0.706 0.831 0.213
155 0.5825 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 106.4 (±25.9) 0.702 0.828 0.218
156 0.5811 (±0.0218/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 103.6 (±15.6) 0.675 0.816 0.252
157 0.5808 (±0.0220/√100) 🟢 stabilityai/japanese-stablelm-base-ga... 106.9 (±17.2) 0.690 0.822 0.230
158 0.5806 (±0.0254/√100) 🟢 sbintuitions/sarashina2.2-1b 107.4 (±26.2) 0.692 0.827 0.223
159 0.5793 (±0.0202/√100) 💬 llm-jp/llm-jp-3-172b-instruct3 372.5 (±133.4) 0.655 0.806 0.277
160 0.5783 (±0.0217/√100) 🟢 microsoft/Phi-3-medium-4k-instruct 105.9 (±20.0) 0.675 0.826 0.234
161 0.5777 (±0.0228/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 105.2 (±14.5) 0.675 0.811 0.247
162 0.5754 (±0.0182/√100) 🟢 Xwin-LM/Xwin-LM-70B-V0.1 105.4 (±26.8) 0.681 0.833 0.213
163 0.5737 (±0.0209/√100) 🟢 microsoft/Phi-3-medium-128k-instruct 107.7 (±24.7) 0.674 0.825 0.223
164 0.5735 (±0.0216/√100) 🟢 google/gemma-2-9b-it 95.9 (±22.0) 0.674 0.837 0.209
165 0.5734 (±0.1980/√100) 💬 tokyotech-llm/Swallow-70b-instruct-hf 130.9 (±105.0) 0.636 0.758 0.326
166 0.5724 (±0.0209/√100) 🟢 rinna/llama-3-youko-70b 104.6 (±20.6) 0.681 0.826 0.210
167 0.5716 (±0.0230/√100) 🟢 sbintuitions/sarashina2.1-1b 116.9 (±41.3) 0.668 0.821 0.226
168 0.5712 (±0.0194/√100) 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 244.4 (±49.3) 0.678 0.816 0.220
169 0.5710 (±0.0198/√100) 🟢 mistralai/Mistral-Small-24B-Instruct-... 114.2 (±30.2) 0.684 0.797 0.232
170 0.5710 (±0.0226/√100) 🟢 rinna/llama-3-youko-8b-instruct 111.6 (±23.4) 0.672 0.809 0.232
171 0.5659 (±0.0234/√100) 🟢 meta-llama/Meta-Llama-3.1-70B 103.7 (±20.1) 0.665 0.822 0.211
172 0.5656 (±0.0226/√100) 💬 meta-llama/Meta-Llama-3-70B-Instruct 110.2 (±36.4) 0.665 0.777 0.254
173 0.5646 (±0.0240/√100) 💬 microsoft/Phi-3-medium-4k-instruct 131.3 (±50.6) 0.633 0.807 0.253
174 0.5642 (±0.0261/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.1 (±19.5) 0.646 0.799 0.247
175 0.5620 (±0.0254/√100) 🟢 meta-llama/Meta-Llama-3-70B 102.0 (±17.2) 0.664 0.809 0.213
176 0.5590 (±0.0456/√100) 💬 mistralai/Mistral-Small-24B-Instruct-... 105.3 (±42.8) 0.648 0.794 0.235
177 0.5588 (±0.0230/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.6 (±17.0) 0.673 0.812 0.191
178 0.5574 (±0.0216/√100) 🟢 rinna/nekomata-7b 108.4 (±18.0) 0.678 0.816 0.178
179 0.5569 (±0.0244/√100) 🟢 rinna/llama-3-youko-8b 104.9 (±17.0) 0.670 0.813 0.188
180 0.5568 (±0.0200/√100) 🟢 meta-llama/Meta-Llama-3-70B-Instruct 111.8 (±55.9) 0.655 0.780 0.236
181 0.5562 (±0.0952/√100) 💬 stockmark/stockmark-13b-instruct 137.2 (±89.6) 0.633 0.798 0.238
182 0.5540 (±0.0773/√100) 💬 mistralai/Mistral-Small-24B-Instruct-... 101.9 (±38.4) 0.640 0.773 0.248
183 0.5537 (±0.0204/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... 114.4 (±48.5) 0.657 0.812 0.192
184 0.5531 (±0.0215/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct3 389.6 (±127.7) 0.641 0.787 0.231
185 0.5516 (±0.1016/√100) 💬 cyberagent/calm2-7b-chat-dpo-experime... 181.1 (±120.1) 0.644 0.775 0.236
186 0.5514 (±0.0270/√100) 💬 llm-jp/llm-jp-3-13b-instruct2 365.5 (±161.5) 0.630 0.783 0.241
187 0.5511 (±0.0203/√100) 🟢 google/gemma-2-27b-it 110.3 (±56.8) 0.599 0.836 0.218
188 0.5500 (±0.0605/√100) 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... 156.5 (±106.5) 0.633 0.780 0.237
189 0.5500 (±0.0467/√100) 💬 tokyotech-llm/Swallow-7b-instruct-hf 121.9 (±77.3) 0.612 0.812 0.225
190 0.5486 (±0.0251/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct2 418.2 (±130.6) 0.637 0.786 0.223
191 0.5469 (±0.0271/√100) 💬 llm-jp/llm-jp-3-172b-instruct2 372.9 (±157.4) 0.619 0.780 0.242
192 0.5465 (±0.0244/√100) 🟢 SakanaAI/TinySwallow-1.5B-Instruct 105.0 (±26.9) 0.657 0.807 0.176
193 0.5437 (±0.0218/√100) 💬 Xwin-LM/Xwin-LM-70B-V0.1 200.7 (±63.1) 0.652 0.782 0.198
194 0.5436 (±0.0246/√100) 🟢 llm-jp/llm-jp-3-3.7b 101.3 (±10.4) 0.646 0.795 0.189
195 0.5432 (±0.0208/√100) 💬 CohereForAI/c4ai-command-r-plus 48.9 (±16.5) 0.505 0.931 0.194
196 0.5429 (±0.0238/√100) 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct 157.6 (±221.7) 0.636 0.770 0.222
197 0.5419 (±0.0234/√100) 🟢 Qwen/Qwen2.5-14B 109.3 (±43.0) 0.648 0.790 0.188
198 0.5416 (±0.0232/√100) 🟢 llm-jp/llm-jp-3-1.8b-instruct2 114.0 (±31.8) 0.651 0.797 0.177
199 0.5406 (±0.0287/√100) 💬 llm-jp/llm-jp-3-13b-instruct 382.1 (±163.5) 0.615 0.771 0.236
200 0.5387 (±0.0269/√100) 💬 rinna/llama-3-youko-8b-instruct 265.4 (±104.1) 0.635 0.771 0.210
201 0.5386 (±0.0215/√100) 💬 microsoft/Phi-3-medium-128k-instruct 91.9 (±44.7) 0.589 0.834 0.193
202 0.5377 (±0.0481/√100) 💬 meta-llama/Meta-Llama-3.1-70B-Instruct 135.8 (±194.8) 0.617 0.779 0.218
203 0.5359 (±0.0214/√100) 🟢 llm-jp/llm-jp-3-1.8b-instruct3 117.5 (±35.4) 0.640 0.786 0.181
204 0.5349 (±0.0203/√100) 💬 google/gemma-2-27b-it 74.7 (±42.7) 0.545 0.874 0.186
205 0.5347 (±0.0188/√100) 🟢 rinna/youri-7b 107.6 (±16.3) 0.654 0.802 0.148
206 0.5330 (±0.0238/√100) 💬 llm-jp/llm-jp-3-7.2b-instruct 406.7 (±152.5) 0.621 0.770 0.208
207 0.5316 (±0.0273/√100) 💬 lightblue/karasu-7B-chat 111.8 (±46.5) 0.621 0.800 0.174
208 0.5301 (±0.0476/√100) 💬 lightblue/karasu-7B-chat-plus 107.1 (±46.7) 0.615 0.798 0.178
209 0.5283 (±0.0309/√100) 💬 SakanaAI/TinySwallow-1.5B-Instruct 117.7 (±61.8) 0.616 0.801 0.168
210 0.5283 (±0.0585/√100) 💬 lightblue/karasu-7B-chat-plus-unleashed 104.6 (±45.3) 0.614 0.794 0.177
211 0.5223 (±0.0441/√100) 🟢 Fugaku-LLM/Fugaku-LLM-13B 94.2 (±20.5) 0.588 0.818 0.161
212 0.5199 (±0.0281/√100) 🟢 llm-jp/llm-jp-3-172b-alpha2 104.6 (±22.2) 0.606 0.782 0.171
213 0.5190 (±0.0203/√100) 🟢 mistralai/Mistral-Small-24B-Base-2501 107.2 (±32.7) 0.626 0.771 0.160
214 0.5179 (±0.0264/√100) 🟢 cyberagent/calm2-7b 106.0 (±26.2) 0.601 0.770 0.182
215 0.5164 (±0.0209/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 109.3 (±33.5) 0.606 0.788 0.155
216 0.5143 (±0.0212/√100) 🟢 llm-jp/llm-jp-13b-v2.0 104.1 (±11.2) 0.604 0.760 0.180
217 0.5143 (±0.0170/√100) 🟢 moneyforward/houou-instruction-7b-v3 112.2 (±37.8) 0.629 0.778 0.135
218 0.5122 (±0.0132/√100) 💬 Qwen/Qwen2.5-7B-Instruct 69.5 (±28.7) 0.557 0.847 0.132
219 0.5119 (±0.0190/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct3 360.0 (±134.7) 0.594 0.753 0.189
220 0.5111 (±0.0203/√100) 🟢 llm-jp/llm-jp-3-1.8b-instruct 113.1 (±33.9) 0.615 0.772 0.147
221 0.5103 (±0.0204/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct 441.6 (±144.2) 0.606 0.750 0.175
222 0.5085 (±0.0160/√100) 🟢 moneyforward/houou-instruction-7b-v1 105.9 (±41.0) 0.617 0.781 0.128
223 0.5080 (±0.0306/√100) 💬 stabilityai/japanese-stablelm-instruc... 111.3 (±58.3) 0.548 0.782 0.195
224 0.5073 (±0.0208/√100) 💬 Qwen/Qwen2-57B-A14B-Instruct 154.8 (±89.5) 0.615 0.734 0.173
225 0.5045 (±0.0208/√100) 🟢 Qwen/Qwen2-57B-A14B 106.7 (±22.5) 0.617 0.757 0.139
226 0.5041 (±0.0225/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 106.2 (±29.3) 0.579 0.778 0.155
227 0.5037 (±0.0264/√100) 💬 llm-jp/llm-jp-3-3.7b-instruct2 365.8 (±145.5) 0.590 0.746 0.175
228 0.5022 (±0.0221/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 95.0 (±36.2) 0.579 0.795 0.132
229 0.5013 (±0.0196/√100) 🟢 google/gemma-2-9b 107.3 (±26.0) 0.595 0.761 0.148
230 0.5013 (±0.0375/√100) 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 427.4 (±151.5) 0.579 0.723 0.202
231 0.5006 (±0.0476/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct3 223.2 (±122.4) 0.590 0.744 0.168
232 0.5002 (±0.0218/√100) 🟢 Qwen/Qwen-72B-Chat 223.0 (±258.3) 0.614 0.716 0.171
233 0.4995 (±0.0211/√100) 💬 Qwen/Qwen1.5-72B-Chat 119.3 (±58.1) 0.582 0.708 0.208
234 0.4988 (±0.0240/√100) 🟢 sbintuitions/sarashina2.2-0.5b 112.7 (±33.2) 0.614 0.758 0.124
235 0.4973 (±0.0236/√100) 🟢 pfnet/plamo-2-1b 112.6 (±37.4) 0.601 0.771 0.121
236 0.4970 (±0.0117/√100) 💬 Qwen/Qwen2.5-7B-Instruct 65.0 (±22.0) 0.535 0.858 0.098
237 0.4963 (±0.0189/√100) 🟢 Qwen/Qwen1.5-72B-Chat 128.1 (±77.7) 0.586 0.698 0.206
238 0.4959 (±0.0235/√100) 🟢 llm-jp/llm-jp-13b-v1.0 115.0 (±40.9) 0.576 0.756 0.156
239 0.4955 (±0.0602/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct2 194.1 (±123.5) 0.581 0.740 0.166
240 0.4953 (±0.0203/√100) 🟢 meta-llama/Llama-2-70b-hf 110.4 (±25.8) 0.596 0.745 0.145
241 0.4949 (±0.0177/√100) 💬 moneyforward/houou-instruction-7b-v1 180.5 (±66.6) 0.604 0.734 0.146
242 0.4931 (±0.0247/√100) 🟢 Rakuten/RakutenAI-7B-instruct 105.6 (±33.1) 0.598 0.750 0.132
243 0.4921 (±0.0219/√100) 🟢 Rakuten/RakutenAI-7B-chat 114.9 (±44.7) 0.592 0.760 0.124
244 0.4921 (±0.0285/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct 185.0 (±120.2) 0.585 0.752 0.140
245 0.4916 (±0.0201/√100) 🟢 moneyforward/houou-instruction-7b-v2 104.7 (±41.2) 0.588 0.770 0.116
246 0.4912 (±0.0399/√100) 💬 SakanaAI/TinySwallow-1.5B-Instruct 222.0 (±126.2) 0.594 0.735 0.145
247 0.4895 (±0.0440/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 268.1 (±133.1) 0.548 0.722 0.199
248 0.4872 (±0.0237/√100) 🟢 lightblue/karasu-7B 110.1 (±19.0) 0.586 0.739 0.137
249 0.4870 (±0.0215/√100) 🟢 Qwen/Qwen-72B 134.6 (±114.6) 0.593 0.715 0.152
250 0.4868 (±0.0163/√100) 💬 google/gemma-2-9b-it 47.6 (±14.6) 0.477 0.880 0.104
251 0.4863 (±0.1167/√100) 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge 93.4 (±55.0) 0.544 0.721 0.194
252 0.4862 (±0.0221/√100) 🟢 Qwen/Qwen2-57B-A14B-Instruct 116.9 (±82.5) 0.601 0.734 0.124
253 0.4857 (±0.0168/√100) 💬 moneyforward/houou-instruction-7b-v2 207.0 (±57.3) 0.591 0.719 0.147
254 0.4829 (±0.0211/√100) 🟢 Qwen/Qwen1.5-72B 136.2 (±85.6) 0.591 0.705 0.153
255 0.4827 (±0.0464/√100) 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... 269.1 (±131.5) 0.542 0.716 0.191
256 0.4762 (±0.0810/√100) 💬 stabilityai/japanese-stablelm-instruc... 126.2 (±67.4) 0.545 0.726 0.158
257 0.4746 (±0.0210/√100) 🟢 rinna/youri-7b-chat 102.1 (±16.4) 0.571 0.752 0.100
258 0.4744 (±0.0227/√100) 🟢 pfnet/plamo-13b 108.2 (±28.5) 0.558 0.749 0.116
259 0.4743 (±0.0987/√100) 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf 129.0 (±72.8) 0.535 0.725 0.163
260 0.4731 (±0.0270/√100) 🟢 mlx-community/plamo-2-1b 121.5 (±79.9) 0.576 0.738 0.105
261 0.4730 (±0.0166/√100) 🟢 Xwin-LM/Xwin-LM-13B-V0.2 109.7 (±27.4) 0.582 0.723 0.114
262 0.4723 (±0.0204/√100) 💬 Rakuten/RakutenAI-7B-chat 233.0 (±133.0) 0.565 0.734 0.118
263 0.4723 (±0.0808/√100) 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... 199.3 (±155.6) 0.563 0.699 0.154
264 0.4718 (±0.0262/√100) 🟢 mlx-community/plamo-2-1b-bf16 121.5 (±80.5) 0.574 0.739 0.103
265 0.4698 (±0.0200/√100) 🟢 Rakuten/RakutenAI-7B 105.4 (±25.6) 0.576 0.721 0.113
266 0.4692 (±0.0161/√100) 🟢 shisa-ai/shisa-v1-qwen2-7b 109.0 (±23.9) 0.563 0.712 0.133
267 0.4691 (±0.0264/√100) 🟢 sbintuitions/sarashina2.2-1b-instruct... 156.3 (±59.3) 0.595 0.638 0.174
268 0.4683 (±0.0211/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct3 402.8 (±140.7) 0.552 0.720 0.133
269 0.4674 (±0.0211/√100) 🟢 Qwen/Qwen2.5-7B 111.5 (±51.4) 0.563 0.707 0.132
270 0.4670 (±0.0202/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct2 400.7 (±146.8) 0.556 0.721 0.124
271 0.4661 (±0.0210/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 111.6 (±44.2) 0.536 0.756 0.106
272 0.4659 (±0.0438/√100) 💬 deepseek-ai/deepseek-llm-67b-chat 146.0 (±62.1) 0.555 0.703 0.139
273 0.4659 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-1.8b 105.0 (±16.9) 0.568 0.725 0.105
274 0.4648 (±0.1659/√100) 💬 cyberagent/calm2-7b-chat 124.7 (±95.9) 0.536 0.688 0.171
275 0.4622 (±0.0195/√100) 🟢 Qwen/Qwen-14B-Chat 135.5 (±84.3) 0.572 0.718 0.097
276 0.4619 (±0.0162/√100) 💬 lmsys/vicuna-13b-v1.5-16k 126.5 (±48.4) 0.574 0.715 0.097
277 0.4609 (±0.0113/√10) 🟢 google/gemma-2-2b-jpn-it 69.4 (±24.1) 0.509 0.805 0.069
278 0.4607 (±0.0165/√100) 🟢 SakanaAI/EvoLLM-JP-v1-7B 111.2 (±30.4) 0.579 0.708 0.095
279 0.4601 (±0.0184/√100) 🟢 shisa-ai/shisa-v1-llama3-8b 112.9 (±31.4) 0.557 0.703 0.120
280 0.4597 (±0.0268/√100) 🟢 CohereForAI/c4ai-command-r-v01 179.2 (±166.3) 0.590 0.592 0.197
281 0.4586 (±0.0141/√100) 🟢 google/gemma-2-2b-it 88.2 (±30.8) 0.536 0.761 0.079
282 0.4578 (±0.0210/√100) 🟢 llm-jp/llm-jp-3-980m-instruct2 112.3 (±46.7) 0.559 0.723 0.091
283 0.4570 (±0.0253/√100) 🟢 llm-jp/llm-jp-3-172b-alpha1 111.1 (±34.7) 0.530 0.715 0.126
284 0.4561 (±0.0202/√100) 🟢 pfnet/plamo-13b-instruct 144.0 (±147.7) 0.532 0.763 0.073
285 0.4559 (±0.0201/√100) 🟢 pfnet/plamo-13b-instruct-nc 156.0 (±183.1) 0.523 0.768 0.077
286 0.4558 (±0.0156/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 75.3 (±26.6) 0.488 0.804 0.076
287 0.4543 (±0.0217/√100) 🟢 rinna/youri-7b-instruction 96.2 (±29.5) 0.530 0.743 0.090
288 0.4535 (±0.0348/√100) 💬 Rakuten/RakutenAI-7B-instruct 128.6 (±83.2) 0.527 0.726 0.108
289 0.4535 (±0.0183/√100) 🟢 THUDM/glm-4-9b 110.3 (±36.9) 0.554 0.689 0.118
290 0.4527 (±0.0146/√100) 🟢 lmsys/vicuna-13b-v1.5-16k 107.9 (±25.9) 0.576 0.708 0.075
291 0.4525 (±0.0187/√100) 💬 llm-jp/llm-jp-3-1.8b-instruct 435.4 (±148.4) 0.553 0.706 0.098
292 0.4516 (±0.0276/√100) 💬 sbintuitions/sarashina2.2-1b-instruct... 337.2 (±153.2) 0.573 0.622 0.159
293 0.4504 (±0.0224/√100) 🟢 rinna/nekomata-7b-instruction 96.4 (±23.7) 0.528 0.734 0.089
294 0.4486 (±0.0161/√100) 💬 Qwen/Qwen2-7B-Instruct 163.6 (±61.4) 0.547 0.688 0.111
295 0.4484 (±0.0191/√100) 💬 SakanaAI/EvoLLM-JP-v1-7B 123.9 (±68.1) 0.545 0.706 0.094
296 0.4478 (±0.0245/√100) 💬 sbintuitions/sarashina2.2-1b-instruct... 399.9 (±168.4) 0.568 0.626 0.149
297 0.4477 (±0.0205/√100) 🟢 rinna/llama-3-youko-70b-instruct 130.7 (±95.3) 0.527 0.670 0.146
298 0.4459 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-980m-instruct3 116.0 (±33.5) 0.545 0.707 0.086
299 0.4426 (±0.0204/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... 111.1 (±28.2) 0.544 0.687 0.097
300 0.4409 (±0.1064/√100) 💬 lightblue/karasu-7B 138.1 (±92.9) 0.512 0.679 0.131
301 0.4404 (±0.0146/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 75.9 (±22.7) 0.493 0.773 0.056
302 0.4387 (±0.0655/√100) 💬 Qwen/Qwen-72B-Chat 117.7 (±137.1) 0.541 0.632 0.143
303 0.4385 (±0.0285/√100) 💬 rinna/youri-7b-chat 95.4 (±41.1) 0.500 0.733 0.083
304 0.4377 (±0.0107/√100) 🟢 google/gemma-1.1-7b-it 86.8 (±21.4) 0.509 0.732 0.072
305 0.4374 (±0.0217/√100) 🟢 Qwen/Qwen1.5-32B-Chat 127.0 (±57.0) 0.538 0.642 0.133
306 0.4368 (±0.0575/√100) 💬 llm-jp/llm-jp-3-980m-instruct2 195.9 (±127.8) 0.529 0.686 0.096
307 0.4336 (±0.0168/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.1 (±17.2) 0.539 0.689 0.073
308 0.4335 (±0.0221/√100) 🟢 Qwen/Qwen-14B 118.1 (±71.6) 0.530 0.675 0.096
309 0.4332 (±0.0164/√100) 🟢 Qwen/Qwen2-7B-Instruct 119.1 (±45.7) 0.531 0.670 0.098
310 0.4330 (±0.0149/√100) 💬 google/gemma-2-2b-it 56.0 (±27.8) 0.445 0.788 0.066
311 0.4320 (±0.0171/√100) 🟢 Qwen/Qwen2-7B 109.1 (±40.1) 0.532 0.671 0.093
312 0.4296 (±0.0322/√100) 💬 Qwen/Qwen-14B-Chat 159.0 (±69.7) 0.522 0.675 0.092
313 0.4295 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct 111.5 (±31.4) 0.530 0.676 0.083
314 0.4292 (±0.0181/√100) 💬 Xwin-LM/Xwin-LM-13B-V0.2 240.7 (±48.4) 0.533 0.670 0.085
315 0.4282 (±0.0193/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 110.8 (±26.0) 0.518 0.688 0.078
316 0.4272 (±0.0273/√100) 🟢 mistralai/Mistral-Nemo-Instruct-2407 155.8 (±132.8) 0.548 0.611 0.122
317 0.4265 (±0.0115/√100) 💬 google/gemma-1.1-7b-it 78.7 (±28.4) 0.475 0.739 0.066
318 0.4256 (±0.0270/√100) 🟢 rinna/japanese-gpt-neox-3.6b 129.8 (±73.4) 0.485 0.685 0.106
319 0.4228 (±0.0185/√100) 🟢 stabilityai/japanese-stablelm-base-ja... 110.4 (±28.6) 0.528 0.668 0.073
320 0.4222 (±0.0138/√100) 🟢 Xwin-LM/Xwin-LM-7B-V0.2 110.6 (±29.3) 0.520 0.677 0.070
321 0.4220 (±0.0185/√100) 🟢 lmsys/vicuna-7b-v1.5-16k 111.8 (±31.8) 0.522 0.670 0.074
322 0.4207 (±0.0189/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 112.8 (±27.0) 0.507 0.683 0.072
323 0.4201 (±0.0177/√100) 💬 lmsys/vicuna-7b-v1.5-16k 128.1 (±52.5) 0.514 0.668 0.078
324 0.4164 (±0.0244/√100) 🟢 google/gemma-7b 135.5 (±132.3) 0.533 0.631 0.085
325 0.4150 (±0.0212/√100) 💬 Qwen/Qwen1.5-32B-Chat 125.7 (±250.5) 0.496 0.620 0.130
326 0.4149 (±0.0375/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 186.6 (±108.4) 0.469 0.685 0.090
327 0.4144 (±0.0149/√100) 💬 01-ai/Yi-1.5-34B-Chat 170.6 (±47.1) 0.514 0.628 0.101
328 0.4140 (±0.0208/√100) 🟢 meta-llama/Meta-Llama-3-8B-Instruct 116.8 (±44.3) 0.523 0.637 0.082
329 0.4125 (±0.0303/√100) 💬 CohereForAI/c4ai-command-r-v01 137.7 (±324.6) 0.519 0.562 0.157
330 0.4122 (±0.0199/√100) 🟢 rinna/bilingual-gpt-neox-4b 121.0 (±43.6) 0.485 0.660 0.092
331 0.4097 (±0.0187/√100) 🟢 meta-llama/Meta-Llama-3.1-8B 108.7 (±35.4) 0.512 0.650 0.068
332 0.4087 (±0.0201/√100) 🟢 meta-llama/Llama-2-70b-chat-hf 161.3 (±140.8) 0.519 0.608 0.099
333 0.4087 (±0.0146/√100) 🟢 microsoft/Phi-3-small-8k-instruct 109.1 (±24.1) 0.514 0.644 0.068
334 0.4080 (±0.0206/√100) 💬 llm-jp/llm-jp-3-980m-instruct2 430.8 (±147.5) 0.505 0.653 0.067
335 0.4076 (±0.0142/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... 109.0 (±32.9) 0.503 0.644 0.076
336 0.4074 (±0.0207/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... 156.6 (±65.9) 0.490 0.646 0.086
337 0.4073 (±0.0175/√100) 🟢 stabilityai/japanese-stablelm-instruc... 110.0 (±26.5) 0.490 0.663 0.070
338 0.4058 (±0.0295/√100) 💬 rinna/youri-7b-instruction 97.0 (±57.0) 0.439 0.713 0.065
339 0.4050 (±0.0191/√100) 🟢 mistralai/Mixtral-8x22B-v0.1 115.6 (±55.4) 0.517 0.615 0.084
340 0.4048 (±0.0175/√100) 🟢 meta-llama/Meta-Llama-3-8B 109.0 (±19.8) 0.505 0.641 0.068
341 0.4048 (±0.0263/√20) 💬 ntt/tsuzumi-7b 172.0 (±90.8) 0.491 0.644 0.080
342 0.4045 (±0.0186/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 133.1 (±57.4) 0.475 0.678 0.061
343 0.4044 (±0.0219/√100) 💬 sbintuitions/sarashina2.2-0.5b-instru... 217.6 (±82.9) 0.532 0.590 0.091
344 0.4042 (±0.0131/√100) 🟢 microsoft/Orca-2-13b 115.5 (±42.6) 0.510 0.630 0.073
345 0.4041 (±0.0218/√100) 💬 meta-llama/Meta-Llama-3-8B-Instruct 131.4 (±88.3) 0.508 0.614 0.090
346 0.4035 (±0.0151/√100) 🟢 SakanaAI/EvoLLM-JP-A-v1-7B 110.4 (±31.3) 0.508 0.633 0.069
347 0.4033 (±0.0164/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... 107.2 (±28.5) 0.495 0.643 0.072
348 0.4032 (±0.0237/√100) 🟢 Qwen/Qwen1.5-32B 150.3 (±104.8) 0.505 0.605 0.100
349 0.4024 (±0.0187/√100) 🟢 01-ai/Yi-1.5-34B 109.9 (±28.2) 0.493 0.631 0.083
350 0.4014 (±0.0195/√100) 🟢 sbintuitions/sarashina2.2-0.5b-instru... 160.5 (±57.9) 0.532 0.581 0.091
351 0.4013 (±0.0162/√100) 🟢 Qwen/Qwen2.5-3B 113.3 (±35.0) 0.504 0.628 0.072
352 0.4011 (±0.0236/√100) 🟢 cyberagent/open-calm-7b 143.8 (±97.0) 0.472 0.641 0.091
353 0.4006 (±0.0166/√100) 💬 microsoft/Phi-3-small-8k-instruct 189.7 (±84.1) 0.500 0.630 0.073
354 0.4001 (±0.0199/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 117.6 (±48.9) 0.464 0.684 0.052
355 0.3985 (±0.0161/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b 138.4 (±51.8) 0.493 0.634 0.069
356 0.3960 (±0.0199/√100) 🟢 line-corporation/japanese-large-lm-1.7b 179.2 (±174.5) 0.474 0.650 0.065
357 0.3953 (±0.0207/√100) 💬 llm-jp/llm-jp-3-980m-instruct3 404.7 (±156.1) 0.482 0.637 0.067
358 0.3949 (±0.0193/√100) 💬 meta-llama/Meta-Llama-3.1-8B-Instruct 216.6 (±345.2) 0.487 0.624 0.074
359 0.3948 (±0.0190/√100) 💬 Qwen/Qwen1.5-14B-Chat 127.9 (±50.6) 0.500 0.604 0.080
360 0.3946 (±0.0201/√100) 🟢 Qwen/Qwen1.5-14B 130.9 (±67.8) 0.509 0.609 0.066
361 0.3945 (±0.0214/√100) 💬 sbintuitions/sarashina2.2-0.5b-instru... 435.0 (±169.2) 0.517 0.592 0.074
362 0.3934 (±0.0201/√100) 🟢 stabilityai/japanese-stablelm-instruc... 107.8 (±38.0) 0.466 0.648 0.066
363 0.3914 (±0.0172/√100) 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 95.1 (±25.2) 0.488 0.636 0.050
364 0.3863 (±0.0160/√100) 🟢 Qwen/Qwen1.5-14B-Chat 131.4 (±55.8) 0.491 0.593 0.075
365 0.3837 (±0.0188/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 117.4 (±42.4) 0.462 0.649 0.041
366 0.3828 (±0.0182/√100) 🟢 google/gemma-2-2b 112.5 (±25.6) 0.486 0.616 0.046
367 0.3823 (±0.0645/√100) 💬 mistralai/Mistral-Nemo-Instruct-2407 157.9 (±140.3) 0.484 0.563 0.100
368 0.3822 (±0.0647/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 97.6 (±76.2) 0.397 0.664 0.086
369 0.3819 (±0.0265/√100) 🟢 google/gemma-2-27b 214.2 (±183.3) 0.450 0.608 0.087
370 0.3804 (±0.0161/√100) 🟢 Qwen/Qwen-7B-Chat 140.8 (±65.1) 0.485 0.612 0.045
371 0.3803 (±0.0249/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct 136.4 (±70.7) 0.452 0.619 0.070
372 0.3777 (±0.0196/√100) 🟢 llm-jp/llm-jp-3-980m 101.6 (±20.5) 0.460 0.631 0.043
373 0.3772 (±0.0162/√100) 💬 microsoft/Phi-3-small-128k-instruct 199.7 (±111.9) 0.473 0.590 0.069
374 0.3760 (±0.0236/√100) 🟢 cyberagent/open-calm-3b 123.2 (±79.0) 0.442 0.624 0.062
375 0.3759 (±0.0149/√100) 🟢 lmsys/longchat-7b-v1.5-32k 116.9 (±31.6) 0.474 0.609 0.045
376 0.3740 (±0.0164/√100) 🟢 meta-llama/Llama-2-13b-hf 108.5 (±21.8) 0.474 0.603 0.045
377 0.3737 (±0.0197/√100) 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct 204.5 (±303.4) 0.478 0.589 0.055
378 0.3728 (±0.0210/√100) 🟢 llm-jp/llm-jp-3-440m-instruct2 110.0 (±37.1) 0.455 0.625 0.040
379 0.3720 (±0.0622/√100) 💬 Xwin-LM/Xwin-LM-7B-V0.2 205.3 (±79.1) 0.466 0.590 0.060
380 0.3720 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast 177.5 (±147.2) 0.458 0.598 0.061
381 0.3699 (±0.0345/√100) 💬 Qwen/Qwen-7B-Chat 182.9 (±110.3) 0.468 0.600 0.042
382 0.3694 (±0.0103/√100) 🟢 google/gemma-7b-it 89.7 (±21.6) 0.446 0.640 0.022
383 0.3685 (±0.0173/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b 140.0 (±52.8) 0.462 0.596 0.047
384 0.3673 (±0.0089/√100) 💬 google/gemma-7b-it 110.0 (±47.6) 0.448 0.633 0.020
385 0.3655 (±0.0116/√100) 🟢 deepseek-ai/deepseek-llm-7b-chat 113.9 (±24.7) 0.474 0.579 0.043
386 0.3642 (±0.0165/√100) 🟢 llm-jp/llm-jp-1.3b-v1.0 134.0 (±62.6) 0.437 0.612 0.044
387 0.3637 (±0.0223/√100) 🟢 cyberagent/open-calm-large 122.3 (±73.9) 0.424 0.611 0.056
388 0.3637 (±0.0152/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast 168.0 (±77.4) 0.452 0.587 0.052
389 0.3632 (±0.0237/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... 178.6 (±113.6) 0.443 0.582 0.064
390 0.3630 (±0.0234/√100) 🟢 llm-jp/llm-jp-3-440m-instruct3 115.2 (±40.1) 0.442 0.605 0.042
391 0.3628 (±0.0145/√100) 🟢 Qwen/Qwen-7B 117.3 (±39.0) 0.468 0.582 0.039
392 0.3611 (±0.0544/√100) 💬 llm-jp/llm-jp-3-440m-instruct2 244.7 (±154.0) 0.451 0.588 0.044
393 0.3589 (±0.0394/√100) 💬 llm-jp/llm-jp-3-440m-instruct3 286.6 (±158.5) 0.448 0.582 0.047
394 0.3554 (±0.0178/√100) 🟢 meta-llama/Llama-2-7b-chat-hf 139.3 (±93.1) 0.464 0.570 0.031
395 0.3545 (±0.0445/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 48.8 (±50.1) 0.283 0.723 0.058
396 0.3543 (±0.0439/√100) 💬 lmsys/longchat-7b-v1.5-32k 160.1 (±73.5) 0.448 0.572 0.043
397 0.3538 (±0.0175/√100) 🟢 01-ai/Yi-1.5-9B 113.0 (±29.4) 0.457 0.555 0.050
398 0.3531 (±0.0159/√100) 🟢 mistralai/Mixtral-8x7B-v0.1 94.3 (±20.8) 0.450 0.573 0.037
399 0.3514 (±0.0102/√100) 🟢 google/gemma-1.1-2b-it 80.4 (±21.6) 0.404 0.625 0.025
400 0.3495 (±0.0268/√100) 🟢 cyberagent/open-calm-1b 141.3 (±110.0) 0.412 0.578 0.059
401 0.3477 (±0.0244/√100) 💬 llm-jp/llm-jp-3-440m-instruct2 432.3 (±161.3) 0.432 0.568 0.043
402 0.3471 (±0.0131/√100) 🟢 microsoft/Orca-2-7b 131.1 (±70.7) 0.447 0.555 0.039
403 0.3465 (±0.0202/√100) 💬 deepseek-ai/deepseek-llm-7b-chat 167.2 (±76.5) 0.435 0.562 0.042
404 0.3463 (±0.0178/√100) 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 147.1 (±111.8) 0.448 0.548 0.043
405 0.3449 (±0.0986/√100) 💬 stabilityai/japanese-stablelm-instruc... 109.4 (±66.2) 0.397 0.585 0.053
406 0.3440 (±0.0978/√100) 💬 stabilityai/japanese-stablelm-3b-4e1t... 127.8 (±80.5) 0.401 0.576 0.055
407 0.3436 (±0.0126/√100) 💬 01-ai/Yi-1.5-9B-Chat 143.6 (±60.1) 0.438 0.540 0.053
408 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038
409 0.3408 (±0.0225/√100) 🟢 anthracite-org/magnum-32b-v2 191.9 (±223.2) 0.442 0.507 0.073
410 0.3393 (±0.0225/√100) 🟢 stockmark/gpt-neox-japanese-1.4b 92.2 (±63.7) 0.351 0.641 0.025
411 0.3338 (±0.0493/√100) 🟢 SakanaAI/TinySwallow-1.5B 142.2 (±109.9) 0.415 0.534 0.052
412 0.3322 (±0.0151/√100) 🟢 Qwen/Qwen1.5-7B-Chat 127.7 (±117.0) 0.431 0.520 0.045
413 0.3320 (±0.0170/√100) 🟢 Qwen/Qwen2.5-1.5B 117.7 (±41.6) 0.431 0.533 0.032
414 0.3315 (±0.0203/√100) 🟢 Qwen/Qwen1.5-7B 141.8 (±126.5) 0.445 0.504 0.046
415 0.3313 (±0.0115/√100) 🟢 google/gemma-2b-it 85.9 (±24.7) 0.393 0.577 0.024
416 0.3293 (±0.0252/√100) 💬 Qwen/Qwen1.5-7B-Chat 195.7 (±113.1) 0.429 0.503 0.056
417 0.3276 (±0.0709/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... 134.0 (±98.8) 0.395 0.543 0.045
418 0.3272 (±0.0101/√100) 💬 01-ai/Yi-1.5-6B-Chat 194.4 (±75.0) 0.426 0.530 0.025
419 0.3209 (±0.0175/√100) 💬 llm-jp/llm-jp-3-440m-instruct3 375.9 (±168.6) 0.391 0.533 0.039
420 0.3199 (±0.0181/√100) 🟢 llm-jp/llm-jp-3-440m 110.0 (±33.4) 0.390 0.543 0.027
421 0.3187 (±0.0142/√100) 🟢 Qwen/Qwen2-1.5B-Instruct 131.4 (±46.7) 0.421 0.513 0.022
422 0.3172 (±0.0150/√100) 🟢 Qwen/Qwen2-1.5B 120.9 (±30.7) 0.422 0.511 0.019
423 0.3161 (±0.0119/√100) 🟢 deepseek-ai/deepseek-llm-7b-base 113.7 (±21.6) 0.424 0.501 0.024
424 0.3147 (±0.0175/√100) 💬 Qwen/Qwen2-1.5B-Instruct 180.7 (±101.0) 0.408 0.511 0.025
425 0.3078 (±0.0195/√100) 🟢 cyberagent/open-calm-medium 117.3 (±59.4) 0.363 0.537 0.024
426 0.3058 (±0.1106/√100) 💬 rinna/nekomata-7b-instruction 61.2 (±57.0) 0.307 0.567 0.043
427 0.3053 (±0.0177/√100) 🟢 google/gemma-2b 151.5 (±113.6) 0.410 0.480 0.026
428 0.3050 (±0.0190/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B 146.4 (±90.3) 0.412 0.468 0.035
429 0.2993 (±0.0095/√100) 🟢 01-ai/Yi-1.5-6B-Chat 133.3 (±46.2) 0.394 0.481 0.022
430 0.2993 (±0.0107/√100) 🟢 tiiuae/falcon-11B 121.6 (±31.5) 0.398 0.483 0.016
431 0.2957 (±0.0641/√100) 💬 meta-llama/Llama-2-13b-chat-hf 305.2 (±299.7) 0.402 0.453 0.032
432 0.2953 (±0.0442/√100) 🟢 augmxnt/shisa-base-7b-v1 200.4 (±160.3) 0.378 0.478 0.030
433 0.2924 (±0.0506/√100) 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat 245.1 (±209.1) 0.381 0.453 0.043
434 0.2914 (±0.0133/√100) 🟢 mistralai/Mistral-7B-v0.1 117.4 (±40.4) 0.402 0.454 0.018
435 0.2907 (±0.0175/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat 149.8 (±91.0) 0.388 0.448 0.036
436 0.2900 (±0.0226/√100) 💬 llm-jp/llm-jp-3-150m-instruct2 421.0 (±181.6) 0.365 0.485 0.020
437 0.2869 (±0.0214/√100) 🟢 llm-jp/llm-jp-3-150m-instruct2 108.9 (±41.1) 0.342 0.498 0.021
438 0.2853 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B-Chat 127.8 (±71.2) 0.395 0.441 0.019
439 0.2809 (±0.0133/√100) 🟢 Qwen/Qwen1.5-1.8B-Chat 178.3 (±92.0) 0.381 0.445 0.017
440 0.2799 (±0.0233/√100) 🟢 llm-jp/llm-jp-3-150m-instruct3 121.5 (±43.8) 0.340 0.478 0.022
441 0.2785 (±0.0179/√100) 💬 llm-jp/llm-jp-3-150m-instruct3 412.9 (±178.5) 0.344 0.470 0.021
442 0.2770 (±0.0131/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.2 146.2 (±70.1) 0.387 0.419 0.024
443 0.2769 (±0.0324/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 16.9 (±24.6) 0.125 0.693 0.013
444 0.2769 (±0.1029/√100) 💬 stabilityai/japanese-stablelm-instruc... 117.0 (±115.0) 0.307 0.489 0.035
445 0.2666 (±0.0241/√100) 🟢 deepseek-ai/deepseek-llm-67b-chat 140.2 (±83.0) 0.351 0.440 0.009
446 0.2661 (±0.0128/√100) 🟢 Qwen/Qwen1.5-1.8B 129.7 (±65.7) 0.360 0.424 0.014
447 0.2631 (±0.0168/√100) 🟢 Qwen/Qwen2.5-0.5B 126.3 (±53.1) 0.355 0.422 0.013
448 0.2613 (±0.0136/√100) 🟢 Qwen/Qwen2-0.5B-Instruct 176.8 (±98.9) 0.351 0.426 0.007
449 0.2604 (±0.0148/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.1 139.8 (±101.3) 0.367 0.400 0.014
450 0.2598 (±0.0129/√100) 🟢 Qwen/Qwen2-0.5B 122.7 (±43.5) 0.350 0.420 0.009
451 0.2581 (±0.0196/√100) 🟢 cyberagent/open-calm-small 119.1 (±54.1) 0.310 0.460 0.004
452 0.2555 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B 149.2 (±76.6) 0.363 0.388 0.015
453 0.2543 (±0.0266/√100) 🟢 mosaicml/mpt-30b-chat 121.3 (±46.4) 0.327 0.428 0.008
454 0.2446 (±0.0204/√100) 🟢 llm-jp/llm-jp-3-150m 107.6 (±41.1) 0.297 0.427 0.009
455 0.2442 (±0.0589/√100) 💬 llm-jp/llm-jp-3-150m-instruct2 256.2 (±198.3) 0.304 0.410 0.019
456 0.2414 (±0.0281/√100) 💬 Qwen/Qwen1.5-1.8B-Chat 480.0 (±210.3) 0.329 0.392 0.003
457 0.2394 (±0.0745/√100) 💬 Qwen/Qwen1.5-4B-Chat 105.3 (±104.1) 0.307 0.390 0.021
458 0.2317 (±0.0455/√100) 💬 mistralai/Mistral-7B-Instruct-v0.1 202.3 (±153.9) 0.320 0.362 0.012
459 0.2231 (±0.0166/√100) 💬 mistralai/Mistral-7B-Instruct-v0.2 261.2 (±166.3) 0.316 0.334 0.019
460 0.2182 (±0.0152/√100) 🟢 microsoft/phi-1 47.6 (±34.3) 0.234 0.420 0.000
461 0.2177 (±0.0110/√100) 🟢 Qwen/Qwen1.5-0.5B-Chat 143.4 (±52.1) 0.317 0.327 0.009
462 0.2169 (±0.0561/√100) 💬 Qwen/Qwen2-0.5B-Instruct 129.5 (±114.3) 0.265 0.379 0.006
463 0.2169 (±0.0218/√100) 🟢 mosaicml/mpt-30b-instruct 109.8 (±36.1) 0.274 0.370 0.008
464 0.2146 (±0.0151/√100) 🟢 microsoft/phi-2 78.0 (±31.4) 0.287 0.356 0.001
465 0.2061 (±0.0820/√100) 💬 meta-llama/Llama-2-70b-chat-hf 523.3 (±444.5) 0.271 0.303 0.045
466 0.2040 (±0.0152/√100) 🟢 Qwen/Qwen1.5-0.5B 138.6 (±55.9) 0.296 0.314 0.003
467 0.2038 (±0.0538/√100) 🟢 mosaicml/mpt-30b 236.5 (±433.3) 0.271 0.334 0.007
468 0.2004 (±0.0736/√100) 💬 llm-jp/llm-jp-3-150m-instruct3 296.9 (±240.0) 0.251 0.335 0.015
469 0.1885 (±0.0194/√100) 🟢 microsoft/phi-1_5 77.5 (±33.6) 0.258 0.306 0.001
470 0.1833 (±0.0406/√100) 💬 google/gemma-1.1-2b-it 32.6 (±26.7) 0.171 0.376 0.003
471 0.1765 (±0.0439/√100) 💬 Qwen/Qwen1.5-0.5B-Chat 214.3 (±172.6) 0.251 0.276 0.002
472 0.1687 (±0.0172/√100) 🟢 upstage/SOLAR-10.7B-v1.0 171.0 (±87.1) 0.265 0.237 0.004
473 0.1544 (±0.0132/√100) 🟢 01-ai/Yi-1.5-34B-Chat 730.0 (±533.6) 0.201 0.256 0.006
474 0.1475 (±0.0826/√100) 💬 mosaicml/mpt-30b-chat 112.2 (±112.4) 0.182 0.254 0.007
475 0.1241 (±0.0558/√100) 💬 google/gemma-2b-it 24.1 (±24.6) 0.115 0.257 0.000
476 0.1226 (±0.0240/√100) 🟢 Deci/DeciLM-7B 174.0 (±165.5) 0.190 0.174 0.003
477 0.1160 (±0.0081/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 212.1 (±148.9) 0.153 0.195 0.000
478 0.1009 (±0.0846/√100) 💬 meta-llama/Llama-2-7b-chat-hf 241.5 (±336.2) 0.136 0.158 0.009
479 0.1004 (±0.0094/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 123.1 (±128.8) 0.119 0.182 0.000
480 0.0987 (±0.0145/√100) 🟢 deepseek-ai/deepseek-llm-67b-base 154.2 (±77.3) 0.174 0.121 0.000
481 0.0982 (±0.1596/√100) 💬 rinna/nekomata-14b-instruction 16.0 (±38.1) 0.115 0.141 0.039
482 0.0955 (±0.0102/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 129.5 (±141.0) 0.116 0.170 0.000
483 0.0939 (±0.0064/√100) 🟢 sbintuitions/tiny-lm-chat 250.2 (±275.6) 0.133 0.149 0.000
484 0.0936 (±0.0082/√100) 💬 sbintuitions/tiny-lm-chat 276.7 (±209.6) 0.135 0.145 0.000
485 0.0921 (±0.0058/√100) 🟢 sbintuitions/tiny-lm 471.9 (±199.0) 0.135 0.142 0.000
486 0.0880 (±0.0334/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 134.0 (±144.7) 0.105 0.159 0.000
487 0.0762 (±0.0033/√100) 🟢 line-corporation/japanese-large-lm-3.6b 1066.6 (±31.6) 0.125 0.103 0.000
488 0.0760 (±0.0032/√100) 🟢 line-corporation/japanese-large-lm-3.... 1066.4 (±31.8) 0.125 0.103 0.000
489 0.0758 (±0.0034/√100) 💬 line-corporation/japanese-large-lm-3.... 1067.2 (±31.8) 0.125 0.102 0.000
490 0.0673 (±0.0085/√100) 🟢 moneyforward/houou-instruction-7b-v3 143.2 (±112.2) 0.098 0.104 0.000
491 0.0625 (±0.0169/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 31.6 (±10.3) 0.088 0.099 0.000
492 0.0429 (±0.0440/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 31.7 (±54.7) 0.045 0.084 0.000
493 0.0406 (±0.0028/√100) 🟢 microsoft/Phi-3-small-128k-instruct 268.1 (±123.4) 0.083 0.039 0.000
494 0.0337 (±0.0026/√100) 🟢 augmxnt/shisa-7b-v1 590.7 (±238.2) 0.076 0.025 0.000
495 0.0284 (±0.0012/√100) 🟢 lightblue/karasu-7B-chat-plus 285.1 (±53.8) 0.080 0.005 0.000
496 0.0225 (±0.0702/√100) 💬 SakanaAI/EvoLLM-JP-A-v1-7B 5.9 (±27.6) 0.026 0.037 0.005
497 0.0180 (±0.0039/√100) 🟢 mistralai/Mistral-Nemo-Base-2407 607.5 (±344.5) 0.039 0.015 0.000
498 0.0047 (±0.0024/√100) 🟢 ai-forever/mGPT-13B 321.1 (±266.7) 0.008 0.006 0.000
499 0.0022 (±0.0006/√100) 🟢 lightblue/qarasu-14B-chat-plus-unleashed 937.5 (±557.0) 0.004 0.002 0.000
500 0.0019 (±0.0002/√100) 🟢 01-ai/Yi-1.5-9B-Chat 1440.0 (±51.9) 0.005 0.001 0.000
501 0.0018 (±0.0004/√100) 🟢 CohereForAI/aya-23-8B 1676.6 (±351.0) 0.004 0.002 0.000
502 0.0006 (±0.0002/√100) 🟢 meta-llama/Llama-2-13b-chat-hf 1523.9 (±43.5) 0.001 0.001 0.000
503 0.0000 (±0.0000/√100) 🟢 01-ai/Yi-1.5-6B 0.0 (±0.0) 0.000 0.000 0.000
504 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-1.1B 0.0 (±0.0) 0.000 0.000 0.000
505 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat-plus-unleashed 0.0 (±0.0) 0.000 0.000 0.000
506 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat 0.0 (±0.0) 0.000 0.000 0.000
507 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-japanese 300.0 (±0.0) 0.000 0.000 0.000
508 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-multilingual 300.0 (±0.0) 0.000 0.000 0.000

FAQ

What is the difference between the modes?

pfgen-bench provides three types of templates: completion, qa, and chat.

  • completion: No instruction is provided. It consists solely of question-answer pairs.
  • qa: An instruction is included at the beginning of the user message.
  • chat: An instruction is placed in a system message.

Should we control the temperature?

pfgen-bench recommends setting the temperature to 1.0.

Some tasks (e.g., generating dice rolls) require a temperature of 1.0, and setting a lower temperature often leads to unnatural repetition.
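
For reference, here is a minimal sketch of sampling at temperature 1.0 with the standard transformers API; this is an illustration, not the benchmark's own runner:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-150m")
model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-3-150m")

inputs = tokenizer("Q: ...\nA:", return_tensors="pt")  # placeholder prompt
outputs = model.generate(
    **inputs,
    do_sample=True,      # sample instead of greedy decoding
    temperature=1.0,     # the recommended setting
    max_new_tokens=120,  # roughly enough for a ~100-character answer
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))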

Citation

If you use this repository, please cite the following papers:

@preprint{Imos2024-pre-pfgen,
  title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
  author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
  doi={10.51094/jxiv.1008},
  year={2024}
}
@preprint{Imos2025-judge-free,
  title={{A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis}},
  author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
  year={2025},
  eprint={2502.09316},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.09316},
  doi={10.48550/arXiv.2502.09316}
}

Or cite this repository directly:

@misc{imajo2024-pfgen,
  title={{Preferred Generation Benchmark}},
  author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
  year={2024},
  url={https://github.com/pfnet-research/pfgen-bench}
}