# Overview

This repository contains a lightweight library for evaluating language models.
-We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models (starting with `gpt-4-turbo-2024-04-09`).
+We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models (starting with `gpt-4-turbo-2024-04-09` and `gpt-4o`).
Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries.
Some use few-shot prompts or role-playing prompts ("You are an expert software programmer...").
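To make that sensitivity concrete, here is a minimal sketch of the difference, written against the OpenAI chat completions API. It is a hypothetical illustration, not code from this library; the model name, question, and `ask` helper are all placeholder assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = "Write a Python function that reverses a string."  # placeholder eval question

def ask(messages: list[dict]) -> str:
    # One sampling call per prompt formulation; temperature 0 for repeatability.
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

# Zero-shot: just the bare question.
zero_shot = ask([{"role": "user", "content": QUESTION}])

# Role-playing: the same question behind an expert persona.
role_play = ask([
    {"role": "system", "content": "You are an expert software programmer."},
    {"role": "user", "content": QUESTION},
])
```

Everything except the system message is identical, yet formulation changes like this can move measured scores, which is why the table below reports each prompt variant separately.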
@@ -64,28 +64,38 @@ This will launch evaluations through the OpenAI API.
## Benchmark Results
-| Model | Prompt |DROP(f1)| GPQA% | MATH% | MGSM% | MMLU% |HumanEval% |
-|:-----------------------------:|:-------------:|:------:|:-------:|:-------:|:-------:|:-------:|:---------:|
-| GPT4s | | | | | | | |
-| gpt-4-turbo-2024-04-09 | chatgpt[^1] | 85.4 | 49.1 | 72.2 | 88.6 | 86.5 | 87.6 |
-| gpt-4-turbo-2024-04-09 | assistant[^2] | 86.0 | 49.3 | 73.4 | 89.6 | 86.7 | 88.2 |
-| gpt-4-1106(-vision)-preview | chatgpt | 81.3 | 42.1 | 64.1 | 86.5 | 84.6 | 82.2 |
-| gpt-4-1106(-vision)-preview | assistant | 83.2 | 42.5 | 64.3 | 87.1 | 84.7 | 83.7 |
-| gpt-4-0125-preview | chatgpt | 83.4 | 39.7 | 64.2 | 83.7 | 84.8 | 88.2 |
-| gpt-4-0125-preview | assistant | 81.5 | 41.4 | 64.5 | 85.1 | 85.4 | 86.6 |
-| REFERENCE | | | | | | |
-| Claude-3-Opus (rerun w/ api) | empty[^3] | 79.0 | 49.7 | 63.2 | 89.7 | 84.1 | 84.8 |
-| Claude-3-Opus (rerun w/ api) | lmsys[^4] | 77.1 | 50.7 | 63.8 | 89.2 | 84.2 | 82.9 |
-| Claude-3-Opus (report[^5]) | unknown | 83.1 | 50.4 | 60.1 | 90.7 | 86.8 | 84.9 |
-| Gemini-Ultra-1.0 (report[^6]) | unknown | 82.4 | n/a | 53.2 | 79.0 | 83.7 | 74.4 |
-| Gemini-Pro-1.5 (report[^6]) | unknown | 78.9 | n/a | 58.5 | 88.7 | 81.9 | 71.9 |
+| Model | Prompt | MMLU | GPQA | MATH | HumanEval | MGSM | DROP<br>(F1, 3-shot) |
+|:----------------------------|:-------------:|:------:|:-------:|:------:|:--------:|:------:|:------:|
+| OPENAI GPT4s | | | | | | | |
+| gpt-4o | chatgpt[^1] | **`88.7`** | **`53.6`** | **`76.6`** | 90.2 | 90.5 | 83.4 |
+| gpt-4o | assistant[^2] | 87.2 | 49.9 | **`76.6`** | **`91.0`** | 89.9 | 83.7 |
+| gpt-4-turbo-2024-04-09 | chatgpt | 86.5 | 49.1 | 72.2 | 87.6 | 88.6 | 85.4 |
+| gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 | **`86.0`** |
+| gpt-4-1106(-vision)-preview | chatgpt | 84.6 | 42.1 | 64.1 | 82.2 | 86.5 | 81.3 |
+| gpt-4-1106(-vision)-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 |
+| gpt-4-0125-preview | chatgpt | 84.8 | 39.7 | 64.2 | 88.2 | 83.7 | 83.4 |
+| gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 |
+| REFERENCE-RERUN | | | | | | | |
+| Claude-3-Opus (rerun w/ api) | empty[^3] | 84.1 | 49.7 | 63.2 | 84.8 | 89.7 | 79.0 |
+| Claude-3-Opus (rerun w/ api) | lmsys[^4] | 84.2 | 50.7 | 63.8 | 82.9 | 89.2 | 77.1 |
+| Llama3 70b (rerun w/ api) | empty | 80.2 | 41.3 | 52.8 | 70.1 | 82.6 | 81.4 |
+| REFERENCE-REPORT | | (5-shot) | | | | | |
+| Claude-3-Opus (report[^5]) | unknown | 86.8 | 50.4 | 60.1 | 84.9 | **`90.7`** | 83.1 |
+| Gemini-Ultra-1.0 (report[^6]) | unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 |
+| Gemini-Pro-1.5 (report[^6]) | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 |
+| Llama3 8b (report[^7]) | unknown | 68.4 | 34.2 | 30.0 | 62.2 | n/a | 58.4 |
+| Llama3 70b (report[^7]) | unknown | 82.0 | 39.5 | 50.4 | 81.7 | n/a | 79.7 |
+| Llama3 400b (still training, report[^7]) | unknown | 86.1 | 48.0 | 57.8 | 84.1 | n/a | 83.5 |
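The DROP column reports token-level F1 rather than plain accuracy. As a reminder of what that metric computes, here is a simplified sketch; the official DROP scorer also normalizes numbers, articles, and punctuation, which this toy version skips.

```python
from collections import Counter

def token_f1(prediction: str, target: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = target.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: one of two predicted tokens matches the single gold token, so
# precision = 0.5, recall = 1.0, and F1 = 2 * 0.5 * 1.0 / 1.5 ≈ 0.67.
assert round(token_f1("12 touchdowns", "12"), 2) == 0.67
```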
[^1]: chatgpt system message: "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.\nKnowledge cutoff: 2023-12\nCurrent date: 2024-04-01"
[^2]: assistant system message, as in the [OpenAI API doc](https://platform.openai.com/docs/api-reference/introduction): "You are a helpful assistant."
[^3]: claude-3 empty system message, as suggested by the Anthropic API doc. We have done only limited experiments due to [rate limit](https://docs.anthropic.com/claude/reference/rate-limits) issues, but we welcome PRs with alternative choices.
[^4]: claude-3 lmsys system message: the system message in the LMSYS [FastChat open source code](https://github.com/lm-sys/FastChat/blob/7899355ebe32117fdae83985cf8ee476d2f4243f/fastchat/conversation.py#L894): "The assistant is Claude, created by Anthropic. The current date is {{currentDateTime}}. Claude's knowledge base was last updated ...". We have done only limited experiments due to [rate limit](https://docs.anthropic.com/claude/reference/rate-limits) issues, but we welcome PRs with alternative choices.
[^5]: claude-3 reports: [https://www.anthropic.com/news/claude-3-family](https://www.anthropic.com/news/claude-3-family).
-[^6]:gemini-1.5 reports: [https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/](https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/), we dont have rerun results due to [rate_limit](https://ai.google.dev/pricing) issues and paid-as-you-go version are still "coming soon" by the time of this study on 04/02.
+[^6]: gemini-1.5 reports: [https://goo.gle/GeminiV1-5](https://goo.gle/GeminiV1-5). We don't have rerun results due to [rate limit](https://ai.google.dev/pricing) issues; the pay-as-you-go version was still announced as coming on May 14 at the time of this study (05/11).
+[^7]: Llama 3 tech report: [https://ai.meta.com/blog/meta-llama-3/](https://ai.meta.com/blog/meta-llama-3/). Note that Llama 3 400b is still training; its numbers are the best of the reported pretrained/instruct 400b results.
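For concreteness, the two OpenAI prompt variants in the table differ only in the system message, quoted verbatim in footnotes 1 and 2 above. A hypothetical helper like the following is all it takes to switch between them, including the empty-system-message case used for the Claude reruns; `build_messages` is an illustrative sketch, not this library's API.

```python
# The system strings are copied verbatim from the footnotes above.
CHATGPT_SYSTEM = (
    "You are ChatGPT, a large language model trained by OpenAI, based on the "
    "GPT-4 architecture.\nKnowledge cutoff: 2023-12\nCurrent date: 2024-04-01"
)
ASSISTANT_SYSTEM = "You are a helpful assistant."

def build_messages(system_message: str | None, query: str) -> list[dict]:
    """Prepend the chosen system message; None reproduces the 'empty' rows."""
    messages = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": query})
    return messages
```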
## Legal Stuff
By contributing to evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.