
Commit 294cb1f

Author: openai
Committed: update readme and evals for gpt-4o
1 parent 267835b commit 294cb1f

12 files changed, +215 -161 lines

README.md

+28 -18

@@ -1,6 +1,6 @@
 # Overview
 This repository contains a lightweight library for evaluating language models.
-We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models (starting with `gpt-4-turbo-2024-04-09`).
+We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models (starting with `gpt-4-turbo-2024-04-09` and `gpt-4o`).
 
 Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries.
 Some use few-shot prompts or role playing prompts ("You are an expert software programmer...").
@@ -64,28 +64,38 @@ This will launch evaluations through the OpenAI API.
 
 
 ## Benchmark Results
-| Model | Prompt |DROP(f1)| GPQA% | MATH% | MGSM% | MMLU% |HumanEval% |
-|:-----------------------------:|:-------------:|:------:|:-------:|:-------:|:-------:|:-------:|:---------:|
-| GPT4s | | | | | | | |
-| gpt-4-turbo-2024-04-09 | chatgpt[^1] | 85.4 | 49.1 | 72.2 | 88.6 | 86.5 | 87.6 |
-| gpt-4-turbo-2024-04-09 | assistant[^2] | 86.0 | 49.3 | 73.4 | 89.6 | 86.7 | 88.2 |
-| gpt-4-1106(-vision)-preview | chatgpt | 81.3 | 42.1 | 64.1 | 86.5 | 84.6 | 82.2 |
-| gpt-4-1106(-vision)-preview | assistant | 83.2 | 42.5 | 64.3 | 87.1 | 84.7 | 83.7 |
-| gpt-4-0125-preview | chatgpt | 83.4 | 39.7 | 64.2 | 83.7 | 84.8 | 88.2 |
-| gpt-4-0125-preview | assistant | 81.5 | 41.4 | 64.5 | 85.1 | 85.4 | 86.6 |
-| REFERENCE | | | | | | |
-| Claude-3-Opus (rerun w/ api) | empty[^3] | 79.0 | 49.7 | 63.2 | 89.7 | 84.1 | 84.8 |
-| Claude-3-Opus (rerun w/ api) | lmsys[^4] | 77.1 | 50.7 | 63.8 | 89.2 | 84.2 | 82.9 |
-| Claude-3-Opus (report[^5]) | unknown | 83.1 | 50.4 | 60.1 | 90.7 | 86.8 | 84.9 |
-| Gemini-Ultra-1.0 (report[^6]) | unknown | 82.4 | n/a | 53.2 | 79.0 | 83.7 | 74.4 |
-| Gemini-Pro-1.5 (report[^6]) | unknown | 78.9 | n/a | 58.5 | 88.7 | 81.9 | 71.9 |
+| Model | Prompt | MMLU | GPQA | MATH | HumanEval| MGSM | DROP<br>(F1,3-shot) |
+|:----------------------------|:-------------:|:------:|:-------:|:------:|:--------:|:------:|:------:|
+| OPENAI GPT4s | | | | | | | |
+| gpt-4o | chatgpt[^1] |**`88.7`**|**`53.6`**|**`76.6`**| 90.2| 90.5 | 83.4 |
+| gpt-4o | assistant[^2] | 87.2 | 49.9 |**`76.6`**|**`91.0`**| 89.9 | 83.7 |
+| gpt-4-turbo-2024-04-09 | chatgpt | 86.5 | 49.1 | 72.2 | 87.6 | 88.6 | 85.4 |
+| gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 |**`86.0`**|
+| gpt-4-1106(-vision)-preview | chatgpt | 84.6 | 42.1 | 64.1 | 82.2 | 86.5 | 81.3 |
+| gpt-4-1106(-vision)-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 |
+| gpt-4-0125-preview | chatgpt | 84.8 | 39.7 | 64.2 | 88.2 | 83.7 | 83.4 |
+| gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 |
+| REFERENCE-RERUN | | | | | | | |
+| Claude-3-Opus (rerun w/ api) | empty[^3] | 84.1 | 49.7 | 63.2 | 84.8 | 89.7 | 79.0 |
+| Claude-3-Opus (rerun w/ api) | lmsys[^4] | 84.2 | 50.7 | 63.8 | 82.9 | 89.2 | 77.1 |
+| Llama3 70b (rerun w/ api) | empty | 80.2 | 41.3 | 52.8 | 70.1 | 82.6 | 81.4 |
+| REFERENCE-REPORT | |(5-shot)| | | | | |
+| Claude-3-Opus (report[^5]) | unknown | 86.8 | 50.4 | 60.1 | 84.9 |**`90.7`**| 83.1 |
+| Gemini-Ultra-1.0 (report[^6])| unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 |
+| Gemini-Pro-1.5 (report[^6]) | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 |
+| Llama3 8b (report[^7]) | unknown | 68.4 | 34.2 | 30.0 | 62.2 | n/a | 58.4 |
+| Llama3 70b (report[^7]) | unknown | 82.0 | 39.5 | 50.4 | 81.7 | n/a | 79.7 |
+| Llama3 400b (still training, report[^7])| unknown | 86.1 | 48.0 | 57.8 | 84.1 | n/a | 83.5 |
+
 
 [^1]:chatgpt system message: "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.\nKnowledge cutoff: 2023-12\nCurrent date: 2024-04-01"
 [^2]:assistant system message in [OpenAI API doc](https://platform.openai.com/docs/api-reference/introduction): "You are a helpful assistant." .
 [^3]:claude-3 empty system message: suggested by Anthropic API doc, and we have done limited experiments due to [rate limit](https://docs.anthropic.com/claude/reference/rate-limits) issues, but we welcome PRs with alternative choices.
 [^4]:claude-3 lmsys system message: system message in LMSYS [Fast-chat open source code](https://github.com/lm-sys/FastChat/blob/7899355ebe32117fdae83985cf8ee476d2f4243f/fastchat/conversation.py#L894): "The assistant is Claude, created by Anthropic. The current date is {{currentDateTime}}. Claude's knowledge base was last updated ... ". We have done limited experiments due to [rate limit](https://docs.anthropic.com/claude/reference/rate-limits) issues, but we welcome PRs with alternative choices.
 [^5]:claude-3 reports: [https://www.anthropic.com/news/claude-3-family](https://www.anthropic.com/news/claude-3-family).
-[^6]:gemini-1.5 reports: [https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/](https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/), we dont have rerun results due to [rate_limit](https://ai.google.dev/pricing) issues and paid-as-you-go version are still "coming soon" by the time of this study on 04/02.
+[^6]:gemini-1.5 reports: [https://goo.gle/GeminiV1-5](https://goo.gle/GeminiV1-5), we dont have rerun results due to [rate_limit](https://ai.google.dev/pricing) issues and paid-as-you-go version are still "coming at May 14" by the time of this study on 05/11.
+[^7]:Llama 3 tech report: [https://ai.meta.com/blog/meta-llama-3/](https://ai.meta.com/blog/meta-llama-3/). Note Llama 400b is still training and these numbers are based on the best of their pretrain/instruct Llama 400b numbers.
+
 
 ## Legal Stuff
-By contributing to evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.
+By contributing to evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.
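Note: the Prompt column in the new table corresponds one-to-one with the footnoted system messages. As a rough illustration of how they map onto a run, here is a minimal sketch assuming the `ChatCompletionSampler` interface used in this commit's demo.py hunk below; the import path is a guess and may differ:

```python
# Sketch only: wiring footnotes [^1]/[^2] into sampler configuration.
# The import path is an assumption; see demo.py in this commit for the real wiring.
from sampler.chat_completion_sampler import ChatCompletionSampler  # hypothetical path

CHATGPT_SYSTEM_MESSAGE = (
    "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.\n"
    "Knowledge cutoff: 2023-12\nCurrent date: 2024-04-01"
)
ASSISTANT_SYSTEM_MESSAGE = "You are a helpful assistant."

samplers = {
    "gpt-4o_chatgpt": ChatCompletionSampler(model="gpt-4o", system_message=CHATGPT_SYSTEM_MESSAGE),
    "gpt-4o_assistant": ChatCompletionSampler(model="gpt-4o", system_message=ASSISTANT_SYSTEM_MESSAGE),
}
```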

common.py

+101 -2

@@ -7,7 +7,106 @@
 import numpy as np
 from tqdm import tqdm
 
-from .types import EvalResult, Message, SingleEvalResult
+from .types import EvalResult, Message, SamplerBase, SingleEvalResult
+
+QUERY_TEMPLATE_MULTICHOICE = """
+Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.
+
+{Question}
+
+A) {A}
+B) {B}
+C) {C}
+D) {D}
+""".strip()
+
+ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer\s*:\s*([A-D])"
+ANSWER_PATTERN = r"(?i)Answer\s*:\s*([^\n]+)"
+
+
+EQUALITY_TEMPLATE = r"""
+Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
+
+Examples:
+
+Expression 1: $2x+3$
+Expression 2: $3+2x$
+
+Yes
+
+Expression 1: 3/2
+Expression 2: 1.5
+
+Yes
+
+Expression 1: $x^2+2x+1$
+Expression 2: $y^2+2y+1$
+
+No
+
+Expression 1: $x^2+2x+1$
+Expression 2: $(x+1)^2$
+
+Yes
+
+Expression 1: 3245/5
+Expression 2: 649
+
+No
+(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
+
+Expression 1: 2/(-3)
+Expression 2: -2/3
+
+Yes
+(trivial simplifications are allowed)
+
+Expression 1: 72 degrees
+Expression 2: 72
+
+Yes
+(give benefit of the doubt to units)
+
+Expression 1: 64
+Expression 2: 64 square feet
+
+Yes
+(give benefit of the doubt to units)
+
+---
+
+YOUR TASK
+
+
+Respond with only "Yes" or "No" (without quotes). Do not include a rationale.
+
+Expression 1: %(expression1)s
+Expression 2: %(expression2)s
+""".strip()
+
+
+HTML_JINJA = """
+<h3>Prompt conversation</h3>
+{% for message in prompt_messages %}
+{{ message_to_html(message) | safe }}
+{% endfor %}
+<h3>Sampled message</h3>
+{{ message_to_html(next_message) | safe }}
+<h3>Results</h3>
+<p>Correct Answer: {{ correct_answer }}</p>
+<p>Extracted Answer: {{ extracted_answer }}</p>
+<p>Score: {{ score }}</p>
+"""
+
+
+def format_multichoice_question(row):
+    return QUERY_TEMPLATE_MULTICHOICE.format(**row)
+
+
+def check_equality(sampler: SamplerBase, expr1: str, expr2: str):
+    prompt = EQUALITY_TEMPLATE % {"expression1": expr1, "expression2": expr2}
+    response = sampler([dict(content=prompt, role="user")])
+    return response.lower().strip() == "yes"
 
 
 def _compute_stat(values: list, stat: str):
@@ -152,7 +251,7 @@ def message_to_html(message: Message) -> str:
 {% endif %}
 <h1>Examples</h1>
 {% for html in htmls %}
-{{ html }}
+{{ html | safe }}
 <hr>
 {% endfor %}
 </body>
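The new helpers above are what the evals below switch to: `format_multichoice_question` builds the shared multiple-choice prompt and `check_equality` turns the grader template into a single yes/no call against an equality-checker sampler. A minimal sketch of a MATH-style grading step built from these helpers; the import path and the instruction appended to the question are assumptions, not code from this commit:

```python
import re

# Assumed import path; ANSWER_PATTERN and check_equality are the helpers added above.
from common import ANSWER_PATTERN, check_equality


def grade_math_example(sampler, equality_checker, question: str, reference_answer: str) -> float:
    # Ask the model, nudging it to end with an "Answer: ..." line (wording is illustrative).
    prompt = question + '\n\nThink step by step, then write a line of the form "Answer: $ANSWER".'
    response_text = sampler([dict(content=prompt, role="user")])
    # Pull out the final answer and let the equality-checker model judge equivalence.
    match = re.search(ANSWER_PATTERN, response_text)
    extracted_answer = match.group(1) if match else None
    if extracted_answer is None:
        return 0.0
    return 1.0 if check_equality(equality_checker, reference_answer, extracted_answer) else 0.0
```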

demo.py

+11 -1

@@ -31,15 +31,25 @@ def main():
             model="gpt-4-turbo-2024-04-09",
             system_message=OPENAI_SYSTEM_MESSAGE_CHATGPT,
         ),
+        "gpt-4o_assistant": ChatCompletionSampler(
+            model="gpt-4o",
+            system_message=OPENAI_SYSTEM_MESSAGE_API,
+            max_tokens=2048,
+        ),
+        "gpt-4o_chatgpt": ChatCompletionSampler(
+            model="gpt-4o",
+            system_message=OPENAI_SYSTEM_MESSAGE_CHATGPT,
+            max_tokens=2048,
+        ),
         # claude models:
         # "claude-3-opus-20240229_empty": ClaudeCompletionSampler(
         #     model="claude-3-opus-20240229", system_message=None,
         # ),
     }
 
     equality_checker = ChatCompletionSampler(model="gpt-4-turbo-preview")
-
     # ^^^ used for fuzzy matching, just for math
+
     def get_evals(eval_name):
         match eval_name:
             case "mmlu":

drop_eval.py

+2 -4

@@ -16,7 +16,7 @@
 from scipy.optimize import linear_sum_assignment
 
 from . import common
-from .mmlu_eval import HTML_JINJA
+from .common import ANSWER_PATTERN, HTML_JINJA
 from .types import Eval, EvalResult, SamplerBase, SingleEvalResult
 
 """
@@ -28,8 +28,6 @@
 /eval/drop_eval.py
 """
 
-ANSWER_PATTERN = r"(?i)Answer\s*:\s*([^\n]+)"
-
 
 def _remove_articles(text: str) -> str:
     regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
@@ -282,7 +280,7 @@ def fn(example: dict[str, str]):
             prompt += """\n
 Think step by step, then write a line of the form "Answer: $ANSWER" at the end of your response.
 """
-            prompt_messages = [dict(content=prompt, role="user")]
+            prompt_messages = [sampler._pack_message(content=prompt, role="user")]
             response_text = sampler(prompt_messages)
             match = re.search(ANSWER_PATTERN, response_text)
             extracted_answer = match.group(1) if match else response_text
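The substantive change here is that raw message dicts are replaced by `sampler._pack_message(...)`, so each sampler controls its own message schema. This commit does not show `_pack_message` itself; the following is only a guess at its likely shape, assuming the default packing is an OpenAI-style chat message:

```python
# Hypothetical sketch of SamplerBase._pack_message; the real implementation is
# not part of this commit and may differ per sampler backend.
class SamplerBase:
    def _pack_message(self, content: str, role: str) -> dict:
        # Default packing: an OpenAI chat-completions style message.
        return {"role": role, "content": content}
```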

gpqa_eval.py

+7 -7

@@ -11,14 +11,10 @@
 import pandas
 
 from . import common
-from .mmlu_eval import ANSWER_PATTERN, HTML_JINJA, QUERY_TEMPLATE
+from .common import ANSWER_PATTERN_MULTICHOICE, HTML_JINJA, format_multichoice_question
 from .types import Eval, EvalResult, MessageList, SamplerBase, SingleEvalResult
 
 
-def format_question(row):
-    return QUERY_TEMPLATE.format(**row)
-
-
 class GPQAEval(Eval):
     def __init__(
         self,
@@ -55,9 +51,13 @@ def fn(row: dict):
             choices_dict = dict(
                 A=choices[0], B=choices[1], C=choices[2], D=choices[3], Question=row["Question"]
             )
-            prompt_messages = [dict(content=format_question(choices_dict), role="user")]
+            prompt_messages = [
+                sampler._pack_message(
+                    content=format_multichoice_question(choices_dict), role="user"
+                )
+            ]
             response_text = sampler(prompt_messages)
-            match = re.search(ANSWER_PATTERN, response_text)
+            match = re.search(ANSWER_PATTERN_MULTICHOICE, response_text)
             extracted_answer = match.group(1) if match else None
             score = 1.0 if extracted_answer == correct_answer else 0.0
             html = common.jinja_env.from_string(HTML_JINJA).render(
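GPQA now reuses the shared multiple-choice prompt and extracts the chosen letter with `ANSWER_PATTERN_MULTICHOICE` rather than the generic `ANSWER_PATTERN`. A small self-contained sketch of that extraction and scoring step; the response text and gold label are made up for illustration:

```python
import re

# The pattern added to common.py in this commit.
ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer\s*:\s*([A-D])"

# Hypothetical sampled response and gold label, for illustration only.
response_text = "Options B and C violate conservation of energy.\nAnswer: D"
correct_answer = "D"

match = re.search(ANSWER_PATTERN_MULTICHOICE, response_text)
extracted_answer = match.group(1) if match else None
score = 1.0 if extracted_answer == correct_answer else 0.0
print(extracted_answer, score)  # D 1.0
```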

humaneval_eval.py

+4 -2

@@ -21,7 +21,7 @@
 from human_eval.execution import check_correctness  # , unsafe_execute
 
 from . import common
-from .mmlu_eval import HTML_JINJA
+from .common import HTML_JINJA
 from .types import Eval, EvalResult, SamplerBase, SingleEvalResult
 
 
@@ -84,7 +84,9 @@ def find_code(completion):
             return extracted_answer
 
         def fn(sample: dict[str, str]):
-            prompt_messages = [{"role": "user", "content": instruction + sample["prompt"]}]
+            prompt_messages = [
+                sampler._pack_message(role="user", content=instruction + sample["prompt"])
+            ]
             completions = [
                 find_code(sampler(prompt_messages)) for _ in range(self._num_samples_per_task)
             ]
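`find_code` (defined earlier in this file and unchanged by this commit) reduces each sampled completion to runnable code before `check_correctness` executes it. A rough sketch of the kind of extraction it presumably performs, assuming completions wrap their code in a fenced ```python block; this is an illustration, not the file's actual implementation:

```python
import re


def find_code_sketch(completion: str) -> str:
    # Hypothetical stand-in for find_code: take the first fenced Python block,
    # otherwise fall back to the raw completion text.
    matches = re.findall(r"```python\n(.*?)```", completion, re.DOTALL)
    return matches[0] if matches else completion


sample_completion = "Sure, here it is:\n```python\ndef add(a, b):\n    return a + b\n```"
print(find_code_sketch(sample_completion))
```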
