Skip to content

Commit a43321f

Browse files
Merge pull request #244 from trishullab/george
Update miscellaneous data
2 parents 7f01461 + 5adef23 commit a43321f

File tree

4 files changed

+22
-12
lines changed

4 files changed

+22
-12
lines changed

README.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# PutnamBench
1+
# [PutnamBench](https://arxiv.org/abs/2407.11214)
22

33
<p align="center">
44
<a href="https://trishullab.github.io/PutnamBench/leaderboard.html"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-%23ff8811"></a>
@@ -32,9 +32,9 @@ We also report the number of problems in a certain category. Note that some prob
3232
| Number Theory | 108 |
3333
| Geometry | 69 |
3434
| Linear Algebra | 51 |
35-
| Abstract Algebra | 28 |
3635
| Combinatorics | 29 |
37-
| Probability | 10 |
36+
| Abstract Algebra | 28 |
37+
| Probability | 10 |
3838
| Set Theory | 8 |
3939

4040
## Citation

docs/index.html

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -176,8 +176,8 @@ <h2 class="title is-3">Abstract</h2>
176176
We present PutnamBench, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems.
177177
</p>
178178
<p>
179-
PutnamBench consists of 1697 hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America.
180-
All the theorems, of which there are 640, have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations.
179+
PutnamBench consists of 1696 hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America.
180+
There are 644 problems formalized in Lean 4, 640 formalized in Isabelle, and 412 formalized in Coq.
181181
</p>
182182
<p>
183183
Proving the theorems requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PutnamBench to evaluate several established neural and symbolic theorem-provers.

docs/leaderboard.html

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -134,7 +134,7 @@ <h3 class="fw-light text-nowrap">
134134
<!-- <div id="chart" style="width: 100%; height: 600px"></div> -->
135135
<div class="container-fluid d-flex flex-row flex-nowrap">
136136
<div class="container-fluid d-flex flex-column align-items-center">
137-
<label for="lean" style="font-size: 14px;" class="text-success mb-3">Lean (out of 640)</label>
137+
<label for="lean" style="font-size: 14px;" class="text-success mb-3">Lean (out of 644)</label>
138138
<table
139139
id="lean"
140140
class="table table-responsive table-striped table-bordered flex-shrink-1 border border-success border-3"
@@ -148,7 +148,7 @@ <h3 class="fw-light text-nowrap">
148148
></table>
149149
</div>
150150
<div class="container-fluid d-flex flex-column align-items-center">
151-
<label for="coq" style="font-size: 14px;" class="text-success mb-3">Coq (out of 417)</label>
151+
<label for="coq" style="font-size: 14px;" class="text-success mb-3">Coq (out of 412)</label>
152152
<table
153153
id="coq"
154154
class="table table-responsive table-striped table-bordered flex-shrink-1 border border-danger border-3"

docs/results.json

Lines changed: 15 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@
77
},
88
"size": 0,
99
"condensed-compute-budget": "pass@10",
10-
"note": "pass@10 w/ T=0.1"
10+
"note": "pass@10 w/ T=0.1; 1971 B1, 1986 B1, 1995 B1, 2001 A1."
1111
},
1212
"GPT-4o": {
1313
"link": "https://openai.com/index/hello-gpt-4o/",
@@ -19,7 +19,7 @@
1919
},
2020
"size": 0,
2121
"condensed-compute-budget": "pass@10",
22-
"note": "pass@10 w/ T=0.7"
22+
"note": "pass@10 w/ T=0.7; 1988 B1."
2323
},
2424
"COPRA (GPT-4o)": {
2525
"link": "https://arxiv.org/abs/2310.04353",
@@ -30,7 +30,7 @@
3030
},
3131
"size": 0,
3232
"condensed-compute-budget": "pass@1",
33-
"note": "pass@1 w/ T=0.0"
33+
"note": "pass@1 w/ T=0.0; 1988 B1."
3434
},
3535
"CoqHammer": {
3636
"link": "https://coqhammer.github.io/",
@@ -50,7 +50,7 @@
5050
},
5151
"size": 0,
5252
"condensed-compute-budget": "pass@1",
53-
"note": "pass@1 w/ t=120s"
53+
"note": "pass@1 w/ t=120s; 1971 B1, 2001 A1, 2012 A2."
5454
},
5555
"ReProver w/ retrieval": {
5656
"link": "https://arxiv.org/abs/2306.15626",
@@ -100,6 +100,16 @@
100100
},
101101
"size": 7,
102102
"condensed-compute-budget": "pass@596",
103-
"note": "7 hour cumulative online proof search on 256 GPUs"
103+
"note": "7 hour cumulative online proof search on 256 GPUs; 1977 A3, 1986 A1, 1986 A2, 1986 B1, 1988 B1, 1988 B2, 2001 A1. Commit: 8aaccf8"
104+
},
105+
"InternLM2.5-StepProver": {
106+
"link": "https://arxiv.org/pdf/2410.15700",
107+
"open-data": "FULL",
108+
"num-solved": {
109+
"lean-wsolution": 6
110+
},
111+
"size": 8.8,
112+
"condensed-compute-budget": "pass@2x32x600",
113+
"note": "pass @ 2 x 32 states x 600 state expansions proof search; 1977 A3, 1986 A1, 1986 B1, 1988 B1, 1988 B2, 2001 A1. Commit: b377431"
104114
}
105115
}

0 commit comments

Comments
 (0)