FM Agent MLE-Benchmark Results #80
Conversation
Hi FM Agent team (cc @GZL11), thank you for the submission. Could you clarify the +/- 0 on the high split? Seems surprisingly precise!
Hi authors of MLE-Bench (cc @thesofakillers),
Interestingly, we found that these competitions did not require extensive time or iterative tuning: many were solved within 12 hours, and none exceeded 24 hours to reach medal performance. In contrast, for the remaining competitions in the high split, we were unable to achieve medal-level results within the 24-hour time limit, though we did obtain reasonably competitive scores.
Thanks for clarifying about the +/- 0, that sounds reasonable to me.
I think this all looks correct, but we need the grading reports and run_group_experiments.csv to be tracked with git LFS please. Thanks!
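For reference, tracking the result files with Git LFS typically looks like the following. This is a minimal sketch, assuming `git-lfs` is installed; the `grading_reports/` path is a hypothetical placeholder, not a path confirmed in this PR.

```shell
# One-time setup per machine/clone (assumes git-lfs is installed)
git lfs install

# Track the files by pattern; this writes the patterns into .gitattributes
git lfs track "*.csv"
git lfs track "grading_reports/**"   # hypothetical path for the grading reports

# Commit the .gitattributes change together with the (re-)added files
git add .gitattributes
git add run_group_experiments.csv
git commit -m "Track result artifacts with Git LFS"

# Verify the files are now stored as LFS pointers
git lfs ls-files
```

Note that files already committed as regular blobs stay in history; re-adding them after `git lfs track` is enough for the new commits, which is what matters for this review.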
these need to be LFS files
Thank you very much for your reminder. It has now been corrected!
let's keep this as an LFS file as before
Thank you very much for your reminder. It has now been corrected!
Thank you very much for your reminder. I sincerely apologize for not noticing this point when uploading previously. It has now been corrected!
LGTM! congrats
no worries!
Hello authors of MLE-Bench,
We are the FM Agent team from Baidu, and we are pleased to share that our FM Agent has achieved SOTA performance on the MLE-Bench benchmark.
Over recent months, we have developed an advanced agent based on the FM Agent framework that can systematically analyze problems and iteratively refine solutions to address complex end-to-end tasks, including various machine learning workloads.
To validate FM Agent, we conducted extensive experiments on MLE-Bench and rigorously evaluated our agent's performance.
As part of this pull request, we are contributing:
Resources Used:
Final proposed new result:
Our technical report, which contains more detailed insights into our work, is coming soon! We are truly grateful for the chance to contribute to MLE-Bench, and we sincerely hope that our findings will bring value to the broader community. We eagerly await your feedback and will work to address it promptly so this pull request can be merged.
Best regards,
The FM Agent Team (Baidu)