@GZL11 (Contributor) commented Oct 10, 2025

Hello authors of MLE-Bench,
We are the FM Agent team from Baidu, and we are pleased to share that FM Agent has achieved state-of-the-art (SOTA) performance on the MLE-Bench benchmark.
Over recent months, we have built an agent on the FM Agent framework that systematically analyzes problems and iteratively refines solutions for complex end-to-end tasks, including a range of machine learning workloads.
To validate FM Agent's performance, we conducted extensive experiments on MLE-Bench.
As part of this pull request, we are contributing:

  • Detailed grade reports in mle-bench/runs/fmagent_group[1-3], covering three independent runs across all competitions, following the standard evaluation practices recommended by MLE-Bench.
  • A summary results table, included below, reporting performance averaged over the three runs for each split.

Resources Used:

  • For each run: 64 vCPUs, 500 GB RAM, and 1× A800 GPU, with a 24-hour time limit.
  • Resources were configured in line with the official MLE-Bench guidelines.

Final proposed new result:

| Agent | LLM(s) used | Low == Lite (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Date | Grading Reports Available | Source Code Available |
|---|---|---|---|---|---|---|---|---|---|
| FM Agent | Gemini-2.5-Pro | 62.12 ± 3.03 | 36.84 ± 2.63 | 33.33 ± 0 | 43.56 ± 1.78 | 24 | 2025-10-10 | | X |
| Operand ensemble | gpt-5 (low verbosity/effort)¹ | 63.64 ± 5.92 | 33.33 ± 4.42 | 20.00 ± 5.96 | 39.56 ± 3.26 | 24 | 2025-10-06 | | X |
| InternAgent | deepseek-r1 | 62.12 ± 3.03 | 26.32 ± 2.63 | 24.44 ± 2.22 | 36.44 ± 1.18 | 12 | 2025-09-12 | | X |
| R&D-Agent | gpt-5 | 68.18 ± 2.62 | 21.05 ± 1.52 | 22.22 ± 2.22 | 35.11 ± 0.44 | 12 | 2025-09-26 | | |
| Neo multi-agent | undisclosed | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 36 | 2025-07-28 | | X |
| R&D-Agent | o3 + GPT-4.1 | 51.52 ± 4 | 19.3 ± 3.16 | 26.67 ± 0 | 30.22 ± 0.89 | 24 | 2025-08-15 | | |
| ML-Master | deepseek-r1 | 48.5 ± 1.5 | 20.2 ± 2.3 | 24.4 ± 2.2 | 29.3 ± 0.8 | 12 | 2025-06-17 | | |
| R&D-Agent | o1-preview | 48.18 ± 1.1 | 8.95 ± 1.05 | 18.67 ± 1.33 | 22.4 ± 0.5 | 24 | 2025-05-14 | | |
| AIDE | o1-preview | 34.3 ± 2.4 | 8.8 ± 1.1 | 10.0 ± 1.9 | 16.9 ± 1.1 | 24 | 2024-10-08 | | |
| AIDE | gpt-4o-2024-08-06 | 19.0 ± 1.3 | 3.2 ± 0.5 | 5.6 ± 1.0 | 8.6 ± 0.5 | 24 | 2024-10-08 | | |
| AIDE | claude-3-5-sonnet-20240620 | 19.4 ± 4.9 | 2.6 ± 1.5 | 2.3 ± 2.3 | 7.5 ± 1.8 | 24 | 2024-10-08 | | |
| OpenHands | gpt-4o-2024-08-06 | 11.5 ± 3.4 | 2.2 ± 1.3 | 1.9 ± 1.9 | 5.1 ± 1.3 | 24 | 2024-10-08 | | |
| AIDE | llama-3.1-405b-instruct | 8.3 ± 2.6 | 1.2 ± 0.8 | 0.0 ± 0.0 | 3.1 ± 0.9 | 24 | 2024-10-08 | | |
| MLAB | gpt-4o-2024-08-06 | 4.2 ± 1.5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 1.3 ± 0.5 | 24 | 2024-10-08 | | |
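
As a sanity check on the table, the "All" column should be the competition-weighted average of the three splits. The sketch below recomputes it for the FM Agent row. The split sizes (22 Low, 38 Medium, 15 High, 75 in total) are an assumption taken from the MLE-Bench task list rather than from this thread, and the helper name is hypothetical.

```python
# ASSUMPTION: split sizes from the MLE-Bench task list (75 competitions in total).
SPLIT_SIZES = {"low": 22, "medium": 38, "high": 15}


def overall_medal_rate(split_rates):
    """Competition-weighted average of per-split medal rates (in percent)."""
    total = sum(SPLIT_SIZES.values())
    weighted = sum(SPLIT_SIZES[name] * rate for name, rate in split_rates.items())
    return weighted / total


# Per-split means reported for FM Agent in the table.
fm_agent = {"low": 62.12, "medium": 36.84, "high": 33.33}
overall = overall_medal_rate(fm_agent)
print(f"{overall:.2f}")
```

Under these assumptions the result is about 43.55, which agrees with the reported 43.56 up to rounding of the per-split inputs.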

Our technical report, which contains more detailed insights into our work, is coming soon. We are grateful for the chance to contribute to MLE-Bench and hope our findings will be useful to the broader community. We look forward to your feedback and will work to get this pull request merged promptly.
Best regards,
The FM Agent Team (Baidu)

@thesofakillers (Collaborator)

Hi FM Agent team (cc @GZL11), thank you for the submission.

Could you clarify the +/- 0 on the high split?

Seems surprisingly precise!

@GZL11 (Contributor Author) commented Oct 16, 2025

> Hi FM Agent team (cc @GZL11), thank you for the submission.
>
> Could you clarify the +/- 0 on the high split?
>
> Seems surprisingly precise!

Hi authors of MLE-Bench (cc @thesofakillers),
Thank you very much for your feedback and for taking the time to review our submission.
To clarify the "±0" result on the high split: FM Agent performed strongly on a specific subset of competitions, as reflected in the grade reports we uploaded (in mle-bench/runs/fmagent_group[1-3]). In each of the following five competitions, we earned a medal in all three runs:

  • iwildcam-2019-fgvc6
  • predict-volcanic-eruptions-ingv-oe
  • rsna-miccai-brain-tumor-radiogenomic-classification
  • stanford-covid-vaccine
  • vinbigdata-chest-xray-abnormalities-detection

Interestingly, these competitions did not require extensive time or iterative tuning: many were solved within 12 hours, and none needed more than 24 hours to reach medal performance. For the remaining competitions in the high split, we did not reach medal-level results within the 24-hour time limit, though we did obtain reasonably competitive scores. Each run therefore produced the same medal count on the high split, which gives a spread of zero.
This pattern appears consistent with other entries on the leaderboard, which also performed well on these tasks but struggled to make similar breakthroughs on the others. It supports our observation that certain competitions in the high split are more amenable to rapid progress with our FM Agent approach. Further architectural details and analysis will be available in our forthcoming technical report.
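
The arithmetic behind the ± 0 can be sketched as follows. It assumes the high split contains 15 competitions (taken from the MLE-Bench task list, not stated in this thread) and that the reported spread is the standard error of the mean over runs. Both are assumptions, although any spread statistic is zero when every run gives the same value.

```python
import statistics

HIGH_SPLIT_SIZE = 15  # ASSUMPTION: number of competitions in the high split

# Medals per run on the high split: the same five competitions in each of three runs.
medals_per_run = [5, 5, 5]

# Medal rate per run, in percent.
rates = [100 * m / HIGH_SPLIT_SIZE for m in medals_per_run]
mean_rate = statistics.mean(rates)
# ASSUMED spread statistic: standard error of the mean (sample std / sqrt(n)).
sem = statistics.stdev(rates) / len(rates) ** 0.5

print(f"{mean_rate:.2f} ± {sem:.2f}")
```

With identical medal counts in every run, this gives a mean of 33.33% and a spread of zero, matching the table entry.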
Thank you again for your attention, and please feel free to let us know if you have any further questions.
Best regards,
FM Agent Team

@thesofakillers (Collaborator) left a comment

Thanks for clarifying about the +/- 0, that sounds reasonable to me.

I think this looks all correct, but we need the grading reports and run_group_experiments.csv to be tracked with git LFS please. Thanks!
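
One way to confirm that a file is tracked with Git LFS is to check that the blob committed to the repository is an LFS pointer rather than the full file contents. The sketch below tests for the pointer layout defined by the Git LFS specification (a `version` line followed by `oid` and `size` lines). The function name and the sample contents are hypothetical, and in practice the `git lfs ls-files` command reports the same information.

```python
LFS_VERSION_LINE = "version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(text):
    """Return True if `text` looks like a Git LFS pointer file."""
    lines = text.strip().splitlines()
    if not lines or lines[0] != LFS_VERSION_LINE:
        return False
    # Remaining lines are "key value" pairs; a pointer needs both oid and size.
    keys = {line.split(" ", 1)[0] for line in lines[1:]}
    return {"oid", "size"} <= keys


# Hypothetical pointer contents, for illustration only.
sample_pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0" * 64 + "\n"
    "size 12345\n"
)
# Hypothetical plain CSV contents (a file committed without LFS).
sample_csv = "competition_id,score\n"

print(is_lfs_pointer(sample_pointer))  # True
print(is_lfs_pointer(sample_csv))      # False
```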

Collaborator

these need to be LFS files

Contributor Author

Thank you very much for your reminder. It has now been corrected!

Collaborator

let's keep this as an LFS file as before

Contributor Author

Thank you very much for your reminder. It has now been corrected!

@GZL11 (Contributor Author) commented Oct 17, 2025

> Thanks for clarifying about the +/- 0, that sounds reasonable to me.
>
> I think this looks all correct, but we need the grading reports and run_group_experiments.csv to be tracked with git LFS please. Thanks!

Thank you very much for the reminder. I sincerely apologize for missing this in the earlier upload. It has now been corrected!

@thesofakillers (Collaborator) left a comment

LGTM! congrats

@thesofakillers merged commit 630e1f4 into openai:main on Oct 17, 2025

@thesofakillers (Collaborator)

> I sincerely apologize

no worries!
