
MLE-Bench: Autonomous ML Engineering Benchmark

Updated 15 November 2025
  • MLE-Bench is a benchmark suite that models machine learning engineering as a search problem over executable artifacts derived from Kaggle competitions.
  • It is used to compare search policies such as greedy search, MCTS, and evolutionary methods, showing that enhanced operators, paired with extended search budgets, can push medal success rates above 50%.
  • Enhanced operator sets such as AIRA markedly improve performance and are a precondition for non-greedy search policies to pay off, while overfitting to validation-based evaluation remains a key failure mode.

MLE-Bench is a benchmark suite for evaluating the ability of AI agents to autonomously engage in end-to-end machine learning engineering (MLE), operationalized through a diverse set of real-world Kaggle competitions. Each MLE-Bench task formalizes machine learning engineering as a search problem over the space of executable artifacts, providing a rigorous, reproducible testbed for agentic advances in automated machine learning, search strategies, and operator design (Toledo et al., 3 Jul 2025, Chan et al., 9 Oct 2024).

1. Formalization and Problem Structure

An MLE-Bench problem is defined as a search problem over the discrete space $S$ of ML solution artifacts. An artifact $s \in S$ is typically a Jupyter notebook or Python script that (1) ingests a provided data directory, (2) defines and trains a model or pipeline, and (3) produces a submission.csv file in the specified Kaggle format. The search process operates via a finite set of high-level operators $\mathcal{O} = \{o_1, \dots, o_L\}$; each operator $o_\ell : S^m \to S$ (typically $m = 1$ or $2$) transforms existing artifacts. The canonical operators are:

  • Draft: $o_{\text{Draft}}(v_0) \mapsto v$, synthesizes an initial candidate from scratch.
  • Improve: $o_{\text{Improve}}(v) \mapsto v'$, incrementally refines an artifact.
  • Debug: $o_{\text{Debug}}(v) \mapsto v'$, repairs syntactic/semantic errors.
  • Memory: $o_{\text{Memory}}(\{v_i\})$, injects context from previous artifacts.
  • Crossover: $o_{\text{Xover}}(v_i, v_j) \mapsto v'$, recombines solutions.

Evaluation functions $f : S \to \mathbb{R}$ return the 5-fold cross-validated performance of $s$, typically rescaled to $[0, 1]$. Critically, agents receive $f(s)$ computed only on a held-out validation split, while the true contest metric on test data (which determines medals) is unseen during search.
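
A proxy fitness of this form can be illustrated with standard cross-validation tooling. The minimal sketch below assumes a scikit-learn-style estimator and a task-specific rescaling into $[0,1]$; the function names (`evaluate_artifact`, `rescale`) are illustrative and not part of the MLE-Bench API.

```python
from sklearn.model_selection import cross_val_score

def rescale(raw_score, lo=0.0, hi=1.0):
    # Map a raw metric into [0, 1]; the actual mapping is competition-specific
    # (assumption made for this illustration).
    return min(max((raw_score - lo) / (hi - lo), 0.0), 1.0)

def evaluate_artifact(estimator, X, y, scoring="accuracy"):
    # Proxy fitness f(s): mean 5-fold cross-validated score on the data the
    # agent can see; the private Kaggle test metric remains hidden during search.
    scores = cross_val_score(estimator, X, y, cv=5, scoring=scoring)
    return rescale(scores.mean())
```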

At each iteration $t$, agents maintain a search graph $\mathcal{G}_t = (V_t, E_t)$, whose nodes are artifacts $V_t \subseteq S$ and whose edges are labeled by the operator that produced each artifact from its predecessor. The root $v_0$ is an empty starting artifact.
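
A minimal sketch of this bookkeeping, with illustrative field names rather than the benchmark's actual data model, might look as follows:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """One node in the search graph: an executable solution script."""
    code: str                        # notebook/script source
    fitness: float | None = None     # f(s) on the validation split; None if buggy
    parent: Artifact | None = None   # predecessor artifact (None for the root)
    operator: str | None = None      # operator that produced this node, e.g. "IMPROVE"

@dataclass
class SearchGraph:
    """G_t = (V_t, E_t); edges are implied by each node's parent/operator labels."""
    root: Artifact
    nodes: list[Artifact] = field(default_factory=list)

    def add(self, node: Artifact) -> None:
        self.nodes.append(node)

    def best(self) -> Artifact:
        # argmax over evaluated nodes, as used by the greedy policy in Section 2.1
        evaluated = [v for v in self.nodes if v.fitness is not None]
        return max(evaluated, key=lambda v: v.fitness)
```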

2. Search Strategies and Algorithmic Frameworks

Three principal search policies are systematically studied with MLE-Bench:

2.1 Greedy Search (AIDE-Style)

This policy initializes with a small number $n_d$ of DRAFTs, then iteratively selects $v^\star = \arg\max_{v \in V} f(v)$ and applies:

  • DRAFT if there are not yet $n_d$ drafts,
  • IMPROVE if $v^\star$ is valid,
  • DEBUG otherwise.

Greedy search strictly exploits the best available node at each step. Hyperparameters include $n_d$ and an exploration probability $\varepsilon_{\text{bug}}$ for revisiting buggy artifacts.
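
One plausible rendering of a single greedy iteration, reusing the `SearchGraph` sketch above and assuming `operators` maps operator names to callables that return new artifacts (all names and defaults are illustrative, not the reference implementation):

```python
import random

def greedy_step(graph, operators, n_d=5, eps_bug=0.1):
    # One AIDE-style greedy iteration (illustrative sketch).
    drafts = [v for v in graph.nodes if v.operator == "DRAFT"]
    if len(drafts) < n_d:
        # Seed the graph until n_d drafts exist.
        child = operators["DRAFT"](graph.root)
    else:
        buggy = [v for v in graph.nodes if v.fitness is None]
        if buggy and random.random() < eps_bug:
            # Occasionally revisit a buggy artifact instead of the incumbent best.
            v_star = random.choice(buggy)
        else:
            v_star = max(
                graph.nodes,
                key=lambda v: v.fitness if v.fitness is not None else float("-inf"),
            )
        # IMPROVE a valid incumbent, DEBUG an invalid one.
        op = operators["IMPROVE"] if v_star.fitness is not None else operators["DEBUG"]
        child = op(v_star)
    graph.add(child)
    return child
```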

2.2 Monte-Carlo Tree Search (MCTS)

MCTS augments each node $v$ with a visit count $N(v)$ and a mean value estimate $Q(v)$. At each episode, it:

  1. Selects descendants maximizing $Q(w) + c\sqrt{\frac{\ln N(u)}{N(w) + \epsilon}}$ (the UCT formula), where $u$ is the parent of candidate $w$.
  2. Expands by sampling a valid operator to generate child nodes.
  3. Evaluates $f(v)$ for each new node $v$.
  4. Backs up results along the path to the root. The parameter $c$ modulates the exploration-exploitation balance; the selection and backup steps are sketched below.
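
The following sketch illustrates the selection and backup steps; the per-node `visits`, `value`, and `parent` attributes are assumptions of this illustration, not the published implementation.

```python
import math

def uct_select(parent, children, c=1.4, eps=1e-6):
    # Step 1: pick the child w maximizing Q(w) + c * sqrt(ln N(u) / (N(w) + eps)),
    # where u is the parent node (assumes the parent has been visited at least once).
    return max(
        children,
        key=lambda w: w.value + c * math.sqrt(math.log(parent.visits) / (w.visits + eps)),
    )

def backup(node, reward):
    # Step 4: propagate a new evaluation along the path to the root,
    # keeping Q(v) as a running mean of the rewards observed below v.
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent
```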

2.3 Evolutionary Search

This policy maintains a fixed population $V_t$ of $n$ artifacts. In each round, parents are sampled proportionally to $f(v)$, and offspring are generated via IMPROVE (with probability $p_{\text{imp}}$) or CROSSOVER, followed by DEBUG as needed. Offspring replace the lowest-fitness individuals in $V_t$.
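
A single round of this loop might look like the following sketch, again with illustrative names (`p_imp`, `n_offspring`) and the assumption that `evaluate` returns None for buggy artifacts:

```python
import random

def evolutionary_round(population, operators, evaluate, p_imp=0.7, n_offspring=4):
    # One generation of the evolutionary policy (illustrative sketch).
    weights = [max(v.fitness or 0.0, 1e-9) for v in population]  # fitness-proportional sampling
    offspring = []
    for _ in range(n_offspring):
        if random.random() < p_imp:
            parent = random.choices(population, weights=weights, k=1)[0]
            child = operators["IMPROVE"](parent)
        else:
            pa, pb = random.choices(population, weights=weights, k=2)
            child = operators["CROSSOVER"](pa, pb)
        child.fitness = evaluate(child)
        if child.fitness is None:
            # Repair buggy offspring before they enter the population.
            child = operators["DEBUG"](child)
            child.fitness = evaluate(child)
        offspring.append(child)
    # Replace the lowest-fitness individuals to keep the population size fixed.
    ranked = sorted(
        population,
        key=lambda v: v.fitness if v.fitness is not None else float("-inf"),
    )
    return ranked[len(offspring):] + offspring
```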

3. Operator Set Design

Two main operator families are benchmarked to assess the impact of operator sophistication:

| Operator Set | Key Elements | Innovations |
| --- | --- | --- |
| $\mathcal{O}_{\text{AIDE}}$ (baseline) | Draft, Improve, Debug, Memory (prompt-based LLM calls) | Memory = concatenation of all prior artifacts |
| $\mathcal{O}_{\text{AIRA}}$ (enhanced) | All of the above, plus prompt-adaptive complexity, scoped memory, and think-tokens | Dynamic prompt complexity; chain-of-thought; context limiting |

  • Prompt-adaptive complexity: The system prompt selects "simple", "moderate", or "advanced" based on the node’s out-degree to prevent over-engineering in early search.
  • Scoped memory: Only Draft/Improve see siblings; Debug operators receive the full debug chain.
  • Think-tokens: Operators are prompted to produce an explicit chain of reasoning that is kept out of the final artifact, roughly doubling reasoning-token usage.

The operator selection policy $\pi_{\text{op}}(v)$ is uniform for MCTS and evolutionary search, and rule-based for Greedy.
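
The prompt-adaptive complexity rule can be illustrated with a small helper that inspects a node's out-degree in the search graph sketched earlier; the tier thresholds here are illustrative assumptions, not the published configuration.

```python
def prompt_complexity(graph, node, simple_max=1, moderate_max=3):
    # Choose the system-prompt tier from how many children the node already has,
    # so that early, lightly explored nodes receive simpler prompts.
    out_degree = sum(1 for v in graph.nodes if v.parent is node)
    if out_degree <= simple_max:
        return "simple"
    if out_degree <= moderate_max:
        return "moderate"
    return "advanced"
```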

4. Experimental Protocols and Results on MLE-Bench Lite

A subset of 22 Kaggle problems ("MLE-Bench Lite") is used for controlled experiments, with a 24-hour wall-clock budget per task and a hardware budget of one H200 GPU and 24 CPU cores. The primary evaluation metric is the "medal success rate": the percentage of tasks on which the agent achieves at least a Kaggle bronze medal.

Key results:

  • Baseline vs. Enhanced Operators: Upgrading from $\mathcal{O}_{\text{AIDE}}$ to $\mathcal{O}_{\text{AIRA}}$ under greedy search lifts the mean medal rate from 39.6% to 45.5% (+14% relative).
  • Full Comparison of Search Policies: The highest medal rate, 47.7%, is achieved by MCTS with $\mathcal{O}_{\text{AIRA}}$; evolutionary search with $\mathcal{O}_{\text{AIRA}}$ yields 46.9%.
  • Anytime profiles: All enhanced agents outperform the baseline after approximately 15 hours. Extending the run time to 90 hours with MCTS yields up to a 53% medal rate, after which overfitting becomes limiting.

| Agent Configuration | Any Medal | Silver+ | Gold Only |
| --- | --- | --- | --- |
| AIRA$_{\text{greedy}}$ (AIRA, Greedy) | 45.5% | 34.2% | 23.8% |
| AIRA$_{\text{MCTS}}$ (AIRA, MCTS) | 47.7% | 36.7% | 27.2% |
| AIRA$_{\text{evolutionary}}$ (AIRA, Evolutionary) | 46.9% | 37.1% | 27.5% |

Experiments demonstrate that stronger operator sets are a precondition for realizing gains from non-greedy search: with $\mathcal{O}_{\text{AIDE}}$, MCTS or evolutionary search improves medal rates by only 1–2% over Greedy, whereas $\mathcal{O}_{\text{AIRA}}$ achieves a 14% relative jump even under Greedy. Only with strong operators do advanced policies provide additional improvements (~2–3% more).

5. Analysis of Search–Operator Interplay and Generalization Effects

In the low-capability regime, the bottleneck for agent performance is operator strength, not search sophistication. Advanced operators expand the search graph into more promising regions of $S$: local moves like Improve and Crossover must reliably yield improvements in $f$ before MCTS or evolutionary methods can efficiently allocate compute to diverse, potentially high-fitness branches. This is evidenced by the fact that non-greedy search yields sizable gains only when $\mathcal{O}_{\text{AIRA}}$ is employed.

Overfitting is a persistent failure mode: validation-based artifact selection systematically underperforms an oracle that chooses by actual private-test score by 9–13% absolute. Implementing a multi-submission strategy ("submit the top-$k$ validation artifacts, take the best test result") is effective: for $k \leq 5$, most of the generalization gap is closed.
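
A sketch of that selection rule, assuming the same artifact objects as in the earlier sketches (field names illustrative):

```python
def top_k_by_validation(artifacts, k=5):
    # Keep the k artifacts with the highest validation fitness; all k are submitted,
    # and the best private-test score among them counts toward the medal.
    valid = [v for v in artifacts if v.fitness is not None]
    return sorted(valid, key=lambda v: v.fitness, reverse=True)[:k]
```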

6. Conclusions and Future Research Avenues

The principal insight from MLE-Bench studies is the layered importance of (1) operator quality, (2) search policy, and (3) proxy-evaluation fidelity. Agentic progress on this suite is contingent on the joint design of strong, context-aware operators and global, exploration-capable search policies. Operator-centric innovations such as agentic submodules (e.g., software engineering agents), operator LLM fine-tuning (supervised or RL paradigms), and computation/memory scaling represent open frontiers.

Recommended avenues for future research include embedding software engineering agents directly into the operator set, specializing operators via supervised or reinforcement learning, scaling to longer task horizons, and benchmarking in continuous, contamination-controlled streams. Robust artifact-selection techniques beyond simple argmax-on-validation (multi-armed-bandit or uncertainty-aware strategies) are also highlighted as impactful, low-cost defenses against overfitting.

7. Significance and Benchmark Availability

MLE-Bench provides a high-fidelity environment for quantifying progress in automated machine learning engineering. Its structure—real-world Kaggle contests, rigorous human baselines, modular agent scaffolds, and transparent metrics—facilitates trustworthy, reproducible evaluation. Open-source releases of datasets, grading scripts, Dockerized scaffolds, and analytical tools are available at https://github.com/openai/mle-bench/. The benchmark is actively maintained with periodic updates to counter pretraining contamination and will continue to serve as a de facto yardstick for autonomous ML engineering research (Toledo et al., 3 Jul 2025, Chan et al., 9 Oct 2024).
