MLE-Bench: Autonomous ML Engineering Benchmark
- MLE-Bench is a benchmark suite that models machine learning engineering as a search problem over executable artifacts derived from Kaggle competitions.
- It compares search policies such as Greedy, MCTS, and Evolutionary search, showing that enhanced operators lift medal success rates to roughly 45-48% under the standard budget and above 50% with extended run time.
- The enhanced AIRA operator set accounts for most of the gain; exploration-capable search adds a further few percentage points, while validation-based artifact selection remains prone to overfitting that multi-submission strategies largely mitigate.
MLE-Bench is a benchmark suite for evaluating the ability of AI agents to autonomously engage in end-to-end machine learning engineering (MLE), operationalized through a diverse set of real-world Kaggle competitions. Each MLE-Bench task formalizes machine learning engineering as a search problem over the space of executable artifacts, providing a rigorous, reproducible testbed for agentic advances in automated machine learning, search strategies, and operator design (Toledo et al., 3 Jul 2025, Chan et al., 9 Oct 2024).
1. Formalization and Problem Structure
An MLE-Bench problem is defined as a search problem over the discrete space of ML solution artifacts. An artifact is typically a Jupyter notebook or Python script that 1) ingests a provided data directory, 2) defines and trains a model or pipeline, and 3) produces a submission.csv file in the specified Kaggle format. The search process operates via a finite set of high-level operators $\mathcal{O}$; each operator consumes a small number of existing artifacts (typically $1$ or $2$, or none for Draft) and emits a new one. The canonical operators are listed below (a minimal data-structure sketch follows the list):
- Draft: takes no parent artifact and synthesizes an initial candidate from scratch.
- Improve: takes one artifact and incrementally refines it.
- Debug: takes one artifact and repairs syntactic/semantic errors.
- Memory: injects context from previous artifacts into the prompting of other operators.
- Crossover: takes two artifacts and recombines them into a new solution.
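The following sketch shows one plausible way to represent artifacts and operator arities in code; the class, field, and dictionary names are illustrative rather than part of the benchmark's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Artifact:
    """One node in the search graph: an executable candidate solution."""
    code: str                         # script/notebook source that writes submission.csv
    operator: str = "draft"           # operator that produced this artifact
    parent_ids: Tuple[int, ...] = ()  # predecessors (none for Draft, two for Crossover)
    fitness: Optional[float] = None   # validation score in [0, 1]; None until evaluated
    is_buggy: bool = False            # True if the artifact failed to execute


# Arity of each canonical operator: how many existing artifacts it consumes.
OPERATOR_ARITY = {
    "draft": 0,      # synthesize a candidate from scratch
    "improve": 1,    # refine one existing artifact
    "debug": 1,      # repair one broken artifact
    "memory": 1,     # inject context from earlier artifacts into the prompt
    "crossover": 2,  # recombine two artifacts
}
```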
The evaluation function $f(a)$ returns the 5-fold cross-validated performance of an artifact $a$, typically rescaled to $[0, 1]$. Critically, agents only observe $f(a)$ computed on a held-out validation split, while the true contest metric on test data (which determines medals) is unseen during search (a runnable stand-in for this proxy is sketched below).
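In MLE-Bench the artifact's own script is executed and its submission.csv is graded; the scikit-learn snippet below is only a stand-in illustrating the shape of such a proxy fitness, with made-up rescaling bounds.

```python
# Illustrative proxy fitness f(a): a 5-fold cross-validated score on the
# agent's training data, min-max rescaled to [0, 1]. The worst/best reference
# scores are assumptions for the sake of the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def proxy_fitness(model, X, y, worst=0.5, best=1.0):
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    return float(np.clip((score - worst) / (best - worst), 0.0, 1.0))


X, y = make_classification(n_samples=500, n_features=20, random_state=0)
print(proxy_fitness(LogisticRegression(max_iter=1000), X, y))
```

The validation-only nature of $f$ is what opens the generalization gap discussed in Section 5.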
At each iteration $t$, agents maintain a search graph $G_t$, where nodes are artifacts and edges are labeled by the operator that produced each artifact from its predecessor(s). The root is an empty starting artifact.
2. Search Strategies and Algorithmic Frameworks
Three principal search policies are systematically studied with MLE-Bench:
2.1 Greedy Search (AIDE-Style)
This policy initializes with a small number of DRAFTs, then iteratively selects the current best node and applies:
- DRAFT, if fewer than $N_{\text{draft}}$ drafts exist;
- IMPROVE, if the selected artifact is valid (executes and produces a score);
- DEBUG, otherwise.
Greedy strictly exploits the best available node at each step. Hyperparameters include the draft budget $N_{\text{draft}}$ and an exploration probability $\epsilon$ for revisiting buggy artifacts. A minimal sketch of this selection rule follows.
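A minimal sketch of the greedy rule, reusing the illustrative Artifact fields from Section 1; the default draft budget and probability are assumed values, not the paper's settings.

```python
import random


def greedy_select(graph, num_drafts=5, eps_buggy=0.1):
    """AIDE-style rule: DRAFT until enough initial candidates exist, then
    IMPROVE the best valid artifact, otherwise DEBUG a buggy one. With
    probability eps_buggy a buggy artifact is revisited even when valid ones exist."""
    drafts = [a for a in graph if a.operator == "draft"]
    if len(drafts) < num_drafts:
        return "draft", None

    valid = [a for a in graph if not a.is_buggy and a.fitness is not None]
    buggy = [a for a in graph if a.is_buggy]

    if valid and not (buggy and random.random() < eps_buggy):
        return "improve", max(valid, key=lambda a: a.fitness)
    if buggy:
        return "debug", random.choice(buggy)
    return "draft", None  # nothing usable yet: draft again
```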
2.2 Monte-Carlo Tree Search (MCTS)
MCTS augments each node $v$ with a visit count $N(v)$ and a mean value estimate $\hat{\mu}(v)$. At each episode, it:
- Selects descendants maximizing $\hat{\mu}(v) + c\,\sqrt{\ln N(\mathrm{parent}(v)) / N(v)}$ (the UCT formula).
- Expands by sampling a valid operator to generate child nodes.
- Evaluates $f(v)$ for new nodes.
- Backs up results along the path to the root. The exploration constant $c$ modulates the exploration-exploitation balance (see the sketch below).
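A compact sketch of UCT selection and backup, assuming a hypothetical MCTSNode wrapper around artifacts; this is one standard realization of the formula above, not the paper's exact implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MCTSNode:
    artifact_id: int
    visits: int = 0
    mean_value: float = 0.0
    children: List["MCTSNode"] = field(default_factory=list)


def uct_child(node, c_uct=1.4):
    """Pick the child maximizing mean value plus the UCT exploration bonus."""
    def score(child):
        if child.visits == 0:
            return float("inf")  # expand unvisited children first
        explore = c_uct * math.sqrt(math.log(node.visits) / child.visits)
        return child.mean_value + explore
    return max(node.children, key=score)


def backup(path, reward):
    """Update visit counts and running means along the selected path."""
    for node in path:
        node.visits += 1
        node.mean_value += (reward - node.mean_value) / node.visits
```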
2.3 Evolutionary Search
Maintains a fixed-size population $P$ of artifacts. In each round, parents are sampled with probability proportional to their fitness $f$, offspring are generated via IMPROVE (with probability $p_{\text{improve}}$) or CROSSOVER (otherwise), followed by DEBUG as needed. Offspring replace the lowest-fitness individuals in $P$. A sketch of one generation is given below.
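One plausible generation step; `improve`, `crossover`, and `fitness` are caller-supplied callables standing in for the LLM operators and the validation proxy, and the default probabilities are assumptions.

```python
import random


def evolution_step(population, improve, crossover, fitness,
                   p_improve=0.7, n_offspring=4):
    """One generation: fitness-proportional parent sampling, IMPROVE or
    CROSSOVER offspring, then replacement of the weakest individuals."""
    weights = [max(fitness(a), 1e-6) for a in population]

    offspring = []
    for _ in range(n_offspring):
        if random.random() < p_improve:
            (parent,) = random.choices(population, weights=weights, k=1)
            offspring.append(improve(parent))
        else:
            p1, p2 = random.choices(population, weights=weights, k=2)
            offspring.append(crossover(p1, p2))
    # In the full agent, buggy offspring would be passed through DEBUG here.

    # Keep the population size fixed by dropping the lowest-fitness members.
    survivors = sorted(population, key=fitness, reverse=True)[:len(population) - n_offspring]
    return survivors + offspring
```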
3. Operator Set Design
Two main operator families are benchmarked to assess the impact of operator sophistication:
| Operator Set | Key Elements | Innovations |
|---|---|---|
| $\mathcal{O}_{\text{AIDE}}$ (baseline) | Draft, Improve, Debug, Memory (prompt-based LLM calls) | Memory = concatenation of all prior artifacts |
| $\mathcal{O}_{\text{AIRA}}$ (enhanced) | All of the above plus prompt-adaptive complexity, scoped memory, think-tokens | Dynamic prompt complexity; chain-of-thought; context limiting |
- Prompt-adaptive complexity: The system prompt selects "simple", "moderate", or "advanced" based on the node's out-degree, to prevent over-engineering in early search (see the tier-selection sketch after this list).
- Scoped memory: Only Draft/Improve see siblings; Debug operators receive the full debug chain.
- Think-tokens: Operators prompt for explicit, hidden chains of reasoning, doubling reasoning-token usage.
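A toy illustration of out-degree-based prompt-tier selection; the thresholds are invented for the example and are not the values used by $\mathcal{O}_{\text{AIRA}}$.

```python
def prompt_complexity(out_degree, thresholds=(2, 5)):
    """Map a node's out-degree to a prompt-complexity tier."""
    if out_degree < thresholds[0]:
        return "simple"      # early in the search: discourage over-engineering
    if out_degree < thresholds[1]:
        return "moderate"
    return "advanced"        # heavily expanded nodes get the richest prompt
```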
The operator-selection policy is uniform at random for MCTS and evolutionary search, and rule-based for Greedy.
4. Experimental Protocols and Results on MLE-Bench Lite
A subset of 22 Kaggle problems ("MLE-Bench Lite") is used for controlled experiments, with a 24-hour wall-clock budget per task and a hardware budget of one H200 GPU and 24 CPUs. The primary evaluation metric is the "medal success rate": the percentage of tasks on which the agent achieves at least a Kaggle bronze medal.
Key results:
- Baseline vs Enhanced Operators: Upgrading from $\mathcal{O}_{\text{AIDE}}$ to $\mathcal{O}_{\text{AIRA}}$ under Greedy lifts the mean medal rate from 39.6% to 45.5% (+14% relative).
- Full Comparison of Search Policies: The highest medal rate is achieved by MCTS with $\mathcal{O}_{\text{AIRA}}$ at 47.7%; evolutionary search with $\mathcal{O}_{\text{AIRA}}$ yields 46.9%.
- Anytime profiles: All enhanced agents outperform the baseline after approximately 15 hours. Extending the run time to 90 hours with MCTS yields a medal rate of up to 53%, after which overfitting becomes the limiting factor.
| Agent Configuration | Any Medal | Silver+ | Gold Only |
|---|---|---|---|
| AIRA ($\mathcal{O}_{\text{AIRA}}$, Greedy) | 45.5% | 34.2% | 23.8% |
| AIRA ($\mathcal{O}_{\text{AIRA}}$, MCTS) | 47.7% | 36.7% | 27.2% |
| AIRA ($\mathcal{O}_{\text{AIRA}}$, Evolutionary) | 46.9% | 37.1% | 27.5% |
Experiments demonstrate that stronger operator sets are a precondition for realizing gains from non-greedy search: with $\mathcal{O}_{\text{AIDE}}$, MCTS or evolutionary search improves medal rates by only 1–2% over Greedy, whereas $\mathcal{O}_{\text{AIRA}}$ achieves a 14% relative jump even under Greedy. Only with strong operators do advanced policies provide additional improvements (roughly 2–3% more).
5. Analysis of Search–Operator Interplay and Generalization Effects
In the low-capability regime, the bottleneck for agent performance is operator strength, not search sophistication. Advanced operators expand the search graph into more promising regions of the artifact space, and local moves like Improve/Crossover must reliably yield improvements in $f$ before MCTS or evolutionary methods can efficiently allocate compute to diverse, potentially high-fitness branches. This is evidenced by non-greedy search producing heightened performance gains only when $\mathcal{O}_{\text{AIRA}}$ is employed.
Overfitting is a persistent failure mode: validation-based artifact selection systematically underperforms an oracle that chooses by the actual private-test score by 9–13% absolute. Implementing a multi-submission strategy ("submit the top-$k$ validation artifacts, take the best test result") is effective: for small $k$, most of the generalization gap is closed (see the sketch below).
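A sketch of the top-$k$ selection rule, reusing the illustrative Artifact fields from Section 1; the default $k$ is an arbitrary example value, not the paper's setting.

```python
def top_k_submissions(artifacts, k=5):
    """Return the k non-buggy artifacts with the best validation fitness.
    Submitting all of them and keeping the best private-test result closes
    most of the gap to the oracle selector."""
    scored = [a for a in artifacts if not a.is_buggy and a.fitness is not None]
    return sorted(scored, key=lambda a: a.fitness, reverse=True)[:k]
```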
6. Conclusions and Future Research Avenues
The principal insight from MLE-Bench studies is the layered importance of (1) operator quality, (2) search policy, and (3) proxy-evaluation fidelity. Agentic progress on this suite is contingent on the joint design of strong, context-aware operators and global, exploration-capable search policies. Operator-centric innovations such as agentic submodules (e.g., software engineering agents), operator LLM fine-tuning (supervised or RL paradigms), and computation/memory scaling represent open frontiers.
Recommended avenues for future research include embedding software engineering agents directly into the operator set, specializing operators via supervised or reinforcement learning, scaling to longer task horizons, and benchmarking in continuous, contamination-controlled streams. Robust artifact-selection techniques beyond simple argmax-on-validation (multi-armed-bandit or uncertainty-aware strategies) are also highlighted as impactful, low-cost defenses against overfitting.
7. Significance and Benchmark Availability
MLE-Bench provides a high-fidelity environment for quantifying progress in automated machine learning engineering. Its structure—real-world Kaggle contests, rigorous human baselines, modular agent scaffolds, and transparent metrics—facilitates trustworthy, reproducible evaluation. Open-source releases of datasets, grading scripts, Dockerized scaffolds, and analytical tools are available at https://github.com/openai/mle-bench/. The benchmark is actively maintained with periodic updates to counter pretraining contamination and will continue to serve as a de facto yardstick for autonomous ML engineering research (Toledo et al., 3 Jul 2025, Chan et al., 9 Oct 2024).