MLE-Bench: Autonomous ML Engineering Benchmark
- MLE-Bench is a benchmark suite that models machine learning engineering as a search problem over executable artifacts derived from Kaggle competitions.
- It compares search policies such as Greedy, MCTS, and Evolutionary search, showing that enhanced operators lift medal success rates to roughly 45-48% under the standard budget and above 50% with extended run time.
- The enhanced AIRA operator set accounts for most of the gain; exploration-capable search adds a further few percentage points, while validation-based artifact selection remains prone to overfitting that multi-submission strategies largely mitigate.
MLE-Bench is a benchmark suite for evaluating the ability of AI agents to autonomously engage in end-to-end machine learning engineering (MLE), operationalized through a diverse set of real-world Kaggle competitions. Each MLE-Bench task formalizes machine learning engineering as a search problem over the space of executable artifacts, providing a rigorous, reproducible testbed for agentic advances in automated machine learning, search strategies, and operator design (Toledo et al., 3 Jul 2025, Chan et al., 9 Oct 2024).
1. Formalization and Problem Structure
An MLE-Bench problem is defined as a search problem over the discrete space of ML solution artifacts. An artifact is typically a Jupyter notebook or Python script that 1) ingests a provided data directory, 2) defines and trains a model or pipeline, and 3) produces a submission.csv file in the specified Kaggle format. The search process operates via a finite set of high-level operators $\mathcal{O}$; each operator consumes a small number of existing artifacts (typically $1$ or $2$, or none for Draft) and emits a new one. The canonical operators are listed below (a minimal data-structure sketch follows the list):
- Draft: takes no parent artifact and synthesizes an initial candidate from scratch.
- Improve: takes one artifact and incrementally refines it.
- Debug: takes one artifact and repairs syntactic/semantic errors.
- Memory: injects context from previous artifacts into the prompting of other operators.
- Crossover: takes two artifacts and recombines them into a new solution.
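The following sketch shows one plausible way to represent artifacts and operator arities in code; the class, field, and dictionary names are illustrative rather than part of the benchmark's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Artifact:
    """One node in the search graph: an executable candidate solution."""
    code: str                         # script/notebook source that writes submission.csv
    operator: str = "draft"           # operator that produced this artifact
    parent_ids: Tuple[int, ...] = ()  # predecessors (none for Draft, two for Crossover)
    fitness: Optional[float] = None   # validation score in [0, 1]; None until evaluated
    is_buggy: bool = False            # True if the artifact failed to execute


# Arity of each canonical operator: how many existing artifacts it consumes.
OPERATOR_ARITY = {
    "draft": 0,      # synthesize a candidate from scratch
    "improve": 1,    # refine one existing artifact
    "debug": 1,      # repair one broken artifact
    "memory": 1,     # inject context from earlier artifacts into the prompt
    "crossover": 2,  # recombine two artifacts
}
```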
The evaluation function $f(a)$ returns the 5-fold cross-validated performance of an artifact $a$, typically rescaled to $[0, 1]$. Critically, agents only observe $f(a)$ computed on a held-out validation split, while the true contest metric on test data (which determines medals) is unseen during search (a runnable stand-in for this proxy is sketched below).
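In MLE-Bench the artifact's own script is executed and its submission.csv is graded; the scikit-learn snippet below is only a stand-in illustrating the shape of such a proxy fitness, with made-up rescaling bounds.

```python
# Illustrative proxy fitness f(a): a 5-fold cross-validated score on the
# agent's training data, min-max rescaled to [0, 1]. The worst/best reference
# scores are assumptions for the sake of the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def proxy_fitness(model, X, y, worst=0.5, best=1.0):
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    return float(np.clip((score - worst) / (best - worst), 0.0, 1.0))


X, y = make_classification(n_samples=500, n_features=20, random_state=0)
print(proxy_fitness(LogisticRegression(max_iter=1000), X, y))
```

The validation-only nature of $f$ is what opens the generalization gap discussed in Section 5.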
At each iteration $t$, agents maintain a search graph $G_t$, where nodes are artifacts and edges are labeled by the operator that produced each artifact from its predecessor(s). The root is an empty starting artifact.
2. Search Strategies and Algorithmic Frameworks
Three principal search policies are systematically studied with MLE-Bench:
2.1 Greedy Search (AIDE-Style)
This policy initializes with a small number of DRAFTs, then iteratively selects the current best node and applies:
- DRAFT, if fewer than $N_{\text{draft}}$ drafts exist;
- IMPROVE, if the selected artifact is valid (executes and produces a score);
- DEBUG, otherwise.
Greedy strictly exploits the best available node at each step. Hyperparameters include the draft budget $N_{\text{draft}}$ and an exploration probability $\epsilon$ for revisiting buggy artifacts. A minimal sketch of this selection rule follows.
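A minimal sketch of the greedy rule, reusing the illustrative Artifact fields from Section 1; the default draft budget and probability are assumed values, not the paper's settings.

```python
import random


def greedy_select(graph, num_drafts=5, eps_buggy=0.1):
    """AIDE-style rule: DRAFT until enough initial candidates exist, then
    IMPROVE the best valid artifact, otherwise DEBUG a buggy one. With
    probability eps_buggy a buggy artifact is revisited even when valid ones exist."""
    drafts = [a for a in graph if a.operator == "draft"]
    if len(drafts) < num_drafts:
        return "draft", None

    valid = [a for a in graph if not a.is_buggy and a.fitness is not None]
    buggy = [a for a in graph if a.is_buggy]

    if valid and not (buggy and random.random() < eps_buggy):
        return "improve", max(valid, key=lambda a: a.fitness)
    if buggy:
        return "debug", random.choice(buggy)
    return "draft", None  # nothing usable yet: draft again
```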
2.2 Monte-Carlo Tree Search (MCTS)
MCTS augments each node $v$ with a visit count $N(v)$ and a mean value estimate $\hat{\mu}(v)$. At each episode, it:
- Selects descendants maximizing $\hat{\mu}(v) + c\,\sqrt{\ln N(\mathrm{parent}(v)) / N(v)}$ (the UCT formula).
- Expands by sampling a valid operator to generate child nodes.
- Evaluates $f(v)$ for new nodes.
- Backs up results along the path to the root. The exploration constant $c$ modulates the exploration-exploitation balance (see the sketch below).
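A compact sketch of UCT selection and backup, assuming a hypothetical MCTSNode wrapper around artifacts; this is one standard realization of the formula above, not the paper's exact implementation.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MCTSNode:
    artifact_id: int
    visits: int = 0
    mean_value: float = 0.0
    children: List["MCTSNode"] = field(default_factory=list)


def uct_child(node, c_uct=1.4):
    """Pick the child maximizing mean value plus the UCT exploration bonus."""
    def score(child):
        if child.visits == 0:
            return float("inf")  # expand unvisited children first
        explore = c_uct * math.sqrt(math.log(node.visits) / child.visits)
        return child.mean_value + explore
    return max(node.children, key=score)


def backup(path, reward):
    """Update visit counts and running means along the selected path."""
    for node in path:
        node.visits += 1
        node.mean_value += (reward - node.mean_value) / node.visits
```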
2.3 Evolutionary Search
Maintains a fixed-size population $P$ of artifacts. In each round, parents are sampled with probability proportional to their fitness $f$, offspring are generated via IMPROVE (with probability $p_{\text{improve}}$) or CROSSOVER (otherwise), followed by DEBUG as needed. Offspring replace the lowest-fitness individuals in $P$. A sketch of one generation is given below.
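One plausible generation step; `improve`, `crossover`, and `fitness` are caller-supplied callables standing in for the LLM operators and the validation proxy, and the default probabilities are assumptions.

```python
import random


def evolution_step(population, improve, crossover, fitness,
                   p_improve=0.7, n_offspring=4):
    """One generation: fitness-proportional parent sampling, IMPROVE or
    CROSSOVER offspring, then replacement of the weakest individuals."""
    weights = [max(fitness(a), 1e-6) for a in population]

    offspring = []
    for _ in range(n_offspring):
        if random.random() < p_improve:
            (parent,) = random.choices(population, weights=weights, k=1)
            offspring.append(improve(parent))
        else:
            p1, p2 = random.choices(population, weights=weights, k=2)
            offspring.append(crossover(p1, p2))
    # In the full agent, buggy offspring would be passed through DEBUG here.

    # Keep the population size fixed by dropping the lowest-fitness members.
    survivors = sorted(population, key=fitness, reverse=True)[:len(population) - n_offspring]
    return survivors + offspring
```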
3. Operator Set Design
Two main operator families are benchmarked to assess the impact of operator sophistication:
| Operator Set | Key Elements | Innovations |
|---|---|---|
| $\mathcal{O}_{\text{AIDE}}$ (baseline) | Draft, Improve, Debug, Memory (prompt-based LLM calls) | Memory = concatenation of all prior artifacts |
| $\mathcal{O}_{\text{AIRA}}$ (enhanced) | All of the above plus prompt-adaptive complexity, scoped memory, think-tokens | Dynamic prompt complexity; chain-of-thought; context limiting |
- Prompt-adaptive complexity: The system prompt selects "simple", "moderate", or "advanced" based on the node's out-degree, to prevent over-engineering in early search (see the tier-selection sketch after this list).
- Scoped memory: Only Draft/Improve see siblings; Debug operators receive the full debug chain.
- Think-tokens: Operators prompt for explicit, hidden chains of reasoning, doubling reasoning-token usage.
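A toy illustration of out-degree-based prompt-tier selection; the thresholds are invented for the example and are not the values used by $\mathcal{O}_{\text{AIRA}}$.

```python
def prompt_complexity(out_degree, thresholds=(2, 5)):
    """Map a node's out-degree to a prompt-complexity tier."""
    if out_degree < thresholds[0]:
        return "simple"      # early in the search: discourage over-engineering
    if out_degree < thresholds[1]:
        return "moderate"
    return "advanced"        # heavily expanded nodes get the richest prompt
```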
The operator-selection policy is uniform at random for MCTS and evolutionary search, and rule-based for Greedy.
4. Experimental Protocols and Results on MLE-Bench Lite
A subset of 22 Kaggle problems ("MLE-Bench Lite") is used for controlled experiments, with a 24-hour wall-clock budget per task and a hardware budget of one H200 GPU and 24 CPUs. The primary evaluation metric is the "medal success rate": the percentage of tasks on which the agent achieves at least a Kaggle bronze medal.
Key results:
- Baseline vs Enhanced Operators: Upgrading from $\mathcal{O}_{\text{AIDE}}$ to $\mathcal{O}_{\text{AIRA}}$ under Greedy lifts the mean medal rate from 39.6% to 45.5% (+14% relative).
- Full Comparison of Search Policies: The highest medal rate is achieved by MCTS with $\mathcal{O}_{\text{AIRA}}$ at 47.7%; evolutionary search with $\mathcal{O}_{\text{AIRA}}$ yields 46.9%.
- Anytime profiles: All enhanced agents outperform the baseline after approximately 15 hours. Extending the run time to 90 hours with MCTS yields a medal rate of up to 53%, after which overfitting becomes the limiting factor.
| Agent Configuration | Any Medal | Silver+ | Gold Only |
|---|---|---|---|
| AIRA ($\mathcal{O}_{\text{AIRA}}$, Greedy) | 45.5% | 34.2% | 23.8% |
| AIRA ($\mathcal{O}_{\text{AIRA}}$, MCTS) | 47.7% | 36.7% | 27.2% |
| AIRA ($\mathcal{O}_{\text{AIRA}}$, Evolutionary) | 46.9% | 37.1% | 27.5% |
Experiments demonstrate that stronger operator sets are a precondition for realizing gains from non-greedy search: with $\mathcal{O}_{\text{AIDE}}$, MCTS or evolutionary search improves medal rates by only 1–2% over Greedy, whereas $\mathcal{O}_{\text{AIRA}}$ achieves a 14% relative jump even under Greedy. Only with strong operators do advanced policies provide additional improvements (roughly 2–3% more).
5. Analysis of Search–Operator Interplay and Generalization Effects
In the low-capability regime, the bottleneck for agent performance is operator strength, not search sophistication. Advanced operators expand the search graph into more promising regions of the artifact space, and local moves like Improve/Crossover must reliably yield improvements in $f$ before MCTS or evolutionary methods can efficiently allocate compute to diverse, potentially high-fitness branches. This is evidenced by non-greedy search producing heightened performance gains only when $\mathcal{O}_{\text{AIRA}}$ is employed.
Overfitting is a persistent failure mode: validation-based artifact selection systematically underperforms an oracle that chooses by the actual private-test score by 9–13% absolute. Implementing a multi-submission strategy ("submit the top-$k$ validation artifacts, take the best test result") is effective: for small $k$, most of the generalization gap is closed (see the sketch below).
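A sketch of the top-$k$ selection rule, reusing the illustrative Artifact fields from Section 1; the default $k$ is an arbitrary example value, not the paper's setting.

```python
def top_k_submissions(artifacts, k=5):
    """Return the k non-buggy artifacts with the best validation fitness.
    Submitting all of them and keeping the best private-test result closes
    most of the gap to the oracle selector."""
    scored = [a for a in artifacts if not a.is_buggy and a.fitness is not None]
    return sorted(scored, key=lambda a: a.fitness, reverse=True)[:k]
```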
6. Conclusions and Future Research Avenues
The principal insight from MLE-Bench studies is the layered importance of (1) operator quality, (2) search policy, and (3) proxy-evaluation fidelity. Agentic progress on this suite is contingent on the joint design of strong, context-aware operators and global, exploration-capable search policies. Operator-centric innovations such as agentic submodules (e.g., software engineering agents), operator LLM fine-tuning (supervised or RL paradigms), and computation/memory scaling represent open frontiers.
Recommended avenues for future research include embedding software engineering agents directly into the operator set, specializing operators via supervised or reinforcement learning, scaling to longer task horizons, and benchmarking in continuous, contamination-controlled streams. Robust artifact-selection techniques beyond simple argmax-on-validation (multi-armed-bandit or uncertainty-aware strategies) are also highlighted as impactful, low-cost defenses against overfitting.
7. Significance and Benchmark Availability
MLE-Bench provides a high-fidelity environment for quantifying progress in automated machine learning engineering. Its structure—real-world Kaggle contests, rigorous human baselines, modular agent scaffolds, and transparent metrics—facilitates trustworthy, reproducible evaluation. Open-source releases of datasets, grading scripts, Dockerized scaffolds, and analytical tools are available at https://github.com/openai/mle-bench/. The benchmark is actively maintained with periodic updates to counter pretraining contamination and will continue to serve as a de facto yardstick for autonomous ML engineering research (Toledo et al., 3 Jul 2025, Chan et al., 9 Oct 2024).