MLE-bench: Autonomous ML Engineering Benchmark

Updated 7 July 2025
  • MLE-bench is a benchmark framework that tests AI and LLM agents by executing complete ML workflows on curated Kaggle competitions.
  • It employs structured scaffolding, human-referenced leaderboards, and rigorous resource controls to ensure replicability and practical relevance.
  • The framework has revealed key insights into agent limitations and spurred extensions like interactive RL environments and search-driven refinements.

MLE-bench is a benchmark framework and curated task suite designed to systematically evaluate the capabilities of AI agents—including LLM-based agents—at end-to-end machine learning engineering. Unlike synthetic code generation or algorithmic reasoning benchmarks, MLE-bench directly tests ML autonomy within real-world settings by leveraging a collection of diverse Kaggle competitions, specialized agent scaffolding, human-referenced leaderboards, and rigorous resource controls. The benchmark has rapidly become a focal point for research on automated machine learning, revealing key limitations and advances of LLM-driven agents and spawning a suite of companion methods (2410.07095, 2505.07782, 2506.15692, 2507.02554).

1. Design Principles and Composition

MLE-bench was constructed to measure the ability of AI and LLM agents to execute the complete workflow of machine learning engineering—encompassing data acquisition, preprocessing, modeling, validation, iterative debugging, and competitive submission (2410.07095). Its core consists of 75 hand-selected ML competitions sourced from the Kaggle platform, chosen to ensure:

  • Diversity across domains (tabular, image, NLP, multimodal, sequence) and tasks (classification, regression, ranking, and multi-class).
  • Replicability, by reconstructing private test splits and grading code to closely match real Kaggle leaderboards.
  • Relevance to modern ML practices, with tasks requiring genuine engineering beyond toy pipelines.

Each task package features:

  • The original competition description and data schema.
  • Curated datasets with private test splits.
  • Reference grading/evaluation code that implements the original metric.

The benchmark codebase is fully open-sourced to support transparent experimentation and reproducibility (see github.com/openai/mle-bench/) (2410.07095).
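
To make the task-package structure concrete, the hedged sketch below shows how a reference grading script might score a submission against the reconstructed private test split. The function names, file layout, and choice of metric (AUC) are illustrative assumptions, not the actual mle-bench interfaces.

```python
# Hypothetical grading sketch: each real mle-bench task package ships its own
# grading code, so the names and metric below are illustrative only.
import pandas as pd
from sklearn.metrics import roc_auc_score


def grade_submission(submission_csv: str, private_test_csv: str) -> float:
    """Score a submission against the reconstructed private test split,
    mirroring how a task package's reference grading code applies the
    original Kaggle metric (AUC is used here purely as an example)."""
    submission = pd.read_csv(submission_csv)   # expected columns: id, prediction
    answers = pd.read_csv(private_test_csv)    # expected columns: id, label
    merged = answers.merge(submission, on="id", how="left")
    if merged["prediction"].isna().any():
        raise ValueError("Submission is missing predictions for some test ids")
    return roc_auc_score(merged["label"], merged["prediction"])
```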

2. Evaluation Methodology and Metrics

MLE-bench employs an agent-driven evaluation pipeline:

  • Human Baselines: Leaderboards from Kaggle serve as a reference for human performance, defining percentiles for bronze, silver, and gold medals per competition.
  • Agent Execution: LLM or hybrid agents run in dockerized Ubuntu 20.04 containers, typically constrained to 36 CPUs, 440GB RAM, and a single A10 (24GB) GPU; each run lasts up to 24 hours.
  • Scaffolding: Agents operate via open-source agent scaffolds—such as AIDE, MLAB, and OpenHands—providing orchestration for multi-step solution planning, iterative debugging, and submission validation.
  • Multi-Seed Runs: Each experiment is repeated across multiple random seeds to account for non-determinism in agent sampling, resource contention, and LLM generation.
  • Performance Metrics:

    • pass@k: The estimated proportion of competitions in which an agent secures at least one medal within k independent attempts:

      \text{pass@}k = \mathbb{E}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]

      where n is the total number of attempts and c is the number of medal-winning attempts (2410.07095).
    • Medal Success Rate: The percentage of tasks on which an agent's submission achieves at least a bronze, silver, or gold medal.
    • Above-Median Rate: The frequency with which an agent surpasses the median human score among competitors (2506.15692).
    • HumanRank Score: s = 1 - p/N for leaderboard position p out of N entrants; used to normalize performance across competitions (2505.07782).
    • Other Metrics: Area under the performance profile (AUP) and Elo scores for pairwise comparison among agent solutions (2505.07782).

This methodology enables fair, transparent, and statistically robust head-to-head comparisons among agents, humans, and across repeated runs.
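
As a concrete illustration of the metric definitions above, the following minimal Python sketch implements the pass@k estimator and the HumanRank score; the function names and example numbers are for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts,
    drawn from n total attempts of which c won a medal, is successful."""
    if n - c < k:
        return 1.0  # every size-k subset of attempts contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def human_rank(position: int, num_entrants: int) -> float:
    """HumanRank score s = 1 - p/N: values near 1.0 approach the top of the
    leaderboard, values near 0.0 the bottom."""
    return 1.0 - position / num_entrants


# Example: 2 medal-winning runs out of 8 seeds
print(pass_at_k(n=8, c=2, k=1))                      # 0.25
print(pass_at_k(n=8, c=2, k=8))                      # 1.0
print(human_rank(position=120, num_entrants=1500))   # 0.92
```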

3. Findings, Baseline Results, and Resource Scaling

Initial evaluations using MLE-bench highlight that proficiency at ML engineering remains a significant challenge for modern LLM agents, with nontrivial but modest baselines established by foundation models and scaffolds:

  • OpenAI o1-preview + AIDE achieved a medal in approximately 16.9% of competitions on single runs (pass@1), improving to higher rates with increased attempts (pass@8) (2410.07095).
  • GPT-4o with the AIDE scaffold earned medals in about 8.7% of tasks on the first attempt, with additional attempts nearly doubling that rate, indicating clear gains from more time and repeated exploration (2410.07095).
  • Resource Scaling: Experiments increasing compute (CPU-only, additional GPUs) revealed only marginal or no improvement in performance, indicating that agent strategies under current scaffolds do not saturate available hardware or benefit from parallelization (2410.07095).
  • Runtime Scaling: Extending wall-clock time beyond 24 hours allows iterative approaches to gradually improve, but many gains accrue during early exploration—subsequent improvements taper off due to agent idling, crash recovery, or local minima in solution quality (2410.07095).

Moreover, the benchmark exposed key bottlenecks—agents frequently fail due to errors in data pipeline setup, incorrect submission formatting, lack of robust debugging routines, and limitations in long-horizon planning or dependency management.

4. Extensions: MLE-Dojo, MLE-STAR, and Search Policy Investigations

MLE-bench has catalyzed a new lineage of research platforms and agent methodologies.

MLE-Dojo: Interactive Environments

MLE-Dojo (2505.07782) transforms the static offline evaluation of MLE-bench into a fully interactive, Gym-style reinforcement learning environment. Key characteristics include:

  • Framing each competition as a partially observable Markov decision process (POMDP), enabling agents to iteratively request information, validate/execute code, and adapt via structured reward (based on human leaderboard positions).
  • Allowing trajectory sampling under both supervised and reinforcement learning, thus supporting explicit agent training and not just evaluation.
  • Covering over 200 real Kaggle competitions, ensuring a broad and varied task set.
  • Open-sourcing the framework, encouraging ongoing agent development, live leaderboard submission, and reproducibility in experiments.

This interactive, feedback-rich setting aligns more closely with the iterative engineering behaviors observed in expert human practitioners (2505.07782).
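
The sketch below illustrates the Gym-style POMDP loop in spirit only: the environment class, action format, and reward shaping are toy assumptions rather than MLE-Dojo's actual API, with the reward derived from the agent's standing relative to a simulated human leaderboard.

```python
# Minimal, self-contained sketch of the POMDP framing: one "competition" where
# the agent's action carries a validation score and the reward is its position
# relative to a simulated human leaderboard. Names are illustrative assumptions.
import random
from dataclasses import dataclass, field

random.seed(0)


@dataclass
class CompetitionEnvSketch:
    leaderboard: list = field(default_factory=lambda: [random.random() for _ in range(1000)])
    max_steps: int = 10
    steps: int = 0

    def reset(self):
        self.steps = 0
        return {"description": "predict target from tabular features", "history": []}

    def step(self, action):
        self.steps += 1
        score = action["validation_score"]
        # HumanRank-style reward: fraction of leaderboard entries beaten.
        reward = sum(s < score for s in self.leaderboard) / len(self.leaderboard)
        done = self.steps >= self.max_steps
        return {"last_score": score}, reward, done, {}


env = CompetitionEnvSketch()
obs, done = env.reset(), False
while not done:
    # A real agent would request information, write code, and debug here;
    # we just sample a stand-in validation score for the current attempt.
    action = {"validation_score": random.random()}
    obs, reward, done, info = env.step(action)
```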

MLE-STAR: External Knowledge and Targeted Refinement

MLE-STAR (2506.15692) advances agent design through two methodological innovations:

  • Web-based Model Retrieval: Agents query a search engine for recent, relevant models and competitive code—mitigating LLMs' temporal knowledge lag and unlocking fresh strategies.
  • Targeted Component Refinement: Using ablation studies, the agent isolates and incrementally refines high-impact code blocks (e.g., feature engineering sections), instead of refactoring the entire pipeline at each step.
  • Novel Ensembling: Multiple independently-refined candidates are combined using agent-generated ensemble schemes (averaging, stacking), often yielding higher robustness and performance.
  • Performance: MLE-STAR achieves medals in roughly 44% of MLE-bench tasks (43.9% with Gemini-2.0-Flash), significantly surpassing earlier agents such as AIDE (25.8% medal rate) and DS-Agent (2506.15692).

This agent design demonstrates that external retrieval, coupled with fine-grained, component-wise exploration, is critical for maximizing achievable ML engineering performance (2506.15692).
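
The following self-contained sketch illustrates the idea of ablation-guided component targeting on a toy pipeline: each named component is disabled in turn, and the component whose removal causes the largest validation drop is selected for refinement. The component decomposition, dataset, and scoring routine are assumptions for illustration and do not reproduce MLE-STAR's actual implementation.

```python
# Toy ablation-guided targeting in the spirit of MLE-STAR's component refinement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)


def build_features(X, use_interactions: bool, use_scaling: bool):
    """The 'pipeline' is decomposed into named components the agent can toggle."""
    feats = [X]
    if use_interactions:
        feats.append(X[:, :5] * X[:, 5:10])   # toy feature-interaction block
    out = np.hstack(feats)
    if use_scaling:
        out = (out - out.mean(0)) / (out.std(0) + 1e-9)
    return out


def score(**components) -> float:
    Xf = build_features(X, **components)
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, Xf, y, cv=3).mean()


baseline = {"use_interactions": True, "use_scaling": True}
full = score(**baseline)
# Ablate one component at a time; the largest drop marks the block to refine next.
impact = {name: full - score(**{**baseline, name: False}) for name in baseline}
target = max(impact, key=impact.get)
print(f"Refine '{target}' next (ablation impact: {impact[target]:.4f})")
```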

Search Strategies and Operator Sets

Recent work (2507.02554) formalizes AI research agents as search policies using operator sets over candidate solution spaces:

  • Search Policies: Greedy, Monte Carlo Tree Search (MCTS), and Evolutionary algorithms differ in selection (π_sel), exploration-exploitation tradeoffs, and propagation of fitness estimates.
  • Operators: The AIRA operator set introduces complexity-adaptive code generation, scoped memory, and structured reflection, outperforming the original AIDE operator set.
  • Results: The combination of advanced operator sets (AIRA) and non-greedy search (e.g., MCTS or evolutionary) achieves up to 47.7% success in MLE-bench Lite, compared to 39.6% for baseline greedy+AIDE. The interplay between search and operator design is critical for practical advances—operator set improvements alone do not guarantee success without complementary search algorithms.
  • Generalization: Analyses reveal a persistent validation/test generalization gap of 9–13%, highlighting the limitations of proxy metrics and the need for more robust selection strategies.

The paper underscores the necessity of jointly optimizing search strategy, operator design, and evaluation protocol in developing future AI ML research agents (2507.02554).
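
A toy sketch of the search-policy view follows: candidate solutions are expanded by placeholder operators and selected via a noisy validation proxy, contrasting a greedy policy with a simple evolutionary one. All names and the fitness model are illustrative assumptions, not the AIRA or AIDE operator sets.

```python
# Toy search-policy sketch: "operators" mutate candidate solutions and the
# policy selects candidates using a noisy validation proxy (the source of the
# validation/test gap). All components are placeholders for illustration.
import random

random.seed(0)


def draft():
    return {"quality": random.gauss(0.5, 0.1)}


def improve(node):
    # Operator: propose a child whose true quality drifts from the parent's.
    return {"quality": node["quality"] + random.gauss(0.02, 0.05)}


def validation_score(node):
    # Proxy metric: a noisy view of true quality.
    return node["quality"] + random.gauss(0, 0.03)


def greedy_search(budget=20):
    best, best_val = draft(), None
    best_val = validation_score(best)
    for _ in range(budget):
        child = improve(best)                  # always expand the current best
        child_val = validation_score(child)
        if child_val > best_val:
            best, best_val = child, child_val
    return best_val


def evolutionary_search(budget=20, pop_size=5):
    population = [draft() for _ in range(pop_size)]
    for _ in range(budget // pop_size):
        ranked = sorted(population, key=validation_score, reverse=True)
        parents = ranked[: pop_size // 2 + 1]  # keep the fittest as parents
        children = [improve(random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(validation_score(n) for n in population)


print("greedy:      ", greedy_search())
print("evolutionary:", evolutionary_search())
```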

5. Implications, Limitations, and Future Research

MLE-bench and its companion platforms have clarified the capabilities and obstacles confronting autonomous ML agents:

  • Agentic Limitations: Even top-performing LLM-based agents routinely miss pipeline errors, struggle with long-horizon code dependencies, and overfit to validation metrics, indicating fundamental gaps in agentic reasoning and robustness.
  • Resource Utilization: Current strategies do not fully exploit available computational resources, suggesting room for innovation in distributed search, parallelism, and smarter resource scheduling (2410.07095, 2507.02554).
  • Contamination and Memorization: Empirical checks found little evidence that leaderboard performance is biased by pre-training on Kaggle competition content, yet this remains a concern as the field progresses (2410.07095, 2507.02554).
  • Reproducibility and Extensibility: Open-sourcing all benchmarks, scaffolds, and evaluations has fostered community-driven advances and ensured reproducibility, standardization, and shared challenges.

Promising future directions include:

  • Integrating live-retrieval modules and dynamic reasoning agents ("agentic operators") into core search loops (2507.02554).
  • Developing robust methods for bridging validation/test generalization gaps.
  • Scaling experiments to larger and more dynamic task archives to continually challenge agents and prevent overfitting or contamination.
  • Creating more interactive and real-time feedback environments for robust RL-driven agent training (2505.07782).
  • Formalizing evaluation metrics for multi-step, open-ended, and adversarial ML engineering scenarios.

6. Summary Table: Selected Agents and Benchmark Outcomes

| Agent / Method | Key Strategy | Medal Rate on MLE-bench (%) |
|---|---|---|
| AIDE + OpenAI o1-preview | Scaffolded LLM with iterative search/debug | ~16.9 (pass@1) (2410.07095) |
| MLE-STAR (Gemini-2.0-Flash) | External search + ablation-guided refinement + ensembling | 43.9 (2506.15692) |
| AIRA + MCTS/Evolutionary | Advanced operator set; MCTS or evolutionary search | 47.7 (Lite) (2507.02554) |
| AIDE (baseline) | Greedy search, static operator set | 25.8–39.6 |

*Performance rates are as reported under each paper's evaluation protocol and may differ depending on full vs. Lite subsets and model/hardware variations.

7. Broader Impact

MLE-bench has set a rigorous standard for benchmarking autonomous ML engineering, elevated the expectations for AI-driven productivity, and provided critical infrastructure for comparative research. By shifting the evaluation focus from algorithmic code snippets to holistic, real-world engineering deliverables, it bridges the gap between AI research progress and its tangible impact on ML practice, and shapes the broader discourse on AI preparedness and responsible scaling (2410.07095). Future research and competitive development are likely to focus on even richer agent architectures, dynamic task generation, long-horizon solution planning, and integration with live ML systems.