MLE-bench Evaluation Suite
- MLE-bench Evaluation Suite is a benchmarking framework that systematically assesses the machine learning engineering capabilities of autonomous agents using a curated set of Kaggle competitions.
- It enforces reproducible protocols with defined task complexities, tiered medal-based metrics, and strict hardware and runtime restrictions to ensure fair comparisons.
- The framework integrates advanced search policies such as Monte Carlo Tree Search and evolutionary search, together with operator scaffolding and detailed evaluation practices, to drive automated ML engineering progress.
MLE-bench Evaluation Suite establishes a rigorous and large-scale benchmarking framework to systematically assess the ML engineering capabilities of autonomous agents, particularly those powered by LLMs. Drawing from real‐world Kaggle competitions, MLE-bench evaluates agents across diverse ML tasks, imposing reproducible protocols, hardware constraints, and human-relevant metrics. Its design integrates competitive leaderboard-based success criteria, robust operator/scaffold abstractions for agent code generation, and detailed evaluation practices, positioning it as a critical testbed for progress in automated ML engineering (Chan et al., 9 Oct 2024, Toledo et al., 3 Jul 2025).
1. Benchmark Construction and Task Design
MLE-bench comprises a curated corpus of 75 completed Kaggle competitions spanning a spectrum of ML engineering challenges, with an additional 7 held out for development purposes (Chan et al., 9 Oct 2024). The suite stratifies tasks by:
- Problem Category: Image classification, NLP, time-series forecasting, tabular regression, segmentation, signal processing, and multimodal learning.
- Complexity Level:
- Low (22/75): Solvable in <2 hours by an expert (excluding model training).
- Medium (38/75): 2–10 hours.
- High (15/75): >10 hours.
Each competition is framed as a supervised learning problem. Formally, for a task $t$, the agent is given an input space $\mathcal{X}_t$, an output space $\mathcal{Y}_t$, and a dataset split $D_t = (D_t^{\text{train}}, D_t^{\text{val}}, D_t^{\text{test}})$. The task's performance metric $m_t$ is chosen to match the original Kaggle problem setting and normalized to $[0,1]$ where possible (Toledo et al., 3 Jul 2025).
Agents interact with these tasks through code execution and experiment automation: reading raw files in various formats, preprocessing data, designing models, tuning hyperparameters, and handling long-running scripts with robust error management.
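The task abstraction above can be pictured as a small task record. The sketch below is a hypothetical illustration only; the `MLETask` class, its fields, and the `grade` method are assumptions for exposition, not part of the released harness.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
import pandas as pd

@dataclass
class MLETask:
    """Hypothetical view of a single MLE-bench competition task."""
    task_id: str
    train_dir: Path           # raw training data in arbitrary formats
    val_dir: Path             # held-out validation split, available during search
    test_dir: Path            # used only for final grading
    metric: Callable[[pd.DataFrame, pd.DataFrame], float]   # Kaggle metric, assumed higher-is-better
    bronze_threshold: float   # medal cutoff derived from the private leaderboard

    def grade(self, submission: pd.DataFrame, answers: pd.DataFrame) -> dict:
        """Score a submission against the test answers and check the medal criterion."""
        score = self.metric(submission, answers)
        return {"score": score, "any_medal": score >= self.bronze_threshold}
```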
2. Evaluation Metrics and Success Criteria
MLE-bench evaluation adopts a medal-based system mirroring Kaggle leaderboards. For each competition, bronze, silver, and gold thresholds $(\tau_{\text{bronze}}, \tau_{\text{silver}}, \tau_{\text{gold}})$ are established from private leaderboard ranks using a tiered lookup scheme (Chan et al., 9 Oct 2024). The core performance indicator is the "any-medal" rate:

$$\text{medal}(r) = \mathbb{1}\left[\, m_t(r) \ge \tau_{\text{bronze}} \,\right],$$

where $r$ denotes a single agent run and $m_t(r)$ denotes the metric value on $D_t^{\text{test}}$ (Toledo et al., 3 Jul 2025).

Multiple independent runs per task (seeds) are performed to estimate the probability of medalling. With $S$ seeded runs $r_{t,1}, \dots, r_{t,S}$ on task $t$, the task-level success rate is

$$\hat{p}_t = \frac{1}{S} \sum_{s=1}^{S} \text{medal}(r_{t,s}),$$

and the aggregate success rate is

$$\text{SuccessRate} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \hat{p}_t.$$

To quantify agent robustness, the pass@k metric is employed:

$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},$$

where $c$ is the count of medalling runs out of $n$ total runs (Chan et al., 9 Oct 2024). Confidence intervals are computed via stratified bootstrapping over tasks and seeds (Toledo et al., 3 Jul 2025).
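These quantities combine as in the following sketch; the function and variable names are illustrative rather than taken from the benchmark's grading code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given c medalling runs out of n total runs."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def success_rates(medal_flags_by_task: dict[str, list[bool]]) -> tuple[dict[str, float], float]:
    """Per-task medal rates and their unweighted mean across tasks."""
    per_task = {t: sum(flags) / len(flags) for t, flags in medal_flags_by_task.items()}
    aggregate = sum(per_task.values()) / len(per_task)
    return per_task, aggregate

# Example: three tasks, each with several seeded runs (True = medalled).
runs = {"task_a": [True, False, True], "task_b": [False] * 4, "task_c": [True] * 2}
print(success_rates(runs))
print(pass_at_k(n=10, c=3, k=6))
```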
3. Agent Scaffolding, Search Policies, and Operator Design
MLE-bench evaluates agent architectures that automate iterative ML solution development by formalizing them as search policies over the solution space. Each candidate solution ("artifact") is represented as a node in a directed search graph $G = (V, E)$, with edges corresponding to operator-induced transformations (Toledo et al., 3 Jul 2025). The framework is parametrized by the following components (a minimal search-loop sketch follows this list):
- $f$: Fitness function (validation-set performance, 5-fold cross-validation),
- $\pi_{\text{sel}}$: Node selection policy,
- $\mathcal{O}$: Operator set (e.g., Draft, Debug, Improve, Memory, Crossover),
- $\pi_{\text{op}}$: Operator selection policy,
- $h$: Termination criterion (time/node budget).
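Under this parametrization, a search policy reduces to a small loop over the graph. The sketch below is a simplified illustration; the `Node` dataclass and the `run_search`, `select`, and `choose_operator` names are assumptions for exposition, not the framework's API.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field

@dataclass
class Node:
    """One candidate solution ("artifact") in the search graph."""
    code: str
    fitness: float | None = None      # validation score assigned by the fitness function f
    parent: Node | None = None
    children: list[Node] = field(default_factory=list)

def run_search(f, select, choose_operator, operators, budget_s: float) -> Node:
    """Generic loop parametrized by (f, node selection, operator set, operator selection, budget)."""
    root = Node(code="")                            # empty artifact; Draft bootstraps from it
    nodes = [root]
    deadline = time.time() + budget_s               # termination criterion: wall-clock budget
    while time.time() < deadline:
        parent = select(nodes)                      # node selection policy
        op = choose_operator(parent, operators)     # operator selection policy
        child = Node(code=op(parent), parent=parent)
        child.fitness = f(child)                    # evaluate on the validation split
        parent.children.append(child)
        nodes.append(child)
    scored = [n for n in nodes if n.fitness is not None]
    return max(scored, key=lambda n: n.fitness) if scored else root
```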
Three principal search policies are instantiated:
- Greedy (AIDE): Always selects the highest-fitness node, applies Draft until initial drafts exist, then Improve, falling back to Debug.
- Monte Carlo Tree Search (MCTS): UCT-guided selection (see the selection sketch after this list),
  $$\mathrm{UCT}(v) = \bar{f}(v) + c\,\sqrt{\frac{\ln N(\mathrm{parent}(v))}{N(v)}},$$
  with leaf nodes evaluated by the fitness function $f$; value estimation and backpropagation follow standard MCTS conventions.
- Evolutionary Search: Fitness-proportional parent selection, with reproduction via Improve or Crossover, and debugging applied as needed; offspring replace lowest-fitness individuals.
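As one concrete instance of the node selection policy, the UCT rule can be implemented as below, assuming each node additionally tracks `visits` and `total_fitness` counters updated during backpropagation; these names and the exploration constant are illustrative assumptions, not the framework's implementation.

```python
import math

def uct_select(nodes, c: float = 1.41):
    """Select the node maximizing mean fitness plus a UCT exploration bonus."""
    total_visits = sum(n.visits for n in nodes)

    def uct(node) -> float:
        if node.visits == 0:
            return float("inf")                      # expand unvisited nodes first
        parent_visits = node.parent.visits if node.parent else total_visits
        exploit = node.total_fitness / node.visits   # mean fitness of the subtree rooted at node
        explore = c * math.sqrt(math.log(parent_visits) / node.visits)
        return exploit + explore

    return max(nodes, key=uct)
```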
Operator sets are critical: O_AIDE (baseline) and O_AIRA (enhanced) are compared. Notable O_AIRA features include dynamic prompt complexity cues, scoped memory, and "think tokens" for structured reasoning (Toledo et al., 3 Jul 2025).
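One way to picture how an enhanced operator differs is in how its LLM prompt is assembled. The sketch below of a hypothetical Improve-style prompt builder merely paraphrases the O_AIRA features named above (complexity cue, scoped memory, think tokens); it does not reproduce the actual prompt templates.

```python
def build_improve_prompt(node, memory: list[str], complexity_cue: str) -> str:
    """Assemble an Improve-operator prompt in the spirit of O_AIRA (illustrative only)."""
    scoped_memory = "\n".join(memory[-5:])   # carry over only recent, relevant findings
    return (
        f"Task complexity: {complexity_cue}\n"          # dynamic prompt complexity cue
        f"Relevant prior findings:\n{scoped_memory}\n"  # scoped memory
        "<think>Reason step by step about the weakest part of the pipeline.</think>\n"  # think tokens
        "Current solution:\n"
        f"{node.code}\n"
        "Propose a concrete improvement and return the full revised script."
    )
```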
4. Experimental Protocol and Infrastructure
MLE-bench enforces strict reproducibility and compute constraints:
- Dataset access: Only train and validation data are available during search. The test set is used solely for final evaluation.
- Search execution: Agent code is run in isolated Apptainer (OCI) containers, with access to a full ML stack superimage. Each run is limited to 24 hours wall time per task, with code snippets capped at 4 hours runtime.
- Hardware per sandbox: 1×H200 GPU, 24 CPUs, 100 GB RAM, 1 TB local storage (Toledo et al., 3 Jul 2025).
- LLM access: Self-hosted or rate-limited API services.
- Result reporting: Main analyses are based on ≥10 seeds per task (20 preferred), with stratified bootstrap CIs documented (a bootstrap sketch follows this list).
- Frequent checkpointing and infrastructure logs enable fault tolerance (mean time to failure ≈1000 h).
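The stratified bootstrap referenced above can be sketched as resampling seeds within each task and averaging the per-task rates. The function below is an illustrative implementation under that assumption, not the benchmark's reporting code.

```python
import random

def stratified_bootstrap_ci(medal_flags_by_task: dict[str, list[bool]],
                            n_boot: int = 10_000, alpha: float = 0.05) -> tuple[float, float]:
    """Bootstrap CI for the aggregate medal rate, resampling seeds within each task."""
    estimates = []
    for _ in range(n_boot):
        per_task = []
        for flags in medal_flags_by_task.values():
            resampled = [random.choice(flags) for _ in flags]   # resample seeds within the task
            per_task.append(sum(resampled) / len(resampled))
        estimates.append(sum(per_task) / len(per_task))         # unweighted mean over tasks
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```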
The benchmark supports multiple agent scaffolds (AIDE, MLAB, OpenHands) and LLMs (DeepSeek R1, OpenAI o3, OpenAI o1-preview, GPT-4o, Claude-3.5, Llama-3.1-405B). Agents receive a standardized ~700-token system prompt with competition meta-data and all required resource paths (Chan et al., 9 Oct 2024).
5. Main Results, Analyses, and Insights
Performance on MLE-bench is summarized by the "any-medal" rate. Notable results include:
- AIDE + o1-preview: 16.9% ± 1.1 points (pass@1)
- AIDE + GPT-4o: 8.7% ± 0.5 points (pass@1)
- Performance can double with pass@k: o1-preview reaches ~34% for pass@6.
- Scaling runtime to 100 hours confers incremental gains but rapidly plateaus (Chan et al., 9 Oct 2024).
Key observations:
- Operator Bottleneck: With baseline operators (O_AIDE), search policy choice is largely ineffectual—operator expressivity is the limiting factor. Enhanced operators (O_AIRA) with strong search strategies yield substantial performance improvements, raising the medalling rate from 39.6% to 47.7% on MLE-bench lite (Toledo et al., 3 Jul 2025).
- Generalization Gap: Agents frequently overfit to validation metrics. Oracle selection based on test metrics exposes a 9–13% gap; selecting a small set of top-scoring nodes by validation and reporting the best of them closes most of this gap.
- Variance: Large numbers of seeds are essential to avoid misleading rankings; fewer than 5 seeds per task can yield unstable results.
- Time-Dependence: Agent rankings evolve over the 24-hour window; non-greedy policies converge and occasionally surpass greedy strategies only after 10–19 hours (Toledo et al., 3 Jul 2025).
- Hardware Scaling: No clear advantage is observed for additional GPU resources, as CPU-only settings achieve comparable medal rates.
Contamination and plagiarism analyses indicate near-zero correlation between LLM familiarity with competition pages and agent performance, and no substantial code overlap with top public notebooks (Chan et al., 9 Oct 2024).
6. Open Source Artifacts and Reproducibility Guidelines
The entire MLE-bench suite, including datasets, grading scripts, agent scaffolding code, and evaluation harnesses, is open-sourced (https://github.com/openai/mle-bench) (Chan et al., 9 Oct 2024). The repository includes:
- /competitions/: Competition data loaders and grading code.
- /agents/: Reference agents and scaffolds.
- /scripts/: Utilities for split preparation, orchestration, and leaderboard snapshotting.
- /eval/: Scoring harness, medal thresholds, and pass@k calculation.
- Complete logs, seed repetitions, and code to detect rule violations/plagiarism.
To ensure reproducibility, all random seeds, container images, hardware specs, and LLM versions are catalogued in the infra/CONFIG.md file, with the medal-threshold logic accessible in eval/medals.py.
7. Contextualization and Extensions
Situating MLE-bench within the broader ML benchmarking landscape, PMLB provides an instructive comparison (Olson et al., 2017). PMLB emphasizes systematic dataset curation, meta-feature profiling (instance/feature counts, class imbalance, etc.), and standardized cross-validated evaluation pipelines. MLE-bench mirrors and extends these principles while elevating the focus to autonomous ML engineering on real-world, heterogeneous tasks. Its meta-feature-aware dataset selection, version-controlled and fully transparent infrastructure, and integration of pass@k and anytime metrics collectively advance best practices for benchmarking complex agent behavior.
This suggests that future directions for MLE-bench may include simulation of missingness, expansion to regression and structured-data tasks, synthetic benchmarks targeting undercovered regions in the meta-feature space, and community-driven extension of the benchmark corpus—as outlined in best practices distilled from both MLE-bench and PMLB experience (Olson et al., 2017, Chan et al., 9 Oct 2024).