
MLE-bench Evaluation Suite

Updated 21 November 2025
  • MLE-bench Evaluation Suite is a benchmarking framework that systematically assesses the machine learning engineering capabilities of autonomous agents using a curated set of Kaggle competitions.
  • It enforces reproducible protocols with defined task complexities, tiered medal-based metrics, and strict hardware and runtime restrictions to ensure fair comparisons.
  • The framework integrates advanced search policies and operator scaffolding, including Monte Carlo Tree Search and evolutionary search, to drive progress in automated ML engineering.

MLE-bench Evaluation Suite establishes a rigorous and large-scale benchmarking framework to systematically assess the ML engineering capabilities of autonomous agents, particularly those powered by LLMs. Drawing from real‐world Kaggle competitions, MLE-bench evaluates agents across diverse ML tasks, imposing reproducible protocols, hardware constraints, and human-relevant metrics. Its design integrates competitive leaderboard-based success criteria, robust operator/scaffold abstractions for agent code generation, and detailed evaluation practices, positioning it as a critical testbed for progress in automated ML engineering (Chan et al., 9 Oct 2024, Toledo et al., 3 Jul 2025).

1. Benchmark Construction and Task Design

MLE-bench comprises a curated corpus of 75 completed Kaggle competitions spanning a spectrum of ML engineering challenges, with an additional 7 held out for development purposes (Chan et al., 9 Oct 2024). The suite stratifies tasks by:

  • Problem Category: Image classification, NLP, time-series forecasting, tabular regression, segmentation, signal processing, and multimodal learning.
  • Complexity Level:
    • Low (22/75): Solvable in <2 hours by an expert (excluding model training).
    • Medium (38/75): 2–10 hours.
    • High (15/75): >10 hours.

Each competition is framed as a supervised learning problem. Formally, for task $k \in T = \{1, \dots, K\}$, the agent is given an input space $X_k \subseteq \mathbb{R}^{d_k}$, an output space $Y_k$, and a dataset split $D_k = (D^{\text{train}}_k, D^{\text{val}}_k, D^{\text{test}}_k)$. The task's performance metric $m_k(\hat{Y}, Y)$ is chosen to match the original Kaggle problem setting and is normalized to $[0, 1]$ where possible (Toledo et al., 3 Jul 2025).
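
For concreteness, this formalization can be captured in a small task container. The sketch below is illustrative only; the field names, metric signature, and normalization bounds are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

import numpy as np


@dataclass
class TaskSpec:
    """One competition k: its data splits plus the original Kaggle metric."""
    name: str
    train: Tuple[np.ndarray, np.ndarray]   # (X_train, y_train) drawn from X_k x Y_k
    val: Tuple[np.ndarray, np.ndarray]     # available to the agent during search
    test: Tuple[np.ndarray, np.ndarray]    # held out for final grading only
    metric: Callable[[np.ndarray, np.ndarray], float]  # m_k(y_hat, y)
    higher_is_better: bool = True

    def normalized_score(self, y_hat: np.ndarray, y: np.ndarray,
                         worst: float, best: float) -> float:
        """Map the raw metric onto [0, 1] where possible (1 = best achievable)."""
        raw = self.metric(y_hat, y)
        return float(np.clip((raw - worst) / (best - worst), 0.0, 1.0))
```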

Agents interact with these tasks through code execution and experiment automation—reading raw files in various formats, preprocessing data, designing models, hyperparameter tuning, and handling long-running scripts with robust error management.

2. Evaluation Metrics and Success Criteria

MLE-bench evaluation adopts a medal-based system mirroring Kaggle leaderboards. For each competition, bronze, silver, and gold thresholds ($\tau_k^{\text{bronze}} \leq \tau_k^{\text{silver}} \leq \tau_k^{\text{gold}}$) are established based on private leaderboard ranks using a tiered lookup scheme (Chan et al., 9 Oct 2024). The core performance indicator is the "any-medal" rate:

$$\mathrm{AnyMedalRate} = \frac{1}{K}\sum_{k=1}^K I_k(g),$$

where $I_k(g) = \mathbf{1}[f(g) \geq \tau_k^{\text{bronze}}]$ and $f(g)$ denotes the metric value on $D^{\text{test}}_k$ (Toledo et al., 3 Jul 2025).
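
The rank cutoffs behind these score thresholds come from a tiered lookup on each competition's private leaderboard. As a hedged illustration, the sketch below approximates Kaggle's published medal table; the exact thresholds used by MLE-bench are those in its open-source grading code, and $\tau_k$ is then the score attained at the corresponding rank.

```python
def medal_rank_thresholds(n_teams: int) -> dict:
    """Approximate Kaggle-style medal cutoffs (leaderboard rank, 1 = best).

    These tiers mirror Kaggle's public progression rules and are an
    illustrative assumption, not the benchmark's exact grading logic.
    """
    if n_teams < 250:
        gold = max(1, round(0.10 * n_teams)) if n_teams < 100 else 10
        silver = max(1, round(0.20 * n_teams))
        bronze = max(1, round(0.40 * n_teams))
    elif n_teams < 1000:
        gold = 10 + round(0.002 * n_teams)
        silver, bronze = 50, 100
    else:
        gold = 10 + round(0.002 * n_teams)
        silver = round(0.05 * n_teams)
        bronze = round(0.10 * n_teams)
    return {"gold": gold, "silver": silver, "bronze": bronze}


# A submission earns at least bronze if its private-leaderboard rank is within
# the bronze cutoff, i.e. rank <= medal_rank_thresholds(n_teams)["bronze"].
```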

Multiple independent runs per task (seeds) are performed to estimate the probability of medalling. The task-level success rate is

$$\mathrm{SR}_k = \frac{1}{N} \sum_{i=1}^N I_k(g_i),$$

and the aggregate success rate is

$$\mathrm{SR} = \frac{1}{K} \sum_{k=1}^K \mathrm{SR}_k.$$

To quantify agent robustness, the pass@k metric is employed:

$$\mathrm{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},$$

where $c$ is the count of medalling runs out of $n$ (Chan et al., 9 Oct 2024). Confidence intervals are computed via stratified bootstrapping over tasks and seeds (Toledo et al., 3 Jul 2025).
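
Given a grid of per-seed medal indicators, these aggregates reduce to a few lines of NumPy. The sketch below is a minimal rendering; variable names and the exact bootstrap stratification are assumptions, with one plausible scheme (resampling tasks, then seeds within each task) shown for the confidence intervals.

```python
from math import comb
from typing import Tuple

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k): probability that at least one of k
    runs sampled from n (of which c medalled) earns a medal."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def success_rates(medalled: np.ndarray) -> Tuple[np.ndarray, float]:
    """medalled: boolean array of shape (K tasks, N seeds) holding I_k(g_i)."""
    sr_k = medalled.mean(axis=1)      # SR_k = (1/N) * sum_i I_k(g_i)
    return sr_k, float(sr_k.mean())   # SR   = (1/K) * sum_k SR_k


def stratified_bootstrap_ci(medalled: np.ndarray, n_boot: int = 1000,
                            alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """One plausible stratified scheme: resample tasks, then seeds per task."""
    rng = np.random.default_rng(seed)
    K, N = medalled.shape
    stats = []
    for _ in range(n_boot):
        task_idx = rng.integers(0, K, size=K)        # resample tasks with replacement
        seed_idx = rng.integers(0, N, size=(K, N))   # resample seeds within each task
        resampled = medalled[task_idx[:, None], seed_idx]
        stats.append(resampled.mean(axis=1).mean())
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Example: 3 tasks x 4 seeds of medal outcomes.
runs = np.array([[1, 0, 1, 0],
                 [0, 0, 0, 1],
                 [1, 1, 1, 1]], dtype=bool)
sr_k, sr = success_rates(runs)
p_at_2 = [pass_at_k(n=4, c=int(row.sum()), k=2) for row in runs]
ci_lo, ci_hi = stratified_bootstrap_ci(runs)
```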

3. Agent Scaffolding, Search Policies, and Operator Design

MLE-bench evaluates agent architectures that automate iterative ML solution development by formalizing them as search policies over the solution space. Each candidate solution ("artifact") is represented as a node $v$ in a directed search graph $G_t = (V_t, E_t)$, with edges corresponding to operator-induced transformations (Toledo et al., 3 Jul 2025). The framework is parametrized by:

  • $F$: Fitness function (validation-set performance via 5-fold cross-validation),
  • $\pi_{\text{sel}}$: Node selection policy,
  • $O = \{o_\ell\}$: Operator set (e.g., Draft, Debug, Improve, Memory, Crossover),
  • $\pi_{\text{op}}$: Operator selection policy,
  • $\tau$: Termination criterion (time/node budget).
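
These five components define a generic search loop over the artifact graph. The skeleton below is a schematic rendering of that parametrization; the class and function signatures are illustrative assumptions, not the framework's actual interfaces.

```python
import time
from typing import Callable, Dict, List, Optional


class Node:
    """One candidate solution (artifact) in the search graph G_t = (V_t, E_t)."""
    def __init__(self, code: str, parent: Optional["Node"] = None):
        self.code = code
        self.parent = parent
        self.children: List["Node"] = []
        self.fitness: Optional[float] = None   # F(v): validation / CV score


def run_search(
    fitness: Callable[[Node], float],                        # F
    select_node: Callable[[List[Node]], Node],               # pi_sel
    operators: Dict[str, Callable[[Optional[Node]], Node]],  # O = {Draft, Debug, Improve, ...}
    select_operator: Callable[[Node], str],                  # pi_op
    time_budget_s: float,                                    # tau: wall-clock budget
) -> Node:
    """Schematic artifact-search loop; returns the best node found within budget."""
    root = operators["Draft"](None)               # initial draft artifact
    root.fitness = fitness(root)
    nodes = [root]
    start = time.time()
    while time.time() - start < time_budget_s:    # termination criterion tau
        parent = select_node(nodes)               # pi_sel picks a node to expand
        op_name = select_operator(parent)         # pi_op picks Draft / Debug / Improve / ...
        child = operators[op_name](parent)        # operator-induced edge (parent -> child)
        child.fitness = fitness(child)            # evaluate F on held-out validation data
        parent.children.append(child)
        nodes.append(child)
    return max(nodes, key=lambda v: v.fitness if v.fitness is not None else float("-inf"))
```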

Three principal search policies are instantiated:

  • Greedy (AIDE): Always selects the highest-fitness node, applies Draft until initial drafts exist, then Improve, falling back to Debug.
  • Monte Carlo Tree Search (MCTS): UCT-guided selection (see the sketch after this list),

$$h_{\text{UCT}}(v \mid u) = Q(v) + c \sqrt{\frac{\log N(u)}{N(v) + \epsilon}},$$

with leaf nodes evaluated by $F$, and value estimation and backpropagation following standard MCTS conventions.

  • Evolutionary Search: Fitness-proportional parent selection, with reproduction via Improve or Crossover, and debugging applied as needed; offspring replace lowest-fitness individuals.
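
The UCT rule above trades off a node's estimated value $Q(v)$ against an exploration bonus based on visit counts. A minimal sketch of that selection rule follows, assuming each child node exposes its value and visit count; the exploration constant below is a generic default, not the paper's setting.

```python
import math


def uct_score(q_v: float, n_parent: int, n_v: int,
              c: float = 1.4, eps: float = 1e-6) -> float:
    """h_UCT(v | u) = Q(v) + c * sqrt(log N(u) / (N(v) + eps))."""
    return q_v + c * math.sqrt(math.log(n_parent) / (n_v + eps))


def select_child(children, n_parent: int):
    """Pick the child of u maximizing the UCT score; unvisited children
    (N(v) = 0) receive a very large exploration bonus via the eps term."""
    return max(children, key=lambda v: uct_score(v.value, n_parent, v.visits))
```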

Operator sets are critical: O_AIDE (baseline) and O_AIRA (enhanced) are compared. Notable O_AIRA features include dynamic prompt complexity cues, scoped memory, and "think tokens" for structured reasoning (Toledo et al., 3 Jul 2025).

4. Experimental Protocol and Infrastructure

MLE-bench enforces strict reproducibility and compute constraints:

  • Dataset access: Only train and validation data are available during search. The test set is used solely for final evaluation.
  • Search execution: Agent code is run in isolated Apptainer (OCI) containers, with access to a full ML stack superimage. Each run is limited to 24 hours wall time per task, with code snippets capped at 4 hours runtime.
  • Hardware per sandbox: 1×H200 GPU, 24 CPUs, 100 GB RAM, 1 TB local storage (Toledo et al., 3 Jul 2025).
  • LLM access: Self-hosted or rate-limited API services.
  • Result reporting: Main analyses are based on ≥10 seeds per task (20 preferred) with stratified bootstrap CIs documented.
  • Frequent checkpointing and infrastructure logs enable fault tolerance (mean time to failure ≈1000 h).
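
The per-snippet runtime cap can be enforced by executing agent-generated code as a subprocess with a hard timeout inside the sandbox. The sketch below illustrates that pattern under stated assumptions; it is not the benchmark's actual execution harness.

```python
import subprocess
from typing import Tuple

SNIPPET_TIMEOUT_S = 4 * 60 * 60      # 4-hour cap on any single code snippet


def run_snippet(path: str) -> Tuple[int, str]:
    """Execute one agent-written script inside the sandbox, killing it at the cap."""
    try:
        proc = subprocess.run(
            ["python", path],
            capture_output=True,
            text=True,
            timeout=SNIPPET_TIMEOUT_S,   # enforces the per-snippet runtime limit
        )
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return -1, f"{path} exceeded the {SNIPPET_TIMEOUT_S}s runtime cap and was killed."
```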

The benchmark supports multiple agent scaffolds (AIDE, MLAB, OpenHands) and LLMs (DeepSeek R1, OpenAI o3, OpenAI o1-preview, GPT-4o, Claude-3.5, Llama-3.1-405B). Agents receive a standardized ~700-token system prompt with competition meta-data and all required resource paths (Chan et al., 9 Oct 2024).

5. Main Results, Analyses, and Insights

Performance on MLE-bench is summarized by the "any-medal" rate. Notable results include:

  • AIDE + o1-preview: 16.9% ± 1.1 points (pass@1)
  • AIDE + GPT-4o: 8.7% ± 0.5 points (pass@1)
  • Performance can double with pass@k: o1-preview reaches ~34% for pass@6.
  • Scaling runtime to 100 hours confers incremental gains but rapidly plateaus (Chan et al., 9 Oct 2024).

Key observations:

  • Operator Bottleneck: With baseline operators (O_AIDE), search policy choice is largely ineffectual—operator expressivity is the limiting factor. Enhanced operators (O_AIRA) with strong search strategies yield substantial performance improvements, raising the medalling rate from 39.6% to 47.7% on MLE-bench lite (Toledo et al., 3 Jul 2025).
  • Generalization Gap: Agents frequently overfit to validation metrics. Oracle selection based on test metrics exposes a 9–13% gap; selecting the top $k = 3$ nodes by validation score and reporting the best of these closes most of this gap.
  • Variance: Large numbers of seeds are essential to avoid misleading rankings; fewer than 5 seeds per task can yield unstable results.
  • Time-Dependence: Agent rankings evolve over the 24-hour window; non-greedy policies converge and occasionally surpass greedy strategies only after ~10–19 hours (Toledo et al., 3 Jul 2025).
  • Hardware Scaling: No clear advantage is observed for additional GPU resources, as CPU-only settings achieve comparable medal rates.

Contamination and plagiarism analyses indicate near-zero correlation between LLM familiarity with competition pages and agent performance, and no substantial code overlap with top public notebooks (Chan et al., 9 Oct 2024).

6. Open Source Artifacts and Reproducibility Guidelines

The entire MLE-bench suite, including datasets, grading scripts, agent scaffolding code, and evaluation harnesses, is open-sourced (https://github.com/openai/mle-bench) (Chan et al., 9 Oct 2024). The repository includes:

  • /competitions/: Competition data loaders and grading code.
  • /agents/: Reference agents and scaffolds.
  • /scripts/: Utilities for split preparation, orchestration, and leaderboard snapshotting.
  • /eval/: Scoring harness, medal thresholds, and pass@k calculation.
  • Complete logs, seed repetitions, and code to detect rule violations/plagiarism.

To ensure reproducibility, all random seeds, container images, hardware specs, and LLM versions are catalogued in the infra/CONFIG.md file, with the medal-threshold logic accessible in eval/medals.py.

7. Contextualization and Extensions

Situating MLE-bench within the broader ML benchmarking landscape, PMLB provides an instructive comparison (Olson et al., 2017). PMLB emphasizes systematic dataset curation, meta-feature profiling (instance/feature counts, class imbalance, etc.), and standardized cross-validated evaluation pipelines. MLE-bench mirrors and extends these principles while elevating the focus to autonomous ML engineering on real-world, heterogeneous tasks. Meta-feature-aware dataset selection, version-controlled and fully transparent infrastructure, and the integration of pass@k and anytime metrics collectively advance best practices for benchmarking complex agent behavior.

This suggests that future directions for MLE-bench may include simulation of missingness, expansion to regression and structured-data tasks, synthetic benchmarks targeting undercovered regions in the meta-feature space, and community-driven extension of the benchmark corpus—as outlined in best practices distilled from both MLE-bench and PMLB experience (Olson et al., 2017, Chan et al., 9 Oct 2024).
