
MLE-bench-30: Autonomous ML Benchmark

Updated 2 April 2026
  • MLE-bench-30 is a benchmark evaluating ML engineering capabilities of autonomous AI agents using 30 diverse Kaggle tasks.
  • It standardizes end-to-end ML tasks—including data preparation, model development, experimentation, and submission—with formal evaluation criteria adapted from Kaggle leaderboards.
  • The benchmark offers practical insights into optimizing operator design, search policies, and reproducibility practices for advanced autonomous ML systems.

MLE-bench-30 is a rigorously curated benchmark designed to assess the machine learning engineering (MLE) capabilities of autonomous AI agents. It comprises a 30-task subset constructed from real-world Kaggle competitions, reflecting substantial diversity in modalities, domains, data regimes, and problem types. By mandating end-to-end engineering—including data preparation, model development, experimentation, and submission under resource and time constraints—and evaluating agents against human leaderboards, MLE-bench-30 provides a challenging and reproducible yardstick for automated ML systems (Chan et al., 2024, Toledo et al., 3 Jul 2025).

1. Design and Construction of MLE-bench-30

MLE-bench-30 is derived from the broader MLE-bench suite, which originally spans 75 Kaggle competitions. The source pool began with 5,673 completed competitions from the Meta-Kaggle dataset; community contests, competitions lacking documentation, and those infeasible to evaluate locally were removed. Further manual filtering targeted relevance to modern ML engineering, dataset accessibility, precision of task specification, and reproducibility of evaluation code.

The 30-task subset was selected to maximize diversity across key axes:

  • Problem types: binary/multi-class classification, regression, time-series forecasting, and segmentation;
  • Data modalities: tabular (10 tasks), vision (8), NLP (6), audio/time-series (6);
  • Dataset scales: ranging from tiny (144 samples) to huge (up to 55 million rows);
  • Domains: spanning healthcare, geospatial analysis, audio signal processing, materials science, and recommendation systems, among others.

A balanced split by estimated complexity—10 low, 15 medium, 5 high (as measured by expert time-to-solution estimates)—was maintained. The benchmark ensures that bronze, silver, and gold medal thresholds, adapted from actual Kaggle leaderboards, span a meaningful spectrum of task difficulty.

2. Task Specifications and Modalities

Each of the 30 benchmark tasks is accompanied by:

  • A formal statement of the problem and its category (e.g., "image–binary classification");
  • Dataset provenance and detailed preprocessing directives;
  • Explicit input/output (I/O) formats typical of Kaggle, such as CSV submissions with specified column names;
  • A task-specific, formally defined evaluation metric.
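
As an illustration, the sketch below shows one plausible way to represent such a task specification in code; the field names and the example values are illustrative and do not reproduce the repository's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    """Hypothetical schema for a single MLE-bench-30 task specification.

    Field names are illustrative; the benchmark stores this information
    in its open-source competition configs.
    """
    task_id: str                  # e.g. "T1"
    competition: str              # Kaggle competition slug
    category: str                 # e.g. "image-binary-classification"
    metric: str                   # e.g. "accuracy", "rmse", "roc_auc_macro"
    higher_is_better: bool        # direction of the metric
    submission_columns: list = field(default_factory=list)  # required CSV columns

# Example instance loosely mirroring task T1 from the table below
# (column names are illustrative).
t1 = TaskSpec(
    task_id="T1",
    competition="aerial-cactus-identification",
    category="image-binary-classification",
    metric="accuracy",
    higher_is_better=True,
    submission_columns=["id", "has_cactus"],
)
print(t1)
```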

Representative example tasks include:

| ID | Competition Name | Category | #Samples | Metric |
|----|------------------|----------|----------|--------|
| T1 | aerial-cactus-identification | Image–BinClass | 21,500 | Accuracy |
| T6 | new-york-city-taxi-fare-prediction | Tabular–Regression | 55M (train) | RMSE |
| T7 | ventilator-pressure-prediction | TimeSeries–Regression | 10M | MAE |
| T4 | jigsaw-toxic-comment-classification | Text–MultiLabel | 312,735 | ROC AUC (macro) |
| T30 | stanford-covid-vaccine | Tabular–Regression | 2,400 | Log-MAE |

Complete task details, including training/test splits, domain, and selected metrics, are documented in the open-source repository. Metrics are provided in formal notation (e.g., $\mathrm{RMSE}$, $\mathrm{AUC}$), ensuring that evaluation is both precise and reproducible.
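
As a concrete illustration, the sketch below shows how such metrics might be computed locally from a CSV submission; the helper names are hypothetical and the grading logic is simplified relative to the benchmark's actual scripts (it assumes a shared `id` column and a single target column).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, as used for the regression tasks."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def macro_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-averaged ROC AUC, as used for multi-label tasks such as T4."""
    return float(roc_auc_score(y_true, y_score, average="macro"))

def grade_submission(submission_csv: str, answers_csv: str, target: str) -> float:
    """Illustrative local grading step: align a submission with the hidden
    answers on a shared 'id' column and score it (here with RMSE)."""
    sub = pd.read_csv(submission_csv)
    ans = pd.read_csv(answers_csv)
    merged = ans.merge(sub, on="id", suffixes=("_true", "_pred"))
    return rmse(merged[f"{target}_true"].to_numpy(),
                merged[f"{target}_pred"].to_numpy())
```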

3. Formal Evaluation Protocol and “Kaggle Medal” Criteria

Each agent submission for a task is evaluated using the officially published Kaggle private leaderboard for that competition:

  • Score computation replicates Kaggle’s grading pipeline locally.
  • Medal allocation uses percentile-based thresholds, where $T$ denotes the number of teams on the competition's private leaderboard:
    • Gold: score ranks in the top 10% (for $T < 100$ teams) or within the top $10 + 0.2\%$ of teams (for $T \ge 250$),
    • Silver: top 20% (for $T < 250$) or top 5% (for $T \ge 1000$),
    • Bronze: top 40% or top 10%, with exact cutoffs as per Kaggle conventions.

Formally, for task $t$ with test-set score $f^{t}_{\text{test}}$ on the submitted artifact and thresholds $b^{t}, s^{t}, g^{t}$ for bronze, silver, and gold (assuming higher scores are better; comparisons are flipped for error-style metrics):

$$
\text{medal}^{t} =
\begin{cases}
\text{gold} & \text{if } f^{t}_{\text{test}} \ge g^{t}, \\
\text{silver} & \text{if } s^{t} \le f^{t}_{\text{test}} < g^{t}, \\
\text{bronze} & \text{if } b^{t} \le f^{t}_{\text{test}} < s^{t}, \\
\text{none} & \text{otherwise.}
\end{cases}
$$
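
A minimal sketch of this medal assignment, assuming the per-task thresholds have already been extracted from the leaderboard; the function and argument names are illustrative, not the grading scripts' actual API.

```python
def assign_medal(score: float,
                 bronze: float, silver: float, gold: float,
                 higher_is_better: bool = True) -> str:
    """Map a test-set score to a medal tier given per-task thresholds.

    Mirrors the case definition above; for error-style metrics (RMSE, MAE)
    the comparisons are flipped via `higher_is_better=False`.
    """
    beats = (lambda x, thr: x >= thr) if higher_is_better else (lambda x, thr: x <= thr)
    if beats(score, gold):
        return "gold"
    if beats(score, silver):
        return "silver"
    if beats(score, bronze):
        return "bronze"
    return "none"

# Example: accuracy 0.91 against thresholds bronze=0.85, silver=0.90, gold=0.95.
print(assign_medal(0.91, bronze=0.85, silver=0.90, gold=0.95))  # -> "silver"
```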

For the entire 30-task suite, headline aggregation metrics include:

  • Any-medal rate: $\frac{1}{30}\sum_{t=1}^{30}\mathbb{1}\left[\text{medal}^{t} \neq \text{none}\right]$
  • Silver-or-above rate: $\frac{1}{30}\sum_{t=1}^{30}\mathbb{1}\left[\text{medal}^{t} \in \{\text{silver}, \text{gold}\}\right]$
  • Gold rate: $\frac{1}{30}\sum_{t=1}^{30}\mathbb{1}\left[\text{medal}^{t} = \text{gold}\right]$

All evaluations are aggregated over multiple random seeds per task, with confidence intervals estimated by stratified bootstrapping.
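
The sketch below shows one plausible implementation of these aggregates and of a stratified bootstrap over seeds, assuming results are stored as a mapping from task ID to per-seed medal outcomes and that per-task means over seeds are averaged uniformly across tasks; the benchmark's exact aggregation may differ in detail.

```python
import numpy as np

def medal_rates(medals: dict[str, list[str]]) -> dict[str, float]:
    """Headline rates over a {task_id: [medal per seed]} mapping.

    Each task contributes its per-seed mean, and tasks are then averaged
    uniformly, matching the per-task formulas above.
    """
    def rate(pred) -> float:
        return float(np.mean([np.mean([pred(m) for m in seeds])
                              for seeds in medals.values()]))
    return {
        "any_medal": rate(lambda m: m != "none"),
        "silver_or_above": rate(lambda m: m in ("silver", "gold")),
        "gold": rate(lambda m: m == "gold"),
    }

def stratified_bootstrap_ci(medals: dict[str, list[str]],
                            key: str = "any_medal",
                            n_boot: int = 1000,
                            alpha: float = 0.05,
                            seed: int = 0) -> tuple[float, float]:
    """Resample seeds within each task (the strata) and report a percentile CI."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        resampled = {t: list(rng.choice(seeds, size=len(seeds), replace=True))
                     for t, seeds in medals.items()}
        stats.append(medal_rates(resampled)[key])
    return (float(np.quantile(stats, alpha / 2)),
            float(np.quantile(stats, 1 - alpha / 2)))
```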

4. Agent Architectures, Operator Sets, and Search Policies

MLE-bench-30 is agnostic to agent implementation but has been used to evaluate various LLM-based and search-policy-based agents. The benchmark provides a formal environment for graph-based search algorithms defined by:

  • Search Graph $G = (V, E)$: $V$ is the set of explored ML artifacts (e.g., notebooks, code snapshots), $E$ records parent-child relationships arising via operators.
  • Iteration Process:
  1. Selection: choose a node $v \in V$ via a selection policy $\pi_{\text{sel}}$.
  2. Operator: choose an operator $o \in \mathcal{O}$ via an operator policy $\pi_{\text{op}}$.
  3. Expansion: generate a child node $v' = o(v)$, attached to $v$.
  4. Fitness: evaluate $f(v')$ as a proxy score (usually 5-fold cross-validation on the training split).
  5. Repeat under resource constraints.
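
The loop below is a minimal, self-contained sketch of this iteration process; the `select`, `operators`, and `evaluate` callables are placeholders for an agent's actual selection policy, operator set, and proxy scorer, and the operator policy is simplified here to a uniform random choice.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """One explored ML artifact (e.g., a code snapshot) in the search graph."""
    artifact: str
    fitness: float = float("-inf")
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def search(root_artifact: str,
           select,       # policy: list of nodes -> node to expand
           operators,    # list of callables: artifact -> child artifact
           evaluate,     # proxy fitness, e.g. 5-fold CV on the training split
           budget: int = 20) -> Node:
    """Generic graph-search loop sketched from the iteration process above."""
    root = Node(artifact=root_artifact, fitness=evaluate(root_artifact))
    nodes = [root]
    for _ in range(budget):
        parent = select(nodes)                  # 1. Selection
        op = random.choice(operators)           # 2. Operator (policy simplified)
        child_artifact = op(parent.artifact)    # 3. Expansion
        child = Node(child_artifact, evaluate(child_artifact), parent=parent)
        parent.children.append(child)           # 4. Fitness stored on the node
        nodes.append(child)
    return max(nodes, key=lambda n: n.fitness)  # best node under the proxy score
```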

Supported operator sets include:

  • AIDE: {Draft, Debug, Improve, Memory}
  • AIRA (augmented): adds prompt-adaptive complexity cues, scoped memory, chain-of-thought (“think-tokens”), and a Crossover operator.

Operator quality is a critical performance bottleneck: advanced search policies (e.g., MCTS, evolutionary search) only outperform greedy policies when operators can reliably generate diverse and valid child artifacts.

Typical search policies include greedy best-first (AIDE), Monte Carlo Tree Search, and evolutionary algorithms, each with tuning for robustness and exploration.
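
For illustration, the snippet below sketches a greedy best-first selection rule of the kind attributed to AIDE, plus an epsilon-greedy variant as a simple stand-in for more exploration-heavy policies; both plug into the `select` slot of the search loop sketched above and are not the agents' exact implementations.

```python
import random

def greedy_select(nodes):
    """Greedy best-first selection: always expand the node with the
    highest proxy fitness found so far."""
    return max(nodes, key=lambda n: n.fitness)

def epsilon_greedy_select(nodes, epsilon: float = 0.2):
    """Illustrative exploration-leaning variant: occasionally expand a
    random node instead of the current best."""
    return random.choice(nodes) if random.random() < epsilon else greedy_select(nodes)
```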

5. Experimental Results and Analyses

Evaluation with frontier LLMs, using AIDE scaffolding, demonstrates the difficulty of the benchmark. Key empirical findings on MLE-bench-30 (mean ± SE over 36 seeds) include:

| Model | Bronze | Silver | Gold | Any Medal |
|-------|--------|--------|------|-----------|
| o1-preview + AIDE | 3.4% | 4.1% | 9.4% | 16.9% |
| GPT-4o + AIDE | 1.6% | 2.2% | 5.0% | 8.7% |
| Llama-3.1 405B + AIDE | 0.0% | 1.3% | 1.7% | 3.0% |
| Claude-3.5 Sonnet | 0.9% | 2.2% | 4.4% | 7.6% |

Pass@k scaling experiments show that the probability of earning any medal increases from 16.9% at pass@1 to approximately 34.1% at pass@6 for o1-preview + AIDE, reflecting the value of repeated attempts.
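
One common way to compute such pass@k numbers is the standard unbiased estimator over n independent runs; the sketch below applies it to per-task medal outcomes. The counts in the example are made up for illustration, and per-task values would then be averaged across the suite.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k attempts
    drawn (without replacement) from n independent runs succeeds, given that
    c of the n runs earned a medal."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: with 36 seeds and 6 medal-earning runs on a task,
# pass@1 is about 0.167 and pass@6 is substantially higher.
print(pass_at_k(36, 6, 1), pass_at_k(36, 6, 6))
```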

Resource scaling shows modest gains; increasing compute from CPU-only to 2 GPUs raises any-medal rate from 8.7% to 10.2%. Time scaling (extending task duration from 24h to 100h) improves medal rate from 8.7% to 11.8%.

Plagiarism and contamination analyses (probing LLM familiarity with competition pages, re-running with obfuscated task descriptions, and applying automated plagiarism checks) detect no significant contamination or plagiarism: no submission exhibited more than 60% similarity to prior top notebooks.
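
The benchmark's exact plagiarism tooling is not reproduced here; as a rough illustration of the kind of similarity check involved, the sketch below computes a token n-gram Jaccard similarity between two code files, with 0.6 as the flagging threshold mentioned above.

```python
def ngram_jaccard(code_a: str, code_b: str, n: int = 5) -> float:
    """Token n-gram Jaccard similarity between two code files; a simple
    stand-in for an automated plagiarism check, not the benchmark's tool."""
    def ngrams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}
    a, b = ngrams(code_a), ngrams(code_b)
    return len(a & b) / max(len(a | b), 1)

# A submission would be flagged if similarity to any prior top notebook exceeded 0.6.
```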

6. Reproducibility, Infrastructure, and Best Practices

Reproducibility is maintained by open-sourcing all competition specifications, data splits, grading scripts, and agent wrappers. Execution occurs in containerized environments (Docker) that enforce resource and network constraints: 1 NVIDIA H200 GPU, 24 CPUs, 100 GB RAM, 1 TB scratch disk, no internet access except for package/model downloads, and a 24-hour wall-clock budget per task.

The canonical reproduction workflow:

  1. Clone the MLE-bench repository and install Docker.
  2. Build the container, select the 30-task subset via a benchmark script flag.
  3. Configure API keys and compute resources.
  4. Launch agents specifying model, time budget, seeds, and task subset.
  5. Results—medal counts and raw scores—are logged per task/seed for further analysis.
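
As an illustration of step 5, the sketch below aggregates a hypothetical per-task/per-seed results log into per-task medal counts and rates; the column names are assumptions, not the repository's actual log format.

```python
import pandas as pd

def summarize_runs(results_csv: str) -> pd.DataFrame:
    """Aggregate a hypothetical results log with columns
    (task_id, seed, medal, raw_score) into per-task summaries."""
    df = pd.read_csv(results_csv)
    df["any_medal"] = df["medal"] != "none"
    summary = df.groupby("task_id").agg(
        seeds=("seed", "nunique"),
        any_medal_rate=("any_medal", "mean"),
        gold_count=("medal", lambda m: (m == "gold").sum()),
        mean_score=("raw_score", "mean"),
    )
    return summary.reset_index()
```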

Best practice recommendations include stratified bootstrapping with at least 10 seeds/task, clear reporting of all medal-tier aggregation metrics, and infrastructure for job isolation and checkpoint recovery. These are necessary for comparability and robustness in benchmarking agentic ML systems (Chan et al., 2024, Toledo et al., 3 Jul 2025).

7. Research Insights and Recommendations

Key insights from systematic evaluation on MLE-bench-30 are:

  • Operator set quality governs performance: Upgrading operator diversity and reliability (as in AIRA over AIDE) allows search policies such as MCTS and evolutionary algorithms to yield significant gains.
  • Joint design of search and operators is necessary: Exploration-heavy search gives benefit only if operators can furnish both diversity and validity; enhancements like scoped memory and explicit complexity cues counter early mode collapse.
  • Addressing the generalization gap: Agents optimize validation metrics but are scored on hidden test sets; a simple strategy of selecting the final submission from among the top-$k$ validation nodes closes up to 75% of the test/validation generalization gap (see the sketch after this list).
  • Medal-based metrics yield real-world comparability: Mapping performance to Kaggle medals contextualizes agent proficiency relative to expert human teams.
  • Infrastructure for isolation and reproducibility is critical: Containerized execution, self-hosted LLMs, and checkpointing are necessary for fair, scalable evaluation.
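
A minimal sketch of the top-k final-selection idea from the third bullet, reusing the `Node` objects from the search-loop sketch in Section 4; the uniform choice among the top k and the default k are illustrative, and the agents' actual selection rule may differ.

```python
import numpy as np

def pick_final_submission(nodes, k: int = 5, rng=None):
    """Instead of trusting the single best validation score, restrict to the
    k best nodes and pick among them (here uniformly at random), reducing
    overfitting to the validation split."""
    rng = rng or np.random.default_rng(0)
    top_k = sorted(nodes, key=lambda n: n.fitness, reverse=True)[:k]
    return top_k[int(rng.integers(len(top_k)))]
```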

A plausible implication is that advances in operator design and search policy co-optimization are prerequisite for substantial agentic gains on difficult MLE benchmarks. The comprehensiveness and granularity of MLE-bench-30 provide a robust platform for future research in fully autonomous ML engineering (Chan et al., 2024, Toledo et al., 3 Jul 2025).
