
FML-bench: Automated ML Research Benchmark

Updated 19 October 2025
  • FML-bench is a benchmark that rigorously evaluates automatic ML research agents by integrating realistic code scenarios and core ML research challenges.
  • It employs a unified evaluation framework with metrics for utility, diversity, academic contribution, computational cost, and step success rate in multi-stage research workflows.
  • Empirical results indicate that agents using broad, diverse exploration strategies outperform those with narrow, linear modification approaches in advancing research.

FML-bench is a benchmark designed for the rigorous evaluation of automatic ML research agents, with an explicit focus on measuring agents' capacity to contribute meaningfully to scientific progress on core machine learning research problems rather than merely executing engineering tasks. Developed to advance both the methodology and the practical impact of research-agent automation, FML-bench departs from application-centric scripts and narrow leaderboard tasks by embedding fundamental research challenges (from generalization and representation learning to privacy and causality) into realistic, code-centric evaluation scenarios. The benchmark features a unified ecosystem comprising baseline codebases, extensible integration with genuine GitHub repositories, and a multi-faceted evaluation framework centered on academic rigor and diversity of research exploration.

1. Motivation and Benchmark Design Principles

FML-bench was created in response to critical limitations of existing evaluation frameworks for automatic research agents. Prevailing benchmarks often prioritize engineering execution mechanics (e.g., data pipeline construction, hyperparameter optimization) or focus on application-oriented tasks and synthetic leaderboard settings, thereby neglecting dimensions central to academic ML research. This overemphasis leads to two main problems:

  • Academic Rigor: Agents are not evaluated on their ability to propose and empirically validate original scientific hypotheses or to solve open-ended, fundamental research questions.
  • Exploration Diversity and Scalability: Existing frameworks rarely assess an agent’s capacity to explore a broad hypothesis space in a scalable manner, and often constrain evaluations to narrowly scoped scripts rather than real-world, multi-stage research repositories.

FML-bench counteracts these limitations via:

  • Task selection focused on core ML research problems such as generalization, data efficiency, fairness, privacy, causality, and representation learning.
  • Integration of public GitHub codebases and the acceptance of multi-stage command pipelines to reflect authentic research workflows.
  • Emphasis on reducing the coding burden while protecting the integrity of evaluation scripts.

2. Unified Evaluation Framework and Metrics

A distinguishing feature of FML-bench is its unified, iterative evaluation process. Here, each automatic research agent operates in a cyclic research workflow: beginning with a codebase, the agent generates a hypothesis, implements it as a code modification, empirically tests the result, and uses performance feedback to guide additional, potentially diverse, modifications.
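The loop can be pictured with the following minimal sketch in Python. Everything here is illustrative: `propose_hypothesis`, `apply_modification`, and `run_evaluation` are hypothetical placeholders for an agent's LLM-driven components and the benchmark's evaluation harness, not part of FML-bench's published interface.

```python
# Minimal sketch of the cyclic research workflow described above.
# The three callables are hypothetical placeholders, not FML-bench APIs.

def research_loop(codebase, budget, propose_hypothesis, apply_modification, run_evaluation):
    history = []                                   # (hypothesis, delta) per iteration
    baseline = run_evaluation(codebase)            # starting performance of the codebase
    for _ in range(budget):
        hypothesis = propose_hypothesis(codebase, history)    # generate a new research idea
        candidate = apply_modification(codebase, hypothesis)  # implement it as a code change
        score = run_evaluation(candidate)                     # empirical test
        history.append((hypothesis, score - baseline))        # feedback guides later steps
        if score > baseline:                                  # keep modifications that help
            codebase, baseline = candidate, score
    return codebase, history
```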

The evaluation employs five complementary metrics:

| Metric | Definition / Goal | Formula / Details |
|---|---|---|
| Utility | Task-specific performance improvement per modification | $U(m, C) = \operatorname{perf}(C \oplus m) - \operatorname{perf}(C)$ |
| Diversity | Breadth of hypotheses/solutions explored over time (semantic/structural modification space) | $D(\mathcal{H})$, measured over the hypothesis set $\mathcal{H}$ |
| Academic Contribution Rate | Fraction of scientific/methodological improvements (vs. engineering tweaks) | Mean rate across iterations |
| Cost | Aggregate computational/API/time resources used per improvement step | $P_t$ (resources used at step $t$) |
| Step Success Rate | Fraction of agent iterations that yield valid, bug-free, executable results | $S(\mathcal{M}, C_1)$ |
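As a rough illustration of how two of these metrics could be computed, the sketch below expresses utility as the performance delta of a modification and uses a pairwise-distance proxy for diversity over hypothesis embeddings. The embedding-based definition of $D(\mathcal{H})$ is an assumption made for illustration; the benchmark's exact formulation may differ.

```python
import itertools
import numpy as np

def utility(perf_modified: float, perf_baseline: float) -> float:
    # U(m, C) = perf(C ⊕ m) - perf(C): task-specific gain from one modification.
    return perf_modified - perf_baseline

def diversity(hypothesis_embeddings: np.ndarray) -> float:
    # Assumed proxy for D(H): mean pairwise cosine distance between embeddings
    # of the hypotheses explored so far (higher means broader exploration).
    if len(hypothesis_embeddings) < 2:
        return 0.0
    dists = []
    for i, j in itertools.combinations(range(len(hypothesis_embeddings)), 2):
        a, b = hypothesis_embeddings[i], hypothesis_embeddings[j]
        cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        dists.append(1.0 - cos)
    return float(np.mean(dists))
```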

The overall research-agent objective is formalized as:

$$\max_{T,\ \{q_t, m_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} \big[\, U_t + \lambda A_t - \eta P_t \,\big] \;+\; \gamma\, S(\mathcal{M}, C_1) \;+\; \beta\, D(\mathcal{H})$$

subject to the constraints $D(\mathcal{H}) \geq \delta$, average $A_t \geq \alpha$, $S \geq \rho$, and $\sum_t P_t \leq B$, which enforce the required exploration breadth, academic contribution, step reliability, and resource budget.
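A minimal sketch of how this objective and its constraints could be scored for a completed run is given below; the per-step lists, weights, and thresholds are treated as given inputs, and the simple feasibility check is an illustrative assumption rather than the benchmark's actual enforcement mechanism.

```python
def research_objective(U, A, P, S, D, lam, eta, gamma, beta):
    """Compute sum_t [U_t + lam*A_t - eta*P_t] + gamma*S + beta*D for one run.

    U, A, P are per-step lists (utility, academic contribution, cost);
    S is the step success rate and D the diversity of the hypothesis set.
    """
    per_step = sum(u + lam * a - eta * p for u, a, p in zip(U, A, P))
    return per_step + gamma * S + beta * D

def satisfies_constraints(A, P, S, D, alpha, rho, delta, budget):
    # D(H) >= delta, mean A_t >= alpha, S >= rho, sum_t P_t <= budget.
    return (D >= delta and sum(A) / len(A) >= alpha
            and S >= rho and sum(P) <= budget)
```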

3. Task Diversity and Realistic ML Research Scenarios

FML-bench includes eight diverse, foundational machine learning tasks that expose agents to key methodological fronts of modern ML research. These span problems such as:

  • Generalization under limited supervision
  • Few-shot learning and transfer
  • Robustness to distributional shift or data noise
  • Fairness and bias mitigation
  • Causal effect estimation and intervention
  • Representation/feature learning
  • Privacy-preserving modeling
  • Data and model efficiency challenges

Each task is delivered as a real codebase with fully specified evaluation commands, appropriate input pipelines, and read-only protection for evaluation files. The system accepts fully general command-lists (not restricted to single scripts), enabling complex experimental designs as encountered in authentic research contexts.
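The sketch below shows how one such task might be described as data: a repository, a multi-stage command list, and read-only evaluation files. The field names and repository URL are purely illustrative assumptions, not FML-bench's actual configuration schema.

```python
# Hypothetical task specification; keys and values are illustrative only
# and do not reflect FML-bench's real schema or repositories.
fairness_task = {
    "repo": "https://github.com/example/fair-classifier",    # placeholder URL
    "commands": [                                  # multi-stage pipeline, not a single script
        "python prepare_data.py --split train",
        "python train.py --config configs/base.yaml",
        "python evaluate.py --metric demographic_parity",
    ],
    "read_only": ["evaluate.py", "data/test_labels.csv"],     # protected evaluation files
    "metric_key": "demographic_parity_gap",                   # value parsed from run output
}
```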

4. Empirical Agent Comparison and Findings

FML-bench has been used to benchmark state-of-the-art automatic research agents. Notably, the benchmark has revealed strong empirical trends regarding research exploration strategies:

  • Agents employing broad exploration—generating a wide set of diverse hypotheses and code modifications in parallel—consistently outperform agents focusing on narrow, depth-first refinement.
  • For example, agents such as TheAIScientist (driven by Gemini-2.5-Pro) achieve higher utility and academic contribution rates than agents implementing primarily single-threaded or linear modification strategies (e.g., many CLI-style code assistants).
  • There is a measurable positive correlation between diversity in the semantic content of code modifications and achieved task performance, strongly advocating for breadth in automated research exploration.

5. Technical Interface and Integration with Research Workflows

Recognizing the heterogeneous nature of real-world ML repositories, FML-bench provides a unified input/output interface:

  • Codebases are accepted in their native form, regardless of repository structure.
  • Evaluation is executed via a command-list protocol; multi-stage scripts, evaluation pipelines, and task documentation are all supported.
  • Post-processing modules standardize diverse output formats (such as accuracy, AUC, error) into the evaluation framework.
  • Evaluation files are protected read-only, ensuring no tampering of ground-truth metrics.
  • Constraints and metric computations are enforced via formally defined prompts and LaTeX formulas for maximal reproducibility and precision.

This design facilitates extension to new research codebases and automates translation of outputs from arbitrary software stacks into the common framework.
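A minimal sketch of the kind of post-processing that maps heterogeneous run outputs onto a single standardized score is shown below; the regular expressions and metric names are assumptions, since real repositories report results in many different formats.

```python
import re
from typing import Optional

# Assumed patterns for common result formats (accuracy, AUC, error rate);
# the benchmark's actual post-processing is likely more elaborate.
_PATTERNS = {
    "accuracy": re.compile(r"accuracy[:=]\s*([0-9.]+)", re.IGNORECASE),
    "auc": re.compile(r"\bauc[:=]\s*([0-9.]+)", re.IGNORECASE),
    "error": re.compile(r"\berror(?:\s*rate)?[:=]\s*([0-9.]+)", re.IGNORECASE),
}

def standardize_output(stdout: str) -> Optional[float]:
    """Parse raw run output into a single 'higher is better' score."""
    for name, pattern in _PATTERNS.items():
        match = pattern.search(stdout)
        if match:
            value = float(match.group(1))
            return 1.0 - value if name == "error" else value   # flip error to a gain
    return None
```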

6. Broader Impact and Future Directions

FML-bench is positioned to fundamentally advance automatic machine learning research by formalizing and automating the evaluation of research agent capabilities beyond engineering execution. Its foundational task suite and rigorous metric design support both upper and lower bounds on autonomous research performance and guide the development of fully automated research assistants.

Potential future directions include:

  • Expanding the benchmark to encompass emerging research areas and more varied data modalities.
  • Investigating further dimensions of agent performance, such as long-term knowledge accumulation, meta-learning, and adaptability across domains.
  • Leveraging the benchmark’s extensible protocol for continuous, community-wide evaluation and iteration.

By supporting continuous, scalable evaluation rooted in academic rigor and authentic research workflows, FML-bench addresses open challenges in automating machine learning research and sets a precedent for principled, extensible benchmarking in this rapidly evolving field (Zou et al., 12 Oct 2025).
