InnovatorBench: Autonomous AI Research Benchmark
- InnovatorBench is a standardized benchmark–platform pair designed to evaluate AI agents' ability to autonomously perform end-to-end LLM research.
- It spans multi-stage workflows across data construction, filtering, augmentation, loss/reward design, and scaffold construction with rigorous evaluation criteria.
- The platform leverages ResearchGym’s advanced tooling and human-in-the-loop refinements to mitigate errors and enhance performance in realistic, long-horizon research tasks.
InnovatorBench is a standardized benchmark-platform pair designed to rigorously evaluate the ability of AI agents—particularly those powered by LLMs—to autonomously conduct innovative, end-to-end LLM research. It addresses the limitations of prior benchmarks that probe only narrow facets of research automation, such as code correctness or experiment reproduction in simplified, single-stage settings. By spanning multi-stage, long-horizon workflows grounded in realistic tooling, InnovatorBench measures whether agents can formulate hypotheses, curate data, design and implement algorithms, execute large-scale experiments, and analyze uncertainty under real resource constraints.
1. Scope and Structure of the Benchmark
InnovatorBench comprises 20 distinct research tasks organized into six key domains:
- Data Construction (DC, 4 tasks): Creating new datasets from raw sources.
- Data Filtering (DF, 3 tasks): Selecting high-quality samples from noisy or oversized datasets.
- Data Augmentation (DA, 5 tasks): Synthesizing new training data via LLM-driven transformations.
- Loss Function Design (LD, 3 tasks): Modifying or creating objective functions for model training.
- Reward Function Design (RD, 2 tasks): Implementing custom reward signals for RL benchmarks.
- Scaffold Construction (SC, 3 tasks): Designing workflows for modular or multimodal research pipelines.
Each task emulates a complete AI research cycle (hypothesis generation, code editing, tool invocation, distributed training, result analysis) and typically requires 10–30 hours of active agent engagement on the ResearchGym infrastructure. Tasks leverage a discrete action space with 42 primitive tool calls spanning filesystem manipulations, shell/cluster commands, parsing, web search/browsing, and asynchronous orchestration. The underlying MDP formulation is $(\mathcal{S}, \mathcal{A}, \mathcal{R})$, where:
- $\mathcal{S}$ is the set of all possible environment snapshots (e.g., filesystem state, logs, and derived metrics),
- $\mathcal{A}$ is the set of high-level tool invocations, and
- $\mathcal{R}$ denotes a domain-specific, outcome-driven reward computed by a Kaggle-style scoring function.
Each agent trajectory routinely runs to 300–500 steps, demanding sustained cross-stage memory and adaptive planning.
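To make this structure concrete, the following minimal Python sketch models an environment step over grouped tool families; the class names, tool names, and stub reward logic are illustrative assumptions, not the ResearchGym implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class ToolFamily(Enum):
    """Families partitioning the 42 primitive actions (see Section 4)."""
    COMMAND = "command"        # shell / cluster commands
    FILE = "file"              # filesystem manipulation
    PARSE = "parse"            # parsing structured outputs
    WEB_SEARCH = "web_search"
    WEB_BROWSE = "web_browse"

@dataclass
class Action:
    """A high-level tool invocation a in A (names are illustrative)."""
    family: ToolFamily
    name: str
    args: dict[str, Any] = field(default_factory=dict)

@dataclass
class Snapshot:
    """An environment state s in S: filesystem view, logs, derived metrics."""
    files: dict[str, str] = field(default_factory=dict)
    logs: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)

def step(state: Snapshot, action: Action) -> tuple[Snapshot, float]:
    """One MDP transition: execute a tool call and observe the next snapshot.

    The reward is outcome-driven and sparse: non-zero only when a submission
    is scored by the task's evaluation script (stubbed here as a stored value).
    """
    state.logs.append(f"{action.family.value}:{action.name}")
    reward = state.metrics.get("submission_score", 0.0) if action.name == "submit" else 0.0
    return state, reward

# Illustrative rollout: edit a file, then submit for evaluation.
s = Snapshot(metrics={"submission_score": 42.0})
s, r0 = step(s, Action(ToolFamily.FILE, "edit_file", {"path": "train.py"}))
s, r1 = step(s, Action(ToolFamily.COMMAND, "submit"))
print(r0, r1)  # 0.0 42.0
```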
2. Task Definitions and Success Criteria
Task specifications formalize the research objective, the input workspace (conda environment, partial codebase, datasets), optional hints (enabling 80% of the baseline score at a reduced innovation penalty), and complete evaluation scripts. Key input-output-success mappings for each category are as follows:
- Data Construction: Inputs are raw corpora and starter code; outputs are formatted datasets with fine-tuning scripts. Success is measured by improved model metrics (e.g., accuracy or BLEU) over baseline.
- Data Filtering: Inputs are noisy datasets; outputs are selected high-quality subsets. Success requires improved downstream accuracy/F1 under controlled finetuning.
- Data Augmentation: Inputs are training data; outputs include synthesized examples via LLM transformations. Success demands statistically significant model improvement.
- Loss Design: Inputs are training loops and losses; outputs are modified code implementing novel loss terms. Success requires stabilization (e.g., gradient collapse avoidance) and improved accuracy.
- Reward Design: Inputs are RL frameworks and preference scores; outputs are custom reward implementations. Success is demonstrated by improved exact-match/preference accuracy.
- Scaffold Construction: Inputs are specifications for multi-step or multimodal pipelines; outputs are robust, error-free workflow scripts satisfying downstream correctness checks and metrics.
Agents are evaluated on their ability to autonomously produce runnable artifacts, and their scores are anchored between the baseline solution and the reference solution, without penalizing novel approaches.
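As an illustration of how such a specification might be laid out, the sketch below shows a hypothetical Data Filtering task spec; all field names and values are assumptions, not the benchmark's actual schema.

```python
# Hypothetical task specification for a Data Filtering (DF) task.
# Field names and values are illustrative only.
task_spec = {
    "task_id": "DF-example",
    "domain": "data_filtering",
    "objective": "Select a high-quality subset of the noisy corpus that "
                 "maximizes downstream accuracy after controlled fine-tuning.",
    "workspace": {
        "conda_env": "df_env",              # pre-built environment
        "codebase": "starter/",             # partial training/eval code
        "datasets": ["data/noisy_train.jsonl"],
    },
    "hints": "optional; using them reduces the innovation credit",
    "evaluation": {
        "script": "eval/run_eval.sh",       # complete evaluation script
        "metric": "accuracy",
        "baseline": "anchors the low end of the normalized score",
        "reference": "anchors the high end of the normalized score",
        "max_submissions": 4,               # Kaggle-style submission budget
    },
    "time_budget_hours": 24,                # tasks typically run 10-30 hours
}
```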
3. Evaluation Protocol and Metrics
InnovatorBench employs a Kaggle-style evaluation, permitting up to four submissions per task with instant feedback. Each submission's raw task metric is normalized against the baseline and reference solutions to yield the task score. For RL subproblems, policy uncertainty is additionally quantified via the average entropy of token logits, which is folded into a combined score. Weighted averages across the six domains, proportional to their task counts, provide an aggregate score for each agent. Human-in-the-loop protocols (as in Apollo) and agent ablations are compared on the final submission score $S_{\text{final}}$ and the maximum achieved score $S_{\text{best}}$.
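The original scoring and entropy formulas are not reproduced here; the following LaTeX sketch gives plausible forms consistent with the description above (a baseline/reference-anchored score and an average token entropy), and should be read as an assumption rather than the benchmark's official definition.

```latex
% Assumed, illustrative forms only -- not the benchmark's official definitions.
% Baseline/reference-anchored task score for a submission with raw metric m:
S_{\text{task}} = 100 \cdot \operatorname{clip}\!\left(
    \frac{m - m_{\text{baseline}}}{m_{\text{reference}} - m_{\text{baseline}}},\; 0,\; 1\right)

% Average entropy of the policy's token distributions over a trajectory of length T:
\bar{H} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v \in \mathcal{V}} p_t(v) \log p_t(v)
```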
Experimental Protocol
All tasks are executed in Ubuntu 22.04 Dockerized environments with up to 800 GB of RAM and servers hosting 8 × 80 GB GPUs. Internet access is restricted for individual task categories, reflecting realistic research operation.
4. ResearchGym: The Underlying Evaluation Environment
ResearchGym provides sophisticated tooling for agent interaction and distributed workflow execution:
- Primitive Actions: 42 types partitioned into Command, File, Parse, Web Search, and Web Browse families.
- Multi-Machine Orchestration: Agents can dynamically allocate compute resources across CPU/GPU nodes using HTTP job control.
- Asynchronous Execution: Agents can launch background commands, periodically poll job status, retrieve intermediate results, and suspend/resume workflows via snapshot manipulation.
- State Management: Snapshotting preserves workspace integrity, task specifications, and time budgets for resumption or branching exploration.
- Long-Horizon Support: Agents routinely orchestrate multi-hour training/inference cycles, interleaving planning and observation with asynchronous waiting (e.g., `sleep 3600s`).
This infrastructure enables authentic emulation of real-world LLM research, incorporating latency, compute contention, and error handling.
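A minimal sketch of the launch-poll-collect pattern described above follows; the HTTP endpoint, job fields, and helper names are hypothetical stand-ins rather than ResearchGym's actual interface.

```python
import time
import requests  # assumed available in the agent's environment

CONTROLLER = "http://gpu-node-1:8000"  # hypothetical job-control endpoint

def launch_training(script: str, gpus: int) -> str:
    """Submit a background training job to a compute node; returns a job id."""
    resp = requests.post(f"{CONTROLLER}/jobs",
                         json={"command": f"bash {script}", "gpus": gpus})
    resp.raise_for_status()
    return resp.json()["job_id"]

def poll_until_done(job_id: str, interval_s: int = 3600) -> dict:
    """Periodically check job status instead of blocking on the process.

    Mirrors the strategic-waiting behavior (e.g., sleep 3600s) rather than
    terminating long-running jobs prematurely.
    """
    while True:
        status = requests.get(f"{CONTROLLER}/jobs/{job_id}").json()
        if status["state"] in ("succeeded", "failed"):
            return status
        # Retrieve intermediate results (e.g., latest eval metrics) if present.
        print("intermediate metrics:", status.get("metrics", {}))
        time.sleep(interval_s)

if __name__ == "__main__":
    job = launch_training("train.sh", gpus=8)
    final = poll_until_done(job)
    print("job finished with state:", final["state"])
```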
5. Agent Architectures and Benchmark Findings
The authors deploy a lightweight ReAct-style wrapper for frontier LLMs, combining explicit reasoning (“think” steps) with actionable tool calls. To address context length constraints, agents periodically generate XML-stylized <state_snapshot> entries via summarization prompts, maintaining coherence across lengthy trajectories.
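A minimal sketch of this summarization step, assuming a hypothetical `llm_call` helper and token counter (neither is the authors' implementation), might look like:

```python
SUMMARY_PROMPT = (
    "Summarize the trajectory so far into a <state_snapshot> covering: "
    "current hypothesis, files edited, running jobs, key metrics, next steps."
)

def maybe_summarize(history: list[str], llm_call, count_tokens, max_tokens: int) -> list[str]:
    """Replace old turns with a compact <state_snapshot> when context grows too long.

    `llm_call` and `count_tokens` are assumed callables (model query and tokenizer);
    the snapshot preserves cross-stage memory while freeing context for new steps.
    """
    if sum(count_tokens(turn) for turn in history) <= max_tokens:
        return history
    summary = llm_call(SUMMARY_PROMPT + "\n\n" + "\n".join(history[:-5]))
    snapshot = f"<state_snapshot>\n{summary}\n</state_snapshot>"
    # Keep the snapshot plus the most recent turns verbatim.
    return [snapshot] + history[-5:]
```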
Four agent backbones are evaluated:
- Claude Sonnet 4
- GPT-5
- GLM-4.5
- Kimi-K2
All use a 5K token context window, with four evaluation attempts per task. Apollo, as introduced in “Interaction as Intelligence Part II,” integrates asynchronous human guidance and action-level data filtering, permitting annotator interventions only upon agent drift, and employing masking to suppress error propagation.
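The action-level masking idea can be sketched as follows: tokens belonging to actions flagged as low quality are excluded from the fine-tuning loss so that errors are not reinforced. The tensor layout and flagging mechanism below are illustrative assumptions, not Apollo's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_action_loss(logits: torch.Tensor,
                       targets: torch.Tensor,
                       action_quality_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over trajectory tokens, skipping low-quality actions.

    logits:  (seq_len, vocab_size) model outputs for the agent trajectory
    targets: (seq_len,) ground-truth token ids
    action_quality_mask: (seq_len,) 1.0 for tokens of actions kept by the
        annotator/filter, 0.0 for tokens of flagged low-quality actions.
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    masked = per_token * action_quality_mask
    # Normalize by the number of kept tokens to avoid scale drift.
    return masked.sum() / action_quality_mask.sum().clamp(min=1.0)

# Illustrative usage with random tensors.
logits = torch.randn(6, 100)
targets = torch.randint(0, 100, (6,))
mask = torch.tensor([1., 1., 0., 0., 1., 1.])  # middle action was flagged
print(masked_action_loss(logits, targets, mask))
```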
Results Overview
A condensed table of scores (across all 20 tasks):
| Model Variant | S_final | S_best |
|---|---|---|
| GLM-4.5 baseline | 11.85 | 13.35 |
| w/o interaction | 12.66 | 12.97 |
| w/o masking | 18.46 | 18.61 |
| Apollo | 21.86 | 24.01 |
Domain-level gains with Apollo:
- Data Filtering: +681% (5.16 → 40.32)
- Data Construction: +78%
- Loss Design: +182%
- Scaffold Construction: +100%
Apollo delivers an 84.5% improvement over the untrained baseline and a 72.7% improvement over non-interactive fine-tuning on $S_{\text{final}}$ (versus the 50% and 28% improvement claims reported for domain-averaged subsets).
6. Qualitative Insights, Failure Modes, and Implications
InnovatorBench reveals multiple agent deficiencies and partial solutions:
- Impatience and Resource Mismanagement: Non-interactive agents abort long jobs or misallocate GPU resources, leading to diminished scores; human-in-the-loop guidance mitigates premature job termination and encourages strategic waiting (e.g., extended sleep commands).
- Feedback Utilization: Apollo iteratively refines thresholds based on real evaluation feedback; baselines often rely on unreliable proxy metrics.
- Error Propagation: Action masking in Apollo prevents recurrence of low-quality actions, such as unnecessary library switches.
- Template-Based Reasoning: Frontier models often adopt template-driven rationales, which impedes genuine methodological innovation.
These observations underscore the benchmark’s difficulty and highlight the importance of:
- End-to-end, multi-stage pipeline management under delayed and sparse feedback
- Context summarization for long trajectories
- Lightweight, strategic human interventions
- Tool grounding and adaptive planning in error-prone environments
7. Limitations and Future Directions
InnovatorBench currently encompasses 20 tasks across six domains but omits interdisciplinary and real-laboratory workflows. Agent generalization remains limited: variance across models is substantial, and robust meta-planning is needed. Human-AI collaboration is not yet formalized within the benchmark protocol itself, though mixed-initiative protocols and optimal intervention budgeting are recommended extensions.
The authors suggest several future directions:
- Expanding metrics to intermediate step success and error frequencies
- Multi-agent collaborative benchmarks for parallel specialization
- Inclusion of scientific domains outside current NLP/ML focus
- Automated rule-based failure correction for self-supervised improvement
A plausible implication is that InnovatorBench, paired with ResearchGym, constitutes a new standard for robust, long-horizon, tool-grounded assessment of autonomous AI research agents, with Apollo’s demonstrated advances validating the necessity of adaptive, human-in-the-loop fine-tuning frameworks.
InnovatorBench and ResearchGym are released at https://github.com/GAIR-NLP/InnovatorBench for continued development and comparative research (Fu et al., 31 Oct 2025, Wu et al., 31 Oct 2025).