AIRS-Bench: Autonomous Research Agent Benchmark

Updated 10 February 2026
  • AIRS-Bench is an open benchmark suite designed to evaluate autonomous LLM research agents across the entire research lifecycle with 20 tasks sourced from modern ML literature.
  • It standardizes evaluation through harmonized metrics and modular task structures, facilitating rigorous comparisons across diverse agentic frameworks and scaffolds.
  • The benchmark provides open-source code, extensive task definitions, and pipelines to reproducibly analyze agent performance against human state-of-the-art baselines.

AIRS-Bench is an open benchmark suite designed to systematically evaluate autonomous LLM–based research agents on a spectrum of tasks drawn from state-of-the-art machine learning publications. Its primary goal is to measure agentic capabilities across the entire research lifecycle—including the generation of research ideas, design and implementation of methodologies, experimental analysis, and iterative refinement—within a standardized and extensible ecosystem. AIRS-Bench comprises 20 tasks spanning seven distinct domains, provides harmonized evaluation metrics and theoretical optimal targets, and utilizes multiple agentic frameworks and scaffolds for rigorous comparative assessment. Its resources, including code and task definitions, are open-sourced to foster progress in the development and understanding of AI-driven scientific research agents (Lupidi et al., 6 Feb 2026).

1. Task Suite and Domain Coverage

AIRS-Bench covers 20 tasks selected from contemporary machine learning literature, reflecting a cross-section of canonical and frontier research problems. These tasks are partitioned into the following seven domains:

| Domain | Example Task(s) | Metric(s) |
|---|---|---|
| Code | CodeGeneration APPS, CodeRetrieval CodeXGlue | Pass@5, MRR |
| Mathematics | MathQuestionAnswering SVAMP | Accuracy |
| Molecular & Protein Modeling | QM9 property regression, ZINC graph regression | MAE |
| Text Classification | SentimentAnalysis Yelp, TextualClassification SICK | Accuracy |
| Text Extraction & Matching | CoreferenceResolution WSC/Winogrande, TextualSimilarity SICK | Accuracy, Spearman ρ |
| Question Answering | SQuAD, DuoRC, Eli5, FinQA | EM, Accuracy, ROUGE-1 |
| Time-Series Forecasting | WebTraffic, Rideshare, SolarWeekly | MASE (WebTraffic); MAE (Rideshare, SolarWeekly) |

Each task is specified by a {problem, dataset, metric} tuple and typically includes input/output formats, a canonical dataset, and a primary evaluation metric with an explicit theoretical optimum ($s_t^\mathrm{opt}$) and a published human state-of-the-art (SOTA) baseline.
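
As a rough illustration of this tuple structure, the sketch below shows one way a task specification could be represented in Python. The class and field names are hypothetical, not the benchmark's actual schema; the placeholder values are taken from the TextualSimilarity SICK task.

```python
# Illustrative only: a minimal record mirroring the {problem, dataset, metric}
# tuple described above. Names and fields are hypothetical, not AIRS-Bench's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    problem: str          # e.g., "TextualSimilarity"
    dataset: str          # e.g., "SICK"
    metric: str           # e.g., "spearman_rho"
    optimal_score: float  # theoretical optimum s_t^opt for the metric
    sota_score: float     # published human SOTA baseline s_t^sota

# Placeholder instantiation for the TextualSimilarity SICK task.
example = TaskSpec(
    problem="TextualSimilarity",
    dataset="SICK",
    metric="spearman_rho",
    optimal_score=1.0,
    sota_score=0.85,
)
```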

2. Task Format and Integration

The standardization of task structure is crucial for broad accessibility and reproducibility. Each AIRS-Bench task is distributed in a dedicated directory containing:

  • metadata.yaml: Details the problem description, dataset reference, metric formula, SOTA baseline, and optimal score.
  • project_description.md: A human-readable summary of the task and relevant context.
  • prepare.py / evaluate_prepare.py: Scripts for preprocessing data and preparing inputs.
  • evaluate.py: Contains the metric implementation for result scoring.
  • utils.py (optional): Auxiliary utilities.
  • data/: Pre-partitioned train and test splits.

This modular design allows for straightforward integration into multiple agentic harnesses, such as AIRA-dojo and MLGym. Automated converters facilitate task onboarding into new agent evaluation pipelines (Lupidi et al., 6 Feb 2026).
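
The sketch below illustrates, under assumed metadata keys and function names, how a harness adapter might consume such a task directory; only the file names are taken from the layout above.

```python
# Hypothetical harness-side loader for a task directory with the layout listed
# above. File names match that layout; everything else is assumed for illustration.
import subprocess
from pathlib import Path

import yaml  # PyYAML


def load_task(task_dir: str) -> dict:
    """Read task metadata, run data preparation, and collect evaluation artifacts."""
    root = Path(task_dir)

    # metadata.yaml: problem description, dataset reference, metric, SOTA, optimum.
    with open(root / "metadata.yaml") as f:
        metadata = yaml.safe_load(f)

    # prepare.py: preprocess raw data into the pre-partitioned data/ splits.
    subprocess.run(["python", str(root / "prepare.py")], check=True)

    return {
        "metadata": metadata,
        "description": (root / "project_description.md").read_text(),
        "data_dir": root / "data",
        "evaluate_script": root / "evaluate.py",
    }
```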

3. Agentic Framework: LLM+Scaffold Architecture

AIRS-Bench evaluates "agents" defined as (LLM + scaffold) pairs, where:

  • The LLM provides the probabilistic core (e.g., GPT-4o, gpt-oss, CWM, o3-mini, Devstral).
  • The scaffold orchestrates multi-step, test-time search using composable operators (Draft, Debug, Improve, etc.) and search policies (greedy, MCTS, evolutionary), enabling cycles of program synthesis, test-driven debugging, and iterative improvement.

Frameworks support both sequential (e.g., ReAct) and parallel (e.g., population search, MCTS) scaffolds. Agents emulate the scientific method through an iterative process: draft a solution, experimentally validate via code execution and metric calculation, analyze failures, and refine accordingly. This design supports both one-shot and multi-stage agentic evaluation.
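
As a minimal sketch of this loop under a greedy search policy, the code below strings together draft, evaluate, and improve steps; the callables stand in for LLM calls and task-specific scoring and are not the benchmark's actual scaffold implementation.

```python
# Minimal greedy scaffold loop in the spirit of the Draft/Improve operators
# described above. The callables are placeholders, not AIRS-Bench code.
from typing import Callable


def greedy_scaffold(
    draft: Callable[[str], str],           # task description -> candidate solution
    improve: Callable[[str, float], str],  # (solution, score) -> refined solution
    evaluate: Callable[[str], float],      # run solution, return task metric (higher is better)
    task_description: str,
    max_iterations: int = 10,
) -> tuple[str, float]:
    """Draft a solution, score it, and greedily keep only improving refinements."""
    best_solution = draft(task_description)
    best_score = evaluate(best_solution)

    for _ in range(max_iterations):
        candidate = improve(best_solution, best_score)
        score = evaluate(candidate)
        if score > best_score:  # greedy policy: accept only strict improvements
            best_solution, best_score = candidate, score

    return best_solution, best_score
```

Parallel scaffolds such as MCTS or evolutionary search replace this single accept-if-better step with a population of candidates explored under the same operators.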

4. Evaluation Metrics, Scoring, and Theoretical Ceilings

Evaluation is metric-specific; performance is reported and compared across tasks via:

  • Raw task metric (e.g., accuracy, MAE, ROUGE-1, MRR, Spearman ρ)
  • Normalized Score (NS):

$$\mathrm{NS}_t^a = \frac{\phi_t(s_t^a) - \phi_t(s_t^\mathrm{min})}{\phi_t(s_t^\mathrm{sota}) - \phi_t(s_t^\mathrm{min})} \quad \text{where} \quad \phi_t(s) = -\log_{10}\left|s - s_t^\mathrm{opt}\right|$$

NS calibrates a method’s performance relative to the theoretical ceiling and human SOTA for each task.

  • Valid Submission Rate (VSR): Proportion of runs producing valid outputs (format-compliant with numerical scores).
  • Elo Rating: Agents, together with the "human SOTA" baseline, are rated in a comparative Elo system based on head-to-head performance on each task.

Performance is evaluated across multiple independent seeds under a uniform computational budget (24 h per agent per task on an H200 GPU).
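
As a concrete illustration of the NS formula above, the following sketch computes NS for placeholder scores; it is not the benchmark's reference implementation.

```python
# Minimal sketch of the Normalized Score (NS) defined above.
import math


def phi(score: float, optimal: float) -> float:
    """phi_t(s) = -log10 |s - s_t^opt|; larger means closer to the theoretical optimum."""
    # Note: a score exactly equal to the optimum makes |s - s_t^opt| = 0,
    # where phi is undefined; this sketch does not guard against that case.
    return -math.log10(abs(score - optimal))


def normalized_score(agent: float, minimum: float, sota: float, optimal: float) -> float:
    """NS_t^a: agent performance calibrated between a minimum baseline and human SOTA."""
    return (phi(agent, optimal) - phi(minimum, optimal)) / (
        phi(sota, optimal) - phi(minimum, optimal)
    )


# Placeholder values: an agent at 0.80 accuracy on a task with a 0.50 minimum
# baseline, 0.90 human SOTA, and a theoretical optimum of 1.0 (NS ≈ 0.57).
print(normalized_score(agent=0.80, minimum=0.50, sota=0.90, optimal=1.0))
```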

5. Baseline Results and Observed Agent Limitations

Testing 14 agent/scaffold configurations shows that, in aggregate, current agents achieve a mean NS of 23.4% and a mean VSR of 58.8%. Only 4 of the 20 tasks have seen an agent surpass the human SOTA:

  1. TextualClassification SICK (0.93 vs 0.90 SOTA) using a 2-model stacked ensemble.
  2. TextualSimilarity SICK (0.89 vs 0.85) by RoBERTa/Sentence-BERT ensembling.
  3. CoreferenceResolution Winogrande (0.88 vs 0.85) with DeBERTa-v3 finetuning.
  4. Rideshare forecasting (MAE 1.153 vs SOTA 1.185) with a bidirectional GRU ensemble.

On 16 tasks, frontier agents still fall short of human SOTA, and none achieve the theoretical optimum. Key failure modes include formatting/submission errors, context length overflow for long tasks, cumulative code drift in extended scaffolds, and insufficient exploration in complex agentic workflows (Lupidi et al., 6 Feb 2026).

6. Benchmarking Infrastructure and APIs

AIRS-Bench’s open-source repository (github.com/facebookresearch/airs-bench) includes:

  • Full task suite with all necessary artifacts for independent use and extension.
  • Harness adapters for agentic environments (AIRA-dojo, MLGym).
  • Pipeline scripts (run_benchmark.py, aggregate_results.py, plot_metrics.py) for running agents, collecting per-seed metrics, aggregating results, and generating visualizations.
  • Modular evaluators for all task metrics and normalization.

The architecture facilitates both large-scale benchmarking runs and streamlined per-task agent development and evaluation.
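
For illustration, a hypothetical per-seed aggregation step in the spirit of this pipeline might look as follows; the record fields and function name are assumptions, not the repository's API.

```python
# Hypothetical aggregation of per-seed run records into mean NS and VSR.
from statistics import mean


def aggregate_runs(runs: list[dict]) -> dict:
    """Each run record: {"task": str, "seed": int, "ns": float | None, "valid": bool}."""
    valid_runs = [r for r in runs if r["valid"] and r["ns"] is not None]
    return {
        "mean_ns": mean(r["ns"] for r in valid_runs) if valid_runs else 0.0,
        "valid_submission_rate": sum(r["valid"] for r in runs) / len(runs),
        "n_runs": len(runs),
    }


# Example: three seeds on one task, one of which produced an invalid submission.
runs = [
    {"task": "TextualSimilaritySICK", "seed": 0, "ns": 0.41, "valid": True},
    {"task": "TextualSimilaritySICK", "seed": 1, "ns": 0.37, "valid": True},
    {"task": "TextualSimilaritySICK", "seed": 2, "ns": None, "valid": False},
]
print(aggregate_runs(runs))  # mean_ns ≈ 0.39, valid_submission_rate ≈ 0.67 over 3 runs
```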

7. Future Directions and Open Challenges

AIRS-Bench identifies several avenues for future work:

  • Expansion of task domains to physics, chemistry, and broader scientific areas.
  • Improved scaffolds, especially hybrid search paradigms (e.g., combining MCTS with domain heuristics).
  • Modular toolkits for richer environment feedback and error validation.
  • Automated task onboarding pipelines to reduce the human bottleneck in adding new benchmark tasks.
  • Development of a unified, machine-readable registry for SOTA results and theoretical ceilings.
  • Incorporation of automated literature retrieval as part of agent ideation.
  • Investigation of adaptive compute/time budgets, allowing dynamic tradeoffs between creativity and efficiency.

These efforts aim to address both methodological and infrastructural gaps, driving toward benchmarks that catalyze further progress in autonomous scientific research agents (Lupidi et al., 6 Feb 2026).
