AstaBench: AI Science Research Benchmark

Updated 29 October 2025
  • AstaBench is a comprehensive benchmark suite that evaluates AI agentic abilities in performing diverse scientific research tasks using real-world queries.
  • It employs controlled, production-grade environments with standardized task formats and explicit cost controls to ensure reproducibility and fair comparison.
  • Empirical findings reveal key trade-offs between performance, engineering effort, and cost-efficiency, highlighting challenges in achieving holistic AI science assistance.

AstaBench is a comprehensive benchmark suite developed to rigorously evaluate the agentic abilities of AI systems in performing scientific research tasks. It is distinguished by its product-informed design, multidimensional metrics, and reproducibility, addressing deficiencies found in previous agent benchmarks. AstaBench encompasses the entire spectrum of scientific discovery through a suite of over 2,400 problems inspired by authentic user queries to deployed Asta agents, with controlled, production-grade tooling that enables robust comparisons and systematic progress tracking.

1. Foundational Principles and Design

AstaBench is grounded in five principles designed to maximize benchmarking fidelity for scientific research AI agents:

  1. Task Suite Realism: Task construction is informed by actual product usage data, reflecting the diversity and complexity of real-world scientific research—including literature reviews, experiment replication, data analysis, coding, hypothesis generation, and planning.
  2. Standard, Realistic, and Reproducible Environment: Agents are evaluated within a controlled setting. The benchmark leverages production-grade search tools and computational notebooks to ensure uniform tool access and eliminate unfair advantages through privileged data or environment configurations.
  3. Explicit Control of Confounders: Evaluation systematically accounts for model inference cost (measured in USD and reported per problem) and distinguishes tiers of tool access ("Standard," "Custom interface," "Fully custom"), making cost-effectiveness and solution strategies transparent (illustrated in the sketch below).
  4. Standardized Task Formats: Uniform interfaces for all task types facilitate integration with general-purpose agents and fair cross-system comparison.
  5. Comprehensive Baseline Suite: A broad selection of science-optimized and generic agent architectures is provided, with open-source implementations where feasible, forming the basis for controlled, comparative evaluation.

These principles distinguish AstaBench from prior benchmarks, which lacked holistic product-informed coverage, standardized tooling, explicit cost controls, unified task interfaces, and strong open baselines.
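
To make the cost-control principle (point 3 above) concrete, the following is a minimal sketch of per-problem cost accounting against a frozen pricing snapshot. The model names, prices, and field layout are illustrative assumptions, not AstaBench's actual configuration.

```python
# Minimal sketch of per-problem cost accounting against a frozen pricing
# snapshot (USD per million tokens). Models and prices are hypothetical.
FROZEN_PRICING = {
    "gpt-5-mini": (0.25, 2.00),   # (input $/1M tokens, output $/1M tokens)
    "gpt-4o": (2.50, 10.00),
}

def problem_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single problem under the frozen snapshot."""
    in_price, out_price = FROZEN_PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: one problem that consumed 120k input and 8k output tokens.
print(round(problem_cost_usd("gpt-5-mini", 120_000, 8_000), 4))  # 0.046
```

Reporting cost per problem in fixed USD terms, rather than raw token counts, is what allows agents built on different models to be compared on the same leaderboard axis.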

2. Scope of Benchmarks: Problem Types and Scientific Domains

AstaBench comprises 11 benchmark sets containing more than 2,400 problems that collectively span the full scientific research lifecycle and multiple domains. Tasks are constructed to assess core scientific agentic competencies:

| Category | Benchmark Name | Domains | Test Size | Example Type |
| --- | --- | --- | --- | --- |
| Literature Search, QA | PaperFindingBench | CS | 267 | Paper retrieval queries |
| | LitQA2-FullText(-Search) | Biology | 75 | MCQ and retrieval from papers |
| | ScholarQA-CS2 | CS | 100 | Long-form science Q&A |
| | ArxivDIGESTables-Clean | Mixed | 100 | Literature table synthesis |
| Code & Experimentation | SUPER-Expert | CS | 45 | ML repository experiment replication |
| | CORE-Bench-Hard | Mixed | 37 | Published analysis reproduction |
| | DS-1000 | CS | 900 | Data science coding |
| Data Analysis | DiscoveryBench | Mixed | 239 | Hypothesis generation |
| End-to-End Discovery | E2E-Bench, E2E-Bench-Hard | CS | 80 | Project orchestration |

Coverage extends to computer science, biology, social science, engineering, economics, meta-science, medicine, and the humanities. The tasks encompass navigational/semantic literature search, long-form cited Q&A, table synthesis, code execution, hypothesis discovery from datasets, and orchestrated research projects.

3. Evaluation Environment and Tooling

Benchmarking takes place in the Asta Environment, which incorporates:

  • Asta Scientific Corpus Toolset: Access to a large, production-grade literature corpus via search APIs. Retrieval, metadata queries, author search, and citation lookup functionalities are available. Critically, queries are date-restricted to eliminate leakage from post-benchmark publications.
  • Computational Notebook Interface: Agents can interact with a stateful Python (Jupyter) notebook supporting shell commands, matplotlib, file I/O, and code execution in a time-bounded sandbox (e.g., 5 minutes per cell).
  • Tooling Protocol (MCP): All tools are decoupled from agent implementations and exposed via a standardized protocol, enabling agent reuse and fair comparison irrespective of internal implementation details.

This environment enforces reproducibility and robust agent-agent comparisons, elevating the benchmark’s rigor relative to those using unconstrained or ad hoc tool access.
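
To illustrate the time-bounded notebook execution described above, the following is a minimal sketch that runs an agent-submitted code cell in a subprocess under a wall-clock limit. A real stateful notebook keeps a persistent kernel across cells; this sketch is not the Asta Environment's implementation and only captures the time-bounding aspect.

```python
import subprocess

# Illustrative wall-clock limit; the text mentions e.g. 5 minutes per cell.
CELL_TIMEOUT_SECONDS = 300

def run_cell(code: str, timeout: int = CELL_TIMEOUT_SECONDS) -> dict:
    """Run a code cell in a fresh Python subprocess and capture its output."""
    try:
        proc = subprocess.run(
            ["python", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "cell exceeded the time limit", "timed_out": True}

print(run_cell("print(2 + 2)"))  # {'stdout': '4\n', 'stderr': '', 'timed_out': False}
```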

4. Evaluation Methodology and Metrics

AstaBench employs the agent-eval toolkit to assess both accuracy and computational cost (USD per problem), using static (frozen) snapshots of community LLM pricing to neutralize temporal cost fluctuations. Comprehensive tracking of agent openness (open-source, open-weights, closed-source) and tooling adherence increases transparency and reproducibility.

Metrics are tailored to each task category:

  • Literature Search/Retrieval: F1 and nDCG for navigational and metadata queries; harmonic mean of estimated recall@k and nDCG for semantic search. For example, PaperFindingBench uses the score below (a small numeric check follows this list):

$$\text{Final Score}_{\text{semantic}} = \frac{2 \times \text{nDCG} \times \text{estimated-Recall@}k}{\text{nDCG} + \text{estimated-Recall@}k}$$

  • Long-form QA: Composite LLM-as-judge scores on answer coverage, citation precision/recall, and relevance. Rubrics are automatically generated and clustered using LLMs.
  • Table Synthesis: Evaluation by converting tables to atomic statements and measuring entailment-based recall versus ground truth.
  • Code & Data Science: Output exact match, test pass rates, or LLM-judge correctness.
  • Data-Driven Discovery: Structured hypothesis entailment matching; scoring considers context, variables, and relationship extraction.
  • End-to-End Discovery: Stepwise LLM-judge rubric scores for reports, code, and related artifacts.
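
To make the semantic-search scoring formula above concrete, here is a small numeric sketch of the harmonic mean of nDCG and estimated recall@k; the input values are invented for illustration.

```python
def semantic_final_score(ndcg: float, est_recall_at_k: float) -> float:
    """Harmonic mean of nDCG and estimated recall@k, as in the formula above."""
    if ndcg + est_recall_at_k == 0:
        return 0.0
    return 2 * ndcg * est_recall_at_k / (ndcg + est_recall_at_k)

# Invented values: a run with nDCG = 0.72 and estimated recall@k = 0.58.
print(round(semantic_final_score(0.72, 0.58), 3))  # 0.642
```

The harmonic mean rewards agents only when both retrieval coverage and ranking quality are high; a large imbalance between the two pulls the final score toward the weaker component.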

All experiments are logged with code, data commit hash, and model versions, ensuring result reproducibility. Score/cost trade-offs are visualized via Pareto curves on public leaderboards.
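
As a sketch of how a score/cost Pareto frontier can be derived from per-agent results, the following keeps only agents that no alternative matches or beats on both cost and accuracy. The agent names and numbers are invented, and this is not the leaderboard's actual code.

```python
# Each row is one agent's macro-average score and mean USD cost per problem.
# The data below are invented for illustration.
results = [
    {"agent": "react-mini", "cost": 0.04, "score": 0.32},
    {"agent": "orchestrator-large", "cost": 1.50, "score": 0.53},
    {"agent": "generic-coder", "cost": 0.90, "score": 0.30},  # dominated
]

def pareto_frontier(rows):
    """Keep rows not dominated by a row that is cheaper-or-equal and better-or-equal."""
    frontier = []
    for r in rows:
        dominated = any(
            o["cost"] <= r["cost"]
            and o["score"] >= r["score"]
            and (o["cost"] < r["cost"] or o["score"] > r["score"])
            for o in rows
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda row: row["cost"])

print([r["agent"] for r in pareto_frontier(results)])
# ['react-mini', 'orchestrator-large']
```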

5. Agent and Baseline Architectures

AstaBench evaluates 57 agents across 22 classes, including nine science-optimized Asta agents and a suite of baselines:

  • Science-optimized Asta Agents: Asta v0 (orchestration), Paper Finder (retrieval), Scholar QA (long-form QA), Table Synthesis, Code (ReAct+traceability), DataVoyager (multi-agent dataset analysis), Panda (plan/act/report), CodeScientist (genetic search/joint discovery).
  • General Baselines: ReAct (minimum-viable tool-using LLM), SmolAgents Coder (code-based ReAct), Faker (fake results control).
  • Research-focused Baselines: Systems such as Elicit, STORM, OpenSciLM, OpenAI Deep Research, SciSpace, Perplexity Sonar, and commercial offerings (You.com, FutureHouse Crow/Falcon).
  • Model Coverage: Agents are instantiated across OpenAI (gpt-4o, gpt-5, o3), Anthropic Claude, Google Gemini, and Meta LLaMA models, spanning both open and closed weights.

This breadth ensures systematic identification of advances and limitations across design paradigms.
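
The ReAct baseline mentioned above follows a simple reason-act-observe loop. The sketch below shows that control flow with a scripted stand-in for the LLM and a toy tool; it is schematic, not the AstaBench implementation.

```python
# Schematic ReAct loop: the model alternates between reasoning and tool calls
# until it emits a final answer. The LLM and the tool are hypothetical stubs.
TOOLS = {
    "search_papers": lambda query: f"3 papers found for {query!r}",  # stub tool
}

def call_llm(messages):
    """Stand-in for an LLM call: search once, then answer from the observation."""
    if not any(m["content"].startswith("Observation:") for m in messages):
        return {"thought": "I should search the literature first.",
                "action": "search_papers",
                "action_input": "agent benchmarks for science"}
    return {"final_answer": "Answer drafted from the retrieved papers."}

def react_agent(question: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = call_llm(messages)
        if "final_answer" in step:                                  # stop condition
            return step["final_answer"]
        observation = TOOLS[step["action"]](step["action_input"])   # act
        messages.append({"role": "assistant",
                         "content": f"Thought: {step['thought']}\nAction: {step['action']}"})
        messages.append({"role": "user",
                         "content": f"Observation: {observation}"})  # observe
    return "No final answer within the step budget."

print(react_agent("Which benchmarks evaluate science agents?"))
```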

6. Empirical Findings and Performance Analysis

Major findings from the evaluation of 57 agents include:

  1. General-purpose Science AI Remains Unsatisfactory: No agent class achieves high performance across all scientific tasks; the top open-source agents using open-weight LLMs score ~11% macro-average, while the best closed-model orchestrators reach ~53%.
  2. Impact of Engineering and Tooling: Specialized agents (e.g., Asta v0) outperform generic baselines by approximately 9% but require greater engineering effort and, at times, incur substantially higher inference costs.
  3. Trade-offs Between Cost and Quality: Pareto analysis shows, for example, that ReAct + gpt-5-mini attains 32% at $0.04/problem, while higher-accuracy agents often entail orders-of-magnitude greater costs. Counterintuitively, a more expensive model can reduce overall cost because it is more efficient per step than weaker, low-cost alternatives.
  4. Non-uniform Impact of LLM Upgrades: The latest LLMs (gpt-5) improve ReAct workflows but can degrade performance of certain science-optimized agents.
  5. Category-specific Challenges:
    • Literature Search/Q&A: Best agents attain ~80–90% (Asta Paper Finder, Scholar QA, Elicit, SciSpace).
    • Table Synthesis: Unsatisfactory; top recall ~43%.
    • Coding/Execution: Substantial bottleneck—<25% for complex repo reproduction; classic data science tasks (DS-1000) are nearly perfectly solved by open-source code agents.
    • Data-Driven Discovery: Maximum observed ~34% on DiscoveryBench.
    • End-to-End Discovery: ~70% step completion, but full project orchestration is exceedingly rare (<1% for leading agents).

A plausible implication is that engineering accessible, robust, tool-using agents presents persistent complexities, and that holistic science automation remains an open research challenge.

7. Benchmarking Protocols, Reproducibility, and Community Interface

AstaBench provides:

  • Open-source evaluation code, agent implementations, and standardized research environments.
  • Publication of all logs and experiment artifacts, including precise model and data hashes.
  • Clearly defined, task-specific metrics, with interactive, public leaderboards and transparent submission/authentication criteria.

This facilitates direct, reproducible benchmarking and maximizes transparency across evaluations and agent comparisons.


AstaBench establishes a new rigorous standard for AI agent benchmarking in scientific research, characterized by extensive problem diversity, controlled environments, transparent multi-factor reporting, comprehensive baselines, and robust evaluation protocols. Despite isolated progress, results demonstrate that comprehensive AI science assistance remains an unsolved challenge, positioning AstaBench as a reference foundation for future research directions and systematic tracking of agent capabilities in scientific domains.
