ScienceAgentBench: LLM Scientific Benchmark
- ScienceAgentBench is a benchmark that rigorously assesses LLM agents’ scientific reasoning and code generation on authentic tasks drawn from peer-reviewed publications.
- It employs detailed metrics like valid execution rate, success rate, and CodeBERTScore to quantify both technical accuracy and domain relevance.
- The framework exposes current limitations in error handling and complex workflow execution, highlighting the need for improved automation in scientific research.
ScienceAgentBench is a rigorously designed benchmark for evaluating the scientific reasoning and code generation capabilities of LLM agents in realistic, data-driven discovery workflows. Developed in response to the growing yet untested claims regarding LLM-powered end-to-end automation in scientific research, ScienceAgentBench focuses on granular, task-level assessment anchored in real-world scientific practice. The benchmark systematically assembles and validates tasks sourced from peer-reviewed publications, implements quantitative and rubric-based evaluation strategies, and provides a public reference point for measuring progress in LLM-based scientific assistance and automation (2410.05080).
1. Design and Purpose
ScienceAgentBench was motivated by the need for rigorous measurement of LLM agents’ ability to assist in data-driven scientific discovery, moving beyond abstract claims of end-to-end research automation. Instead of evaluating agents on simulated or toy problems, tasks were directly extracted from 44 peer-reviewed papers spanning four scientific disciplines: Bioinformatics, Computational Chemistry, Geographic Information Science, and Psychology/Cognitive Neuroscience.
A total of 102 tasks were curated, each representing a self-contained stage of a real scientific workflow—such as data loading, analysis, statistical modeling, and visualization. Task validation involved multiple rounds of manual review and annotation by a panel of nine subject matter experts, ensuring that task statements, datasets, and expected outputs maintain authentic relevance and scientific rigor. All task outputs are unified into standalone, executable Python programs whose functional correctness can be systematically assessed.
2. Evaluation Framework and Metrics
ScienceAgentBench employs a comprehensive, multi-metric evaluation protocol to capture both technical execution and domain-specific correctness:
- Valid Execution Rate (VER): Measures the fraction of generated programs that execute error-free and save outputs using the correct naming convention.
- Success Rate (SR): Reflects the proportion of agent-generated programs whose execution outputs satisfy domain-specific, expert-defined success criteria, such as correct predictions, statistical results, or visualization outputs. SR is reported as the fraction of tasks whose outputs pass all rubric checks (a formal sketch follows this list).
- CodeBERTScore (CBS): Computes the similarity between generated code and the annotated reference program using contextual embeddings; successful solutions (SR = 1) are assigned CBS = 1.0, while unsuccessful runs are scored by the semantic similarity of their tokens to the reference (see the formulation below).
- API Cost: Quantifies the monetary expenditure required to run and score a submission; this metric is critical for real-world deployment.
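One plausible formalization of the aggregate metrics, consistent with the descriptions above (the notation is illustrative, assumes per-task binary success judgments, and is not copied verbatim from the paper): for $N$ tasks with generated programs $p_i$ and annotated reference programs $r_i$,

$$
\mathrm{VER} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big[\,p_i \text{ executes without error and writes its output file}\,\big],
\qquad
\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big[\,\text{outputs of } p_i \text{ satisfy all rubric criteria}\,\big],
$$

$$
\mathrm{CBS} = \frac{1}{N}\sum_{i=1}^{N}
\begin{cases}
1.0 & \text{if task } i \text{ is successful},\\[2pt]
F_1^{\mathrm{CodeBERT}}(p_i, r_i) & \text{otherwise},
\end{cases}
$$

where $F_1^{\mathrm{CodeBERT}}$ is the harmonic mean of the precision and recall obtained by greedily matching contextual token embeddings of $p_i$ and $r_i$ by cosine similarity.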
To address the risk of data contamination—given that some datasets are openly available—the benchmark applies two explicit mitigations: (1) randomly removing a small number of data points from provided test splits, and (2) replacing supervised task test labels with dummy values (e.g. −1), ensuring the agent does not simply output memorized solutions.
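A minimal sketch of these two mitigations, assuming a tabular test split with a label column (the file names, column name, and drop fraction here are hypothetical choices for illustration):

```python
import pandas as pd

def mask_test_split(path: str, label_col: str = "label",
                    drop_frac: float = 0.02, seed: int = 0) -> pd.DataFrame:
    """Apply the two contamination mitigations described above:
    (1) randomly remove a small fraction of test rows, and
    (2) replace supervised labels with a dummy value (-1)."""
    df = pd.read_csv(path)
    # (1) drop a small random subset of rows so memorized outputs no longer align
    df = df.sample(frac=1.0 - drop_frac, random_state=seed).reset_index(drop=True)
    # (2) overwrite the ground-truth labels with a dummy value
    if label_col in df.columns:
        df[label_col] = -1
    return df

# Hypothetical usage: write the masked split that is handed to the agent
masked = mask_test_split("test_split.csv")
masked.to_csv("test_split_masked.csv", index=False)
```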
3. Experimental Protocol and Model Performance
ScienceAgentBench was used to benchmark five open-weight and proprietary LLMs, tested with three established agentic frameworks: direct prompting, OpenHands CodeAct, and self-debug strategies. Each agent receives three attempts per task. Under standard evaluation:
- The best-performing agent solves only about 32.4% of tasks independently, rising to 34.3% when expert-provided supplementary knowledge is supplied.
- An advanced proprietary agent, OpenAI o1-preview, demonstrates higher performance when allowed increased inference-time compute—solving 42.2% of tasks via self-debug—but incurs a tenfold higher API cost.
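The self-debug strategy can be pictured as a small generate-execute-repair loop. The sketch below is illustrative only: the `generate` callable stands in for an arbitrary LLM call, and the attempt limit and timeout are assumptions rather than the benchmark's actual harness.

```python
import subprocess
import sys
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # mirrors the three attempts per task described above

def self_debug(task_prompt: str,
               generate: Callable[[str, Optional[str]], str],
               out_path: str = "solution.py") -> bool:
    """Generate-execute-repair loop: `generate` maps a task prompt (and, on
    retries, the previous traceback) to Python source code."""
    feedback: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        code = generate(task_prompt, feedback)
        with open(out_path, "w") as f:
            f.write(code)
        # Run the candidate program and capture any traceback it produces
        result = subprocess.run([sys.executable, out_path],
                                capture_output=True, text=True, timeout=600)
        if result.returncode == 0:
            return True           # error-free execution; outputs are judged separately
        feedback = result.stderr  # feed the error back into the next attempt
    return False
```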
Performance analyses reveal several critical limitations:
- Even strong agents produce code with frequent low-level errors (e.g., invalid API calls, failure to save output correctly) despite sound high-level structure.
- Success rates drop sharply for tasks requiring longer or more intricate programs (beyond roughly 58 lines), indicating insufficient robustness on involved real-world workflows (a simple length-binning illustration follows this list).
- Substantial quality gaps remain relative to human programmers, even though agents produce a first-draft solution in about 10 minutes, whereas a human typically requires 2.5–3 hours per task.
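The length-dependence noted above can be examined with a simple binning analysis over per-task results; the results table, column names, and bucket edges below are hypothetical and serve only to illustrate the procedure.

```python
import pandas as pd

# Hypothetical per-task results: reference-program length and binary success
results = pd.DataFrame({
    "ref_lines": [30, 45, 58, 72, 90, 120, 40, 65, 150, 55],
    "success":   [1,  1,  1,  0,  0,  0,   1,  0,  0,   1],
})

# Bucket tasks by reference-program length and compare mean success rates
results["bucket"] = pd.cut(results["ref_lines"],
                           bins=[0, 58, 100, float("inf")],
                           labels=["<=58 lines", "59-100 lines", ">100 lines"])
print(results.groupby("bucket", observed=True)["success"].mean())
```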
4. Scientific Authenticity and Data Curation
The authenticity of ScienceAgentBench is underpinned by its origin in published scientific work and multi-stage expert validation. The task extraction protocol ensures that every item:
- Replicates actual steps encountered in published research,
- Is annotated using expert-reviewed rubrics and reference programs,
- Is reviewed iteratively until both technical clarity and scientific accuracy are established.
This workflow guards against artificial simplifications and ensures output requirements correspond to real research practice. All tasks are defined to be self-contained (i.e., executable with only the provided files and instructions), making the benchmark robust to data contamination and faithfully reflecting open-world research settings.
5. Limitations, Practical Implications, and Path Forward
Despite notable improvements in code generation and rapid draft creation, the ScienceAgentBench results demonstrate key obstacles to practical end-to-end automation:
- Error Robustness: Agents frequently fail at granular, low-level tasks critical for research (e.g., correct use of libraries, handling edge cases in datasets). This limits their reliability in unsupervised research environments.
- Complex Workflow Handling: Program length and domain-specific tool usage are significant bottlenecks. As task complexity grows, success rates decline rapidly, revealing the need for advanced planning and domain knowledge integration.
- Cost-Performance Tradeoff: Additional inference-time compute (e.g., through self-debug routines) can boost solution rates but leads to a more than tenfold increase in computational cost, challenging the practical deployment of such agents.
A plausible implication is that, although LLM-based agents accelerate initial code generation, extensive post-processing or expert intervention remains necessary for dependable output, particularly for the sophisticated analyses common in scientific domains.
6. Relation to the Broader Research Landscape
ScienceAgentBench both complements and advances the state-of-the-art in agentic evaluation for scientific research. It shares conceptual overlap with benchmarks in automated scientific discovery and coding—such as AgentBench (2308.03688), AgentQuest (2404.06411), and MLGym (2502.14499)—but is uniquely distinguished by:
- Strong anchoring in peer-reviewed scientific practice;
- Comprehensive, expert-verified task curation;
- Unified evaluation protocols based both on code execution and scientific output quality.
Recent extensions, including automated data scaling pipelines like AutoSDT (2506.08140), demonstrate that fine-tuning on large, ecologically valid task datasets can significantly boost open-weight LLM performance, with the gap between top proprietary and open-source models narrowing for the kinds of data-driven discovery tested in ScienceAgentBench.
7. Technical Artifacts and Resources
ScienceAgentBench provides open access to:
- All benchmark tasks, datasets, and reference solutions,
- Detailed evaluation protocols and rubric definitions,
- Representative code snippets and LaTeX formulations for tasks and metrics.
Typical evaluation scripts feature domain-specific verification of outputs and are suitable for integration with both academic and industrial research toolchains. The inclusion of standardized metrics (VER, SR, CBS, API Cost) enables reproducible benchmarking and direct comparison across a diverse set of LLM agents and frameworks.
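A skeletal evaluation harness consistent with this protocol might look as follows; the checker interface, file paths, and timeout are assumptions for illustration, not the benchmark's actual API (CBS and API cost would be computed in separate steps).

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable

def evaluate_task(program: Path, expected_output: Path,
                  checker: Callable[[Path], bool]) -> dict:
    """Run one generated program and score it.

    VER: the program exits cleanly and writes the expected output file.
    SR:  a domain-specific checker accepts that output file.
    """
    proc = subprocess.run([sys.executable, str(program)],
                          capture_output=True, text=True, timeout=1800)
    valid = proc.returncode == 0 and expected_output.exists()
    success = valid and checker(expected_output)
    return {"VER": int(valid), "SR": int(success)}

# Hypothetical usage with a trivial checker that requires a non-empty CSV
score = evaluate_task(Path("pred_program.py"), Path("pred_results/output.csv"),
                      checker=lambda p: p.stat().st_size > 0)
print(score)
```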
ScienceAgentBench establishes a rigorous foundation for evaluating the true capabilities of LLM-based agents in scholarly discovery. Its results provide a candid perspective on both the promise and the current limitations of automated research agents, highlighting a need for further innovation in error correction, workflow robustness, and domain adaptation before full automation of scientific programming becomes viable (2410.05080).