An Overview of CORE-Bench: A Computational Reproducibility Agent Benchmark
The paper "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark" by Siegel et al. introduces a novel benchmark aimed at enhancing the automation of computational reproducibility within scientific research. With the increasing prevalence of AI agents in scientific workflows, the authors propose CORE-Bench as a rigorous assessment tool for the accuracy and efficiency of these agents in reproducing research results using provided code and data. This essay provides an expert summary of the benchmark, key findings, and implications of this research.
Context and Need for CORE-Bench
Computational reproducibility, a fundamental aspect of scientific integrity, requires that experimental results can be verified using the original paper's data and code, yet it faces significant challenges in practice. Studies across multiple disciplines, including psychology, economics, and computer science, frequently report a lack of reproducibility even when materials are available. This issue underscores the necessity of developing mechanisms to ensure reliability in scientific findings. Given recent advances in AI, the potential for AI agents to automate and streamline reproducibility tasks is immense, necessitating a robust benchmark to measure their performance accurately.
Construction and Features of CORE-Bench
CORE-Bench consists of 270 tasks derived from 90 scientific papers across computer science, social science, and medicine. Tasks vary in complexity and modality, ranging from purely text-based questions to questions that require interpreting visual outputs such as figures. The benchmark is organized into three difficulty levels:
- CORE-Bench-Easy: Agents are provided with the complete output from a successful code run.
- CORE-Bench-Medium: Agents receive a Dockerfile and text instructions to execute the code.
- CORE-Bench-Hard: Agents are only provided with a README file, requiring them to install dependencies and figure out execution procedures independently.
This design ensures that CORE-Bench evaluates a diverse set of capabilities, from information extraction to complex environment interactions.
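To make this setup concrete, the sketch below shows one hypothetical way a single task record could be represented; the class, field names, and schema are illustrative assumptions for this essay, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Difficulty(Enum):
    EASY = "easy"      # agent sees the complete output of a successful code run
    MEDIUM = "medium"  # agent gets a Dockerfile plus text instructions to run the code
    HARD = "hard"      # agent gets only the repository README


@dataclass
class ReproTask:
    """Illustrative record for one reproducibility task (hypothetical schema)."""
    paper_id: str                             # identifier of the source paper
    field_of_study: str                       # "computer science", "social science", or "medicine"
    difficulty: Difficulty                    # which of the three levels this task belongs to
    questions: list[str]                      # questions about the results the agent must reproduce
    capsule_files: list[str]                  # code/data files exposed to the agent at this level
    expected_answers: Optional[dict] = None   # held out from the agent; used only for scoring
```

In this sketch, harder levels simply expose fewer of the capsule's artifacts to the agent, mirroring the progression from reading existing output to reproducing it from scratch.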
Evaluation System and Baseline Agents
The paper details an innovative evaluation system that allows for fast and parallel assessment of agents, significantly reducing the evaluation time compared to sequential implementations. Two baseline agents were tested: a general-purpose AutoGPT and a task-specific variant called CORE-Agent. Both agents were evaluated using two underlying LLMs: GPT-4o and GPT-4o-mini.
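The engineering point behind the speedup is that each task capsule is independent, so evaluations can run side by side rather than one after another. The snippet below is a minimal sketch of that idea, assuming a hypothetical containerized agent image (`repro-agent:latest`) and entry point (`agent.py`); it is not the actual CORE-Bench harness, whose interfaces differ.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_one_task(task_dir: str, timeout_s: int = 3600) -> dict:
    """Run an agent against one task capsule in an isolated container (names are hypothetical)."""
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{task_dir}:/capsule",   # mount the task's code/data capsule
             "repro-agent:latest",           # hypothetical agent image
             "python", "agent.py", "--capsule", "/capsule"],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"task": task_dir, "ok": proc.returncode == 0, "output": proc.stdout}
    except subprocess.TimeoutExpired:
        return {"task": task_dir, "ok": False, "output": "timed out"}


def evaluate_in_parallel(task_dirs: list[str], max_workers: int = 8) -> list[dict]:
    """Launch independent task runs concurrently rather than sequentially."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one_task, d) for d in task_dirs]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

Because each container is isolated, a crash or dependency conflict in one task cannot contaminate the others, which is what makes this kind of parallel speedup safe.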
Key Results and Implications
The best-performing agent, CORE-Agent backed by GPT-4o, achieved 60% accuracy on CORE-Bench-Easy but only 21% on CORE-Bench-Hard, highlighting current limitations and considerable scope for improvement. The authors note that even simple task-specific modifications can significantly enhance performance, especially for weaker LLMs. These findings are crucial given the substantial performance drop observed at higher difficulty levels, emphasizing the challenges that current AI agents face in automating complex reproducibility tasks.
Practical and Theoretical Implications
Practically, CORE-Bench aims to facilitate several critical activities in the scientific community. Authors can use it to verify the reproducibility of their work before publication, while independent researchers and reviewers can quickly assess reproducibility claims. Theoretically, the benchmark provides a structured approach to evaluating and developing AI agents, contributing to advancements in AI's capability to support and enhance scientific research.
Future Directions
The benchmark opens several avenues for future research and development. Key among these is improving the robustness and generalizability of AI agents to handle diverse and unforeseen scenarios encountered in real-world reproducibility tasks. Additionally, as AI technology evolves, incorporating stronger models and further refining task-specific strategies could lead to notable improvements. The potential to periodically update CORE-Bench with new tasks ensures its relevance and applicability over time.
Conclusion
CORE-Bench embodies a significant step towards leveraging AI to enhance scientific reproducibility, providing a structured, scalable, and rigorous evaluation framework. While current AI agents exhibit promising capabilities, the findings from this benchmark underscore the need for continued development and innovation. By facilitating the automation of reproducibility tasks, CORE-Bench could play a pivotal role in ensuring the credibility and reliability of computational research, thus fostering greater trust in scientific findings.