PaperBench: Evaluating AI's Ability to Replicate AI Research (2504.01848v3)

Published 2 Apr 2025 in cs.AI and cs.CL

Abstract: We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.

Summary

  • The paper introduces a benchmark that evaluates AI agents' ability to replicate experimental results from 20 recent AI research papers using detailed, binary-criteria rubrics.
  • It outlines a methodology where agents generate self-contained Git repositories to reproduce experimental procedures under strict execution conditions.
  • The evaluation highlights a significant performance gap between current AI agents and human experts, emphasizing challenges in long-horizon planning and debugging.

PaperBench introduces a benchmark designed to evaluate the capability of AI agents to replicate empirical results from state-of-the-art AI research papers (2504.01848). The core task involves agents replicating findings from recent publications, specifically 20 papers selected from ICML 2024 Spotlights and Orals, covering diverse machine learning subfields. This evaluation aims to assess complex AI engineering skills, including comprehension of research contributions, development of functional codebases, and execution of experiments to reproduce published results.

Benchmark Design and Task Definition

The benchmark defines a specific replication task for AI agents. Each agent is provided with the PDF or Markdown version of a selected research paper and an optional author-provided addendum containing clarifications or corrections, but not the original source code or the evaluation rubric itself. The agent's objective is to produce a self-contained Git repository. This repository must include all necessary source code, dependencies, and a primary execution script named reproduce.sh.

The evaluation process involves executing this reproduce.sh script within a standardized, controlled environment. In the paper's experiments, this environment was an Ubuntu 24.04 machine with an A10 GPU, and each run was subject to a maximum runtime of 12 hours. This execution phase generates outputs, such as log files (reproduce.log) and result artifacts, which are subsequently used for grading the replication attempt against the paper-specific rubric. This controlled execution ensures reproducibility and prevents agents from simply hardcoding expected results.
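
The harness side of this step can be pictured as a small wrapper that runs the submitted script under the stated constraints. The sketch below is illustrative only: it assumes a repository with reproduce.sh at its root, and the function name, paths, and log handling are hypothetical rather than the benchmark's actual harness code.

```python
import subprocess
from pathlib import Path

def run_submission(repo_dir: str, timeout_hours: float = 12.0) -> int:
    """Hypothetical sketch: execute a submission's reproduce.sh under a runtime limit."""
    repo = Path(repo_dir)
    log_path = repo / "reproduce.log"
    with open(log_path, "w") as log_file:
        # Raises subprocess.TimeoutExpired if the 12-hour limit is exceeded.
        result = subprocess.run(
            ["bash", "reproduce.sh"],
            cwd=repo,                      # run inside the submitted repository
            stdout=log_file,               # capture output for later grading
            stderr=subprocess.STDOUT,
            timeout=timeout_hours * 3600,
        )
    return result.returncode
```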

Rubric Development and Structure

A central component of PaperBench is its detailed, hierarchical rubric system, developed in collaboration with the original authors of the benchmarked papers. This collaboration ensures the rubrics accurately reflect the core contributions and necessary steps for replicating each paper's results realistically. In total, the benchmark comprises 8,316 individually gradable leaf nodes across the 20 papers.

Each rubric follows a tree structure, decomposing the high-level replication goal into progressively smaller, more granular sub-tasks represented by leaf nodes. These leaf nodes are designed to represent precise, binary (pass/fail) criteria that an expert could assess relatively quickly (target < 15 minutes per node). Leaf nodes fall into three categories:

  1. Code Development: Assesses the correctness and presence of specific code implementations or components based on the paper's description.
  2. Execution: Verifies whether specific parts of the submitted code successfully execute via the reproduce.sh script, often checking for expected outputs or log messages indicating successful completion of a stage.
  3. Result Match: Compares the quantitative results generated during the reproduction phase (e.g., model performance metrics, statistical findings) against the results reported in the original paper, within acceptable tolerances defined in the rubric.

Leaf nodes are assigned a score of 0 or 1. These scores are then propagated up the rubric tree using weighted averages, ultimately yielding a single "Replication Score" between 0% and 100% for the entire replication attempt. This hierarchical and granular structure allows for partial credit and detailed diagnosis of failure modes.
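
To make the scoring rule concrete, the sketch below shows one way such a rubric tree and its weighted-average propagation could be represented; the class and field names are hypothetical stand-ins, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    """Illustrative rubric node: leaves carry binary scores, internal nodes aggregate."""
    requirement: str
    weight: float = 1.0
    category: Optional[str] = None   # e.g. "Code Development", "Execution", "Result Match"
    score: Optional[int] = None      # 0 or 1, set only on leaf nodes by the grader
    children: List["RubricNode"] = field(default_factory=list)

    def replication_score(self) -> float:
        # Leaf node: return its binary score directly (unset treated as 0).
        if not self.children:
            return float(self.score or 0)
        # Internal node: weighted average of the children's scores.
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.replication_score() for c in self.children) / total_weight
```

Calling replication_score() on the root node yields an overall value in [0, 1], reported as a percentage Replication Score; partial credit arises naturally because satisfied leaves contribute their weight even when sibling leaves fail.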

Automated Evaluation using LLM Judge

Given the significant cost and time required for manual expert grading across thousands of rubric nodes, PaperBench introduces an automated evaluation workflow centered around an LLM-based judge. The primary judge implementation, termed "SimpleJudge", utilizes the o3-mini model by default.

The automated grading process operates on a per-leaf-node basis. For each node, the LLM judge receives:

  • The research paper content (PDF/Markdown).
  • The author-provided addendum.
  • The full structure of the hierarchical rubric for context.
  • The specific requirement text of the leaf node being graded.
  • Relevant files from the agent's submitted repository (including code and execution outputs like reproduce.log), potentially filtered to manage the LLM's context window limitations.

The judge outputs a binary score (0 or 1) and textual reasoning for its decision regarding the fulfillment of the leaf node's criteria.
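
The sketch below outlines this per-leaf-node grading loop. It assumes a generic call_llm helper standing in for the judge's model backend (e.g. an o3-mini call); the prompt layout, file filtering, and score parsing are illustrative guesses, not SimpleJudge's actual implementation.

```python
def grade_leaf_node(paper_text, addendum, rubric_outline, leaf_requirement, repo_files, call_llm):
    """Hypothetical per-leaf-node grading step; call_llm is an assumed LLM client function."""
    # Keep only files the judge can plausibly fit in context (code, logs, results).
    selected = {path: content for path, content in repo_files.items()
                if path.endswith((".py", ".log", ".json", ".md"))}
    prompt = (
        f"Paper:\n{paper_text}\n\nAddendum:\n{addendum}\n\n"
        f"Full rubric (for context):\n{rubric_outline}\n\n"
        f"Requirement to grade:\n{leaf_requirement}\n\n"
        f"Submission files:\n{selected}\n\n"
        "Answer with a 0/1 score followed by a short justification."
    )
    reply = call_llm(prompt)
    score = 1 if reply.strip().startswith("1") else 0  # naive parse, for illustration only
    return score, reply
```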

To validate the reliability of this automated approach, a separate benchmark called JudgeEval was created. JudgeEval consists of a set of agent submissions graded by human experts, providing gold-standard labels. Automated judges are evaluated against these labels using metrics like the F1 score. The selected configuration (SimpleJudge with o3-mini) achieved an F1 score of 0.83 on JudgeEval, indicating reasonable agreement with human expert judgment and justifying its use for scalable evaluation in PaperBench.
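
As a concrete illustration of the metric, the snippet below computes an F1 score from a judge's binary leaf-node decisions against human gold labels; the example data is invented and does not come from JudgeEval.

```python
def f1_score(gold: list[int], predicted: list[int]) -> float:
    """F1 of binary judge decisions against human gold labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: a judge that agrees with the human grader on most leaf nodes.
print(f1_score([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8
```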

Furthermore, a lightweight variant named PaperBench Code-Dev was introduced. This version focuses exclusively on the Code Development aspects of the rubrics, omitting the execution and result-matching components. This significantly reduces the computational cost (no GPU execution needed) and grading complexity, making participation more accessible.

Experimental Setup and Baselines

The paper presents baseline results for several frontier LLMs configured as AI agents. The top-performing agent utilized Claude 3.5 Sonnet (New) integrated with a basic open-source agent scaffolding ("BasicAgent") designed for planning, coding, execution, and debugging cycles. Another notable agent evaluated was OpenAI o1. Other models were also tested but demonstrated lower performance.

A human baseline was established by recruiting experienced ML PhD students. These human experts attempted to replicate a subset of 3 papers from the benchmark, given a time limit of 48 hours per paper.

Key Findings and Performance Analysis

The evaluation revealed significant limitations in the current capabilities of AI agents for complex research replication tasks.

  • Overall Performance: The best-performing AI agent configuration (Claude 3.5 Sonnet + BasicAgent) achieved an average Replication Score of 21.0% across the 20 papers. The OpenAI o1 agent scored 13.2%. Other tested models scored below 10%.
  • Human Baseline: On the 3-paper subset attempted by ML PhDs, the best-of-3 human attempts achieved an average score of 41.4% within the 48-hour limit.
  • Performance Gap: A substantial gap exists between the best current AI agents and human experts. While agents like o1 demonstrated rapid initial code generation, sometimes outperforming humans in the first hour, their progress quickly plateaued. Humans, although slower initially, showed more sustained improvement and significantly outperformed the o1 agent after 24 hours on a direct comparison subset. This suggests current agents struggle with long-horizon planning, robust execution, complex debugging, and strategic refinement required for end-to-end research replication.
  • Performance Breakdown: Agents generally performed better on Code Development nodes compared to Execution and Result Match nodes. This indicates that while models can generate plausible code snippets based on descriptions, they face greater challenges in integrating these components into a fully working system, successfully executing the required experimental pipeline, and precisely reproducing the numerical results reported in the papers.
  • Scaffolding Sensitivity: Agent performance was noted to be sensitive to the specific scaffolding (e.g., BasicAgent) and prompting strategies employed.

Conclusion

PaperBench provides a structured framework and benchmark for quantitatively assessing the ability of AI agents to perform complex AI research replication tasks. The results indicate that while current frontier models possess nascent capabilities in understanding papers and generating relevant code, they fall significantly short of human expert performance, particularly in executing complete experiments and matching published results. The benchmark, rubrics, and automated judge serve as valuable tools for measuring future progress in developing more capable AI agents for scientific research and AI engineering workflows. The open-sourcing of the benchmark code aims to facilitate further research in this area (https://github.com/openai/preparedness).
