PaperBench: AI Research Replication Benchmark
- The paper introduces a benchmark that challenges AI agents to autonomously replicate entire ML research pipelines from paper comprehension to experimental validation.
- PaperBench decomposes replication into a total of 8,316 atomic grading steps across 20 papers, spanning code development, execution, and result matching, using detailed hierarchical rubrics.
- Empirical results reveal that AI agents, despite using various frameworks, significantly lag behind human experts, highlighting challenges in long-horizon research reproduction.
PaperBench is a benchmark designed to assess the capacity of AI agents to autonomously replicate contemporary machine learning research papers, with a focus on evaluating long-horizon R&D abilities. Distinct from prior reproducibility efforts that concentrate on recreating results given source code, PaperBench tasks agents with reconstructing entire empirical pipelines from scratch: understanding a paper's contributions, developing a cohesive codebase (including a reproduce.sh entry-point script), executing experiments in a clean environment, and matching published results. PaperBench comprises 20 selected ICML 2024 Spotlight and Oral papers, each paired with an author-approved, hierarchically structured rubric, totaling 8,316 atomic grading tasks across the suite. To support scalable evaluation, it introduces an LLM-based judge that grades agent output at fine granularity. Empirical analysis shows that, while leading agents achieve mean replication scores of roughly 20–25%, they remain markedly behind the human expert baseline of 41% (Starace et al., 2 Apr 2025).
1. Benchmark Structure and Scope
PaperBench targets the “long-horizon” automation of modern AI research engineering. Each benchmark task involves a recent, high-impact machine learning publication, typically representing significant empirical advances across various domains including deep reinforcement learning, probabilistic models, robustness, and LLMs. For each of the 20 papers:
- Task Definition: Agents must parse the original paper (PDF/Markdown), synthesize code from its text and equations, structure a working repository with an entry-point script (reproduce.sh), and execute the core experiments, all within an isolated containerized environment (a minimal execution sketch follows this list).
- Tractability and Representativeness: Papers are chosen for both significance and feasibility: they are recent enough to preclude memorization from agents' training data, yet empirically grounded enough to be replicable.
- Objective: The principal measure is an agent's ability to independently re-implement, execute, and reproduce a paper's main contributions, quantifying an essential component of autonomous scientific reasoning.
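As an illustration of the execution setup, the sketch below runs a submission's reproduce.sh inside a fresh Docker container and captures its logs; the image name, mount layout, and helper function are assumptions rather than the official PaperBench harness:

```python
import subprocess
from pathlib import Path

def run_reproduction(submission_dir: str, image: str = "python:3.11", timeout_s: int = 12 * 3600) -> bool:
    """Run a submission's reproduce.sh in a clean Docker container (illustrative sketch only)."""
    submission = Path(submission_dir).resolve()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{submission}:/submission",   # mount the agent's repository
        "-w", "/submission",                 # start in the repository root
        image,
        "bash", "reproduce.sh",              # required entry-point script
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    # Persist logs so that Execution and Result Match rubric leaves can be graded afterwards.
    (submission / "reproduce.log").write_text(result.stdout + result.stderr)
    return result.returncode == 0
```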
2. Task Decomposition and Grading Rubrics
Central to PaperBench is the use of detailed, author-collaborative rubrics that translate the “replicate all contributions” objective into a tree of granular requirements:
- Hierarchical Task Decomposition: The overall replication task is recursively decomposed: top-level nodes correspond to major paper contributions, which split into atomic “leaf” nodes representing gradable, specific requirements. Across all papers, there are 8,316 leaf tasks.
- Leaf Node Types:
- Code Development: E.g., “Is this component implemented correctly in the submitted source code?”
- Execution: E.g., “Does this step execute successfully via reproduce.sh?”
- Result Match: E.g., “Does this output match the original results within specified tolerances?”
- Scoring Propagation: Each leaf node $\ell$ carries a weight $w_\ell$ (reflecting rubric-author input) and receives a binary score $s_\ell \in \{0, 1\}$. Each parent node $v$ aggregates the normalized, weighted average of its children's scores,
  $$ s_v = \frac{\sum_{c \in C(v)} w_c \, s_c}{\sum_{c \in C(v)} w_c}, $$
  where $C(v)$ is the set of $v$'s children. The root score reflects overall replication fidelity for a paper, and the PaperBench score is the mean root score across the 20 papers.
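To make this aggregation concrete, the following minimal sketch (a hypothetical in-memory representation, not the released rubric format) propagates weighted binary leaf scores up a rubric tree:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricNode:
    """A node in a (hypothetical) rubric tree: leaves carry binary scores, parents aggregate."""
    weight: float
    score: float = 0.0                                  # for leaves: 1.0 (pass) or 0.0 (fail)
    children: List["RubricNode"] = field(default_factory=list)

def node_score(node: RubricNode) -> float:
    """Weighted average of child scores; a leaf simply returns its own binary score."""
    if not node.children:
        return node.score
    total_weight = sum(c.weight for c in node.children)
    return sum(c.weight * node_score(c) for c in node.children) / total_weight

# Example: one contribution with two code-development leaves and one result-match leaf.
root = RubricNode(weight=1.0, children=[
    RubricNode(weight=2.0, score=1.0),   # component implemented correctly
    RubricNode(weight=1.0, score=0.0),   # missing component
    RubricNode(weight=3.0, score=1.0),   # results reproduced within tolerance
])
print(node_score(root))  # (2*1 + 1*0 + 3*1) / 6 = 0.833...
```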
3. Automated Judging and Rubric Validation
To render large-scale grading feasible, PaperBench incorporates an LLM-based “SimpleJudge” as the primary evaluator:
- Judge Pipeline: For each rubric leaf, the judge ingests the full rubric structure, leaf requirement text, original paper, any supplemental author clarifications, and a curated subset of submission artifacts (source, logs, outputs).
- Binary Evaluation: The judge returns a pass/fail decision per leaf, alongside a brief rationale; outputs lacking a binary verdict are automatically retried (a minimal grading-call sketch follows this list).
- JudgeEval Validation: SimpleJudge’s accuracy is benchmarked against manual ground-truth annotations (JudgeEval), where the o3-mini-high backend achieves F1=0.83 with the best cost-performance balance (versus alternatives like GPT-4o at higher cost and lower F1) [(Starace et al., 2 Apr 2025), Table 1].
- Scaling: This infrastructure renders it tractable to grade thousands of leaf tasks per agent submission.
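The per-leaf judging flow can be pictured with the following sketch; the prompt wording, retry policy, and llm_call wrapper are assumptions for illustration and do not reproduce SimpleJudge's actual prompts:

```python
import re

PASS_FAIL = re.compile(r"\b(PASS|FAIL)\b")

def grade_leaf(llm_call, requirement: str, paper_text: str, artifacts: str, max_retries: int = 3) -> bool:
    """Ask an LLM judge for a binary verdict on one rubric leaf (illustrative sketch).

    `llm_call` is assumed to be a function str -> str wrapping whichever judge model is used.
    """
    prompt = (
        "You are grading one requirement of a paper-replication rubric.\n"
        f"Requirement: {requirement}\n"
        f"Relevant paper excerpt:\n{paper_text}\n"
        f"Submission artifacts (code, logs, outputs):\n{artifacts}\n"
        "Answer with PASS or FAIL, followed by a brief rationale."
    )
    for _ in range(max_retries):
        reply = llm_call(prompt)
        match = PASS_FAIL.search(reply.upper())
        if match:                      # retry when no binary verdict can be parsed
            return match.group(1) == "PASS"
    return False                       # conservative default after repeated malformed outputs
```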
4. Agent Evaluation and Human Baseline
PaperBench quantitatively evaluates a range of advanced agents on its suite:
- Agent Scaffolds:
- BasicAgent: A ReAct-style agent in an isolated Dockerized environment, equipped with shell, Python, web-browsing, and file-inspection tools and a 12-hour wall-clock budget (a simplified loop sketch appears at the end of this section).
- IterativeAgent: Augments prompting and policies to encourage incremental development and avoid premature submission.
- Models Evaluated: o3-mini-high, o1-high, Claude 3.5 Sonnet, GPT-4o, DeepSeek-R1, Gemini 2.0 Flash.
- Key Results ([(Starace et al., 2 Apr 2025), Table 2]):
| Model | BasicAgent Score (%) | IterativeAgent Score (%) | Max Score (36h) |
|---------------------|----------------------|--------------------------|-----------------|
| o3-mini-high | 2.6 | 8.5 | – |
| o1-high | 13.2 | 24.4 | 26.0 |
| Claude 3.5 Sonnet | 21.0 | 16.1 | – |
| GPT-4o | 4.1 | – | – |
| DeepSeek-R1 | 6.0 | – | – |
| Gemini 2.0 Flash | 3.2 | – | – |
Claude 3.5 Sonnet (BasicAgent) leads standard runs at 21%, but all agents are significantly below the human PhD baseline of 41.4%.
- Learning Dynamics: o1-high outpaces humans in the first hours of work but then plateaus, whereas humans continue to improve and overtake it over multi-day efforts.
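The BasicAgent control flow described above can be approximated by a stripped-down ReAct loop; the tool names, action format, and stopping rule below are assumptions for illustration and do not reproduce the released scaffold:

```python
import time

def react_agent(llm_step, tools: dict, task_prompt: str, budget_s: float = 12 * 3600):
    """Minimal ReAct-style loop: alternate model reasoning with tool calls until time runs out.

    `llm_step` is assumed to map the transcript to a dict like {"tool": name, "args": ..., "done": bool};
    `tools` maps tool names (e.g. "shell", "python", "browser") to callables. Both are assumptions.
    """
    transcript = [task_prompt]
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        action = llm_step("\n".join(transcript))
        if action.get("done"):                     # the model decides the repository is ready to submit
            break
        observation = tools[action["tool"]](action["args"])   # run the tool and observe the result
        transcript.append(f"Action: {action}\nObservation: {observation}")
    return transcript
```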
5. PaperBench Code-Dev and Fine-Grained Reproduction
PaperBench also includes a “Code-Dev” variant that focuses exclusively on the code generation aspect, omitting computationally expensive full experiment reproduction:
- Scope: Grades only "Code Development" rubric leaves, yielding a more accessible and cost-efficient evaluation. On Code-Dev, o1-high achieves 43.4% under IterativeAgent, with a reported Pearson correlation to full-benchmark scores, at a cost of roughly $4,000 per benchmark run.
- Reflective Agents and Verification: The RePro agent (Zhou et al., 21 Aug 2025), applying fine-grained verification with binary paper-derived fingerprints, achieves a 62.6% root pass ratio, outperforming strong baselines (AutoReproduce 49.6%, PaperCoder 45.1%). Iterative refinement via targeted feedback is shown to increase pass rates for up to 4 iterations, with diminishing returns beyond that point (a sketch of such a refine-until-pass loop follows this list).
- Ablation Findings: Omitting fingerprint comprehensiveness or atomicity yields large drops in performance (~6-7 percentage points). Case analyses reveal that RePro’s margin is concentrated on demanding mathematical and algorithmic logic criteria.
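The refine-until-pass idea behind RePro can be sketched as follows; the function signatures and feedback format are assumptions, and the real system's fingerprint extraction and prompting are considerably more elaborate:

```python
from typing import Callable, List

def refine_with_fingerprints(
    generate_code: Callable[[str, str], str],
    check_fingerprint: Callable[[str, str], bool],
    paper_text: str,
    fingerprints: List[str],
    max_iters: int = 4,
) -> str:
    """Iteratively regenerate code until all binary, paper-derived fingerprints pass (sketch).

    `check_fingerprint` is assumed to be an LLM- or test-based verifier returning True/False.
    """
    code = generate_code(paper_text, "")
    for _ in range(max_iters):
        failures = [fp for fp in fingerprints if not check_fingerprint(code, fp)]
        if not failures:
            break                                   # every atomic criterion is satisfied
        # Feed only the failed criteria back as targeted feedback for the next attempt.
        feedback = "Unsatisfied criteria:\n" + "\n".join(failures)
        code = generate_code(paper_text, feedback)
    return code
```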
6. Implementation and Extensibility
PaperBench is released open-source with full infrastructure for reproducible benchmarking:
- Repository Structure:
- rubrics/: Fully weighted, author-endorsed grading rubrics per paper.
- addenda/: Detailed clarifications.
- judge/: SimpleJudge implementation and prompt templates.
- agents/: Modular agent scaffolds (BasicAgent, IterativeAgent).
- evaluation/: JudgeEval data and scripts for performance tracking.
- reproduce/: Docker and VM setup for agent runs.
- Extensibility: The benchmark can be extended by adding new papers (with accompanying rubrics), upgrading the judge model, or developing novel agent scaffolds.
7. Significance and Long-Term Implications
PaperBench establishes an empirical and scalable standard for quantifying AI autonomy in ML research engineering. Key implications:
- Rigorous Measurement: By requiring full-stack research reproduction from natural language, PaperBench exposes current agentic and LLM capabilities and failure modes not addressed by code-matching or static analysis.
- Human–Agent Gap: All tested agents trail expert humans (by roughly a factor of two), and failure often stems from agents plateauing on complex multi-stage engineering tasks.
- Driving Progress: The benchmark, by virtue of its demanding granularity and collaboration with original paper authors, provides a quantitative roadmap for agentic toolchain improvements, diagnostics, and operational safety evaluation.
- Safety and Preparedness: The design and analysis in PaperBench are directly motivated by the needs of AI safety and preparedness evaluation frameworks (e.g., OpenAI’s Preparedness Framework, Anthropic’s Responsible Scaling Policy), especially under scenarios where autonomous agents may drive research progress or introduce novel risks.
A plausible implication is that as both agent frameworks and LLM judges improve, PaperBench will catalyze advances in autonomous research capabilities and rigorous benchmarking methodology, as well as inform policy and oversight for advanced AI systems (Starace et al., 2 Apr 2025, Zhou et al., 21 Aug 2025).