Paper2Code: Automated Reproducible Code
- The paper introduces a novel multi-agent LLM framework that automates converting ML research papers into complete, dependency-correct code repositories.
- It employs a three-stage pipeline of planning, analysis, and generation that enforces modular design and correct dependency management, achieving 44.26% replication on the PaperBench benchmark.
- Its architecture integrates collaborative verification and refinement agents, substantially outperforming prior agent baselines in automated, reproducible research code generation.
Paper2Code is a multi-agent LLM framework designed to automatically transform scientific machine learning papers into complete, functional code repositories. Motivated by persistent barriers in reproducibility and code availability within ML research, Paper2Code operationalizes the end-to-end workflow of software engineering through specialized planning, analysis, and coding agents, producing modular, dependency-correct, and author-quality implementations directly from raw papers. Its architecture mirrors the systematic practices of expert code base construction and is evaluated quantitatively and qualitatively for fidelity and completeness on established benchmarks (Seo et al., 24 Apr 2025).
1. Problem Motivation and High-Level Objectives
Code implementations are often missing from newly published ML papers, undermining reproducibility and impeding research progress. While recent LLMs excel at natural language understanding and code synthesis, direct translation of research papers into working repositories remains an unsolved challenge. Paper2Code addresses this gap with the following explicit objectives:
- Automate the full workflow of reproducing ML papers without any human-written code artifacts.
- Simulate the modular software engineering lifecycle via collaborative LLM agents.
- Produce repositories that are structurally correct, modular, and configuration-driven.
- Benchmark against model-based and human author evaluations, using author-released “oracle” code for ground truth (Seo et al., 24 Apr 2025).
2. Three-Stage Multi-Agent Pipeline
The pipeline is structurally decomposed into three sequential agent-driven stages: planning, analysis, and generation.
- Planning: The Planning Agent parses the input paper and outputs:
  - Overall Plan: a roadmap of components, algorithms, datasets, and metrics.
  - Architecture Design: a UML-style file list with class and sequence diagrams.
  - Logic Design: an ordered file list derived from dependency analysis.
  - Config Generation: a `config.yaml` file containing all experiment hyperparameters and paths.
- Analysis: For each file in the ordered list, the Analysis Agent produces a specification detailing methods, inputs/outputs, and constraints, as inferred from the paper and the planning phase.
- Generation (Coding): The Coding Agent consumes the prior artifacts to emit executable source code for each file, integrating past generated code components.
Formally, writing $R$ for the input paper, $\mathcal{P}$ for the planning artifacts, $\mathcal{S}$ for the per-file specifications, and $\mathcal{C}$ for the generated repository:

$$\mathcal{P} = M_{\mathrm{plan}}(R), \qquad \mathcal{S} = M_{\mathrm{analysis}}(R, \mathcal{P}), \qquad \mathcal{C} = M_{\mathrm{coder}}(R, \mathcal{P}, \mathcal{S}).$$

The workflow is a linear chain: Input Paper $R$ → Planning Agent $M_{\mathrm{plan}}$ → Plan $\mathcal{P}$ → Analysis Agent $M_{\mathrm{analysis}}$ → File Specs $\mathcal{S}$ → Coding Agent $M_{\mathrm{coder}}$ → Code $\mathcal{C}$.
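A minimal orchestration sketch of this chain follows; the `Artifacts` container, `call_llm` helper, file ordering, and prompt wording are illustrative assumptions, not the released Paper2Code implementation.

```python
# Minimal sketch of the planning -> analysis -> coding chain.
# call_llm(), Artifacts, and the prompt wording are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class Artifacts:
    plan: str = ""                                        # roadmap, architecture, logic design, config
    specs: dict[str, str] = field(default_factory=dict)   # file path -> specification
    code: dict[str, str] = field(default_factory=dict)    # file path -> generated source

def call_llm(prompt: str) -> str:
    """Placeholder for a backbone-LLM call (e.g., o3-mini-high behind an API client)."""
    raise NotImplementedError

def run_pipeline(paper_text: str, file_order: list[str]) -> Artifacts:
    art = Artifacts()
    # Stage 1: planning produces the overall plan, architecture, logic design, and config.
    art.plan = call_llm(f"Plan a repository that reproduces this paper:\n{paper_text}")
    # Stage 2: analysis turns each planned file into a detailed specification.
    for path in file_order:
        art.specs[path] = call_llm(
            f"Paper:\n{paper_text}\nPlan:\n{art.plan}\nWrite an implementation spec for {path}."
        )
    # Stage 3: coding emits each file in dependency order, conditioning on files already written.
    for path in file_order:
        previous = "\n\n".join(f"# {p}\n{src}" for p, src in art.code.items())
        art.code[path] = call_llm(
            f"Spec for {path}:\n{art.specs[path]}\nPreviously generated files:\n{previous}\nWrite {path}."
        )
    return art
```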
3. Agent Specialization and File Dependency Management
Each agent executes a defined logic reflecting its functional role:
- Planning Agent: Generates all planning artifacts via summary, architectural synthesis, topological file ordering, and hyperparameter collation. For dependency management, it builds a DAG of file imports and uses topological sorting to ensure the correct build order (a minimal ordering sketch follows this list). The configuration file is assembled by extracting all paper-specific parameters and paths (Seo et al., 24 Apr 2025).
- Analysis Agent: For every file in the ordered list, it refines the implementation sketch into a full specification, listing all functions, classes, APIs, and design constraints.
- Coding Agent: Code emission integrates the outputs of prior agents and previously generated files, enabling modular and dependency-respecting implementation.
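The Planning Agent's dependency-ordering step referenced above amounts to a topological sort of the import graph; the sketch below uses Python's standard-library `graphlib` on a hypothetical import graph, not file names taken from the paper.

```python
# Hedged sketch of dependency ordering for planned files; the import graph
# below is hypothetical and stands in for what the Planning Agent extracts.
from graphlib import TopologicalSorter

# Each file maps to the set of files it imports (its prerequisites).
import_graph = {
    "config.py": set(),
    "dataset.py": {"config.py"},
    "model.py": {"config.py"},
    "trainer.py": {"dataset.py", "model.py"},
    "main.py": {"trainer.py"},
}

# static_order() yields every file after all of its prerequisites,
# which is the generation order handed to the Coding Agent.
build_order = list(TopologicalSorter(import_graph).static_order())
print(build_order)  # e.g. ['config.py', 'dataset.py', 'model.py', 'trainer.py', 'main.py']
```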
Each phase is realized via dedicated prompts and context management chains that minimize hallucinations and scope each substage precisely.
4. System Evaluation, Verification, and Collaborative Refinement
Paper2Code's performance is benchmarked through both automated (model-based) and human (author) evaluations:
- Automated: Reference-based and reference-free LLM-judge scores assess completeness and fidelity with respect to the ground-truth author code or to the paper alone, respectively; sampling stability is ensured using G-Eval protocols (a scoring sketch follows this list).
- Human: Authors directly rank outputs across three frameworks, annotate coverage on data, method, and evaluation axes, and convert ranks to a points system for quantitative analysis (Seo et al., 24 Apr 2025, Lin et al., 2 Dec 2025).
- Collaborative Agents (Verification and Refinement): The framework can be augmented with prompt-free collaborative agents that check and revise outputs at every step. The Verification Agent rates functional requirements from system prompts, while the Refinement Agent iteratively applies the necessary corrections, always in reference to the original prompt, thus achieving alignment and modular compliance. Integrating these agents yields measured improvements (+15% accuracy; +13% completeness over baselines), as validated on both the PaperBench Code-Dev and Paper2CodeBench datasets (Lin et al., 2 Dec 2025); a minimal sketch of such a verify-and-refine loop follows the results table below.
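A reference-free judge in the G-Eval style can be sketched as repeated sampling of a rubric prompt followed by averaging; the 1-to-5 rubric wording, sample count, and `call_llm` placeholder are illustrative assumptions.

```python
# Hedged sketch of reference-free LLM-judge scoring with G-Eval-style averaging.
# The rubric wording, 1-5 scale, and call_llm() are illustrative assumptions.
def call_llm(prompt: str) -> str:
    """Placeholder for a backbone-LLM call, as in the pipeline sketch above."""
    raise NotImplementedError

def judge_repository(paper_text: str, repo_summary: str, n_samples: int = 8) -> float:
    prompt = (
        "Rate from 1 (poor) to 5 (excellent) how completely and faithfully the "
        "repository implements the paper. Answer with a single number.\n"
        f"Paper:\n{paper_text}\n\nRepository:\n{repo_summary}"
    )
    # Sampling the judge several times and averaging stabilizes the score.
    scores = [float(call_llm(prompt)) for _ in range(n_samples)]
    return sum(scores) / len(scores)
```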
PaperBench Results Table
| Model | Replication (%) |
|---|---|
| BasicAgent | 5.1 ± 0.8 |
| IterativeAgent | 16.4 ± 1.4 |
| Paper2Code (Ours) | 44.26 |
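The collaborative verification-and-refinement loop described above can be sketched as a simple loop; the acceptance threshold, reply format, and helper names are illustrative assumptions rather than the published agents.

```python
# Hedged sketch of a verify-and-refine loop over a generated file; the rubric,
# threshold, reply format, and call_llm() placeholder are illustrative assumptions.
def call_llm(prompt: str) -> str:
    """Placeholder for a backbone-LLM call, as in the earlier sketches."""
    raise NotImplementedError

def verify(requirements: str, source: str) -> tuple[float, str]:
    """Ask the judge for a satisfaction score in [0, 1] (first line) plus a critique of gaps."""
    reply = call_llm(
        f"Requirements:\n{requirements}\n\nCode:\n{source}\n\n"
        "On the first line give a satisfaction score in [0, 1]; then list unmet requirements."
    )
    score_line, _, critique = reply.partition("\n")
    return float(score_line), critique

def refine(requirements: str, source: str, critique: str) -> str:
    """Revise the code, always conditioning on the original requirements."""
    return call_llm(
        f"Requirements:\n{requirements}\nCritique:\n{critique}\nRevise this code accordingly:\n{source}"
    )

def verify_and_refine(requirements: str, source: str, max_rounds: int = 3, threshold: float = 0.9) -> str:
    for _ in range(max_rounds):
        score, critique = verify(requirements, source)
        if score >= threshold:   # accepted: requirements judged satisfied
            break
        source = refine(requirements, source, critique)
    return source
```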
5. Implementation Methodology and Practical Considerations
- Backbone LLM: All agents typically use o3-mini-high; benchmarked alternatives include DS-Coder, Qwen-Coder, and DS-Distill-Qwen.
- Preprocessing: Converting papers into structured JSON via `openreview_scraper` and `s2orc-doc2json` greatly improves planning reliability (a loading sketch follows this list).
- Prompt Chaining: Each agent produces discrete artifacts prior to handover, strengthening pipeline modularity and reducing the likelihood of hallucination.
- Debuggability: Manual inspection indicates a minor-fix rate of only 0.48%, confirming the practical usability of generated repositories.
- Modularity: Each generated code file is mapped to a single logical module, in strict accordance with UML/API design derived in planning, facilitating debugging and future extensibility.
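As referenced in the preprocessing bullet above, consuming the structured paper JSON and the generated config.yaml might look like the sketch below; the field names and file paths are illustrative assumptions, not the documented schemas of the tools.

```python
# Hedged sketch: load the structured paper JSON and the generated config.yaml.
# The "title"/"sections" fields and file paths are illustrative assumptions,
# not the documented output schema of s2orc-doc2json or Paper2Code.
import json
import yaml  # PyYAML

def load_paper(json_path: str) -> str:
    """Flatten a structured paper JSON into plain text for the Planning Agent."""
    with open(json_path, encoding="utf-8") as f:
        doc = json.load(f)
    parts = [doc.get("title", "")]
    for section in doc.get("sections", []):
        parts.append(section.get("heading", ""))
        parts.append(section.get("text", ""))
    return "\n\n".join(p for p in parts if p)

def load_config(yaml_path: str) -> dict:
    """Read the generated config.yaml holding hyperparameters and data paths."""
    with open(yaml_path, encoding="utf-8") as f:
        return yaml.safe_load(f)

paper_text = load_paper("paper.json")
config = load_config("config.yaml")
```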
6. Comparative Frameworks and Extensions
Competing and related frameworks include CodeRefine, DLPaper2Code, and AutoP2C:
- CodeRefine improves LLM code implementations by retrospective retrieval and knowledge graph construction for enhanced accuracy (+6.2 pp mean TSED improvement), and features a multi-stage pipeline including chunk extraction, summarization, classification, KG construction, and refinement (Trofimova et al., 23 Aug 2024).
- DLPaper2Code emphasizes direct parsing of visual architectural flow diagrams and tables to produce abstract computational graphs, then emits code in Keras/Caffe, complemented by crowdsourced editing (Sethi et al., 2017).
- AutoP2C extends multimodal parsing (text, images, tables, equations), hierarchical decomposition, and feedback-driven debugging, achieving repository success rates of 8/8 compared to baselines (1/8), with detailed per-class/function completeness metrics (Lin et al., 28 Apr 2025).
7. Limitations and Prospective Directions
While Paper2Code achieves significant performance gains over existing agents and baselines, some limitations persist:
- Context-window limits and modality parsing (especially equations and figures) present challenges for LLMs.
- Error localization and verification remain reliant on prompt engineering and LLM-internal heuristics.
- Pipelines primarily target Python; extension to other programming ecosystems is ongoing.
- Future work may incorporate user-in-the-loop refinement, retrieval-augmented processing, and domain-adaptive prompting for broader scientific applicability (Seo et al., 24 Apr 2025, Lin et al., 2 Dec 2025, Trofimova et al., 23 Aug 2024, Lin et al., 28 Apr 2025).
A plausible implication is that systematic engineering and workflow decomposition, combined with collaborative verification/refinement, will be foundational principles in next-generation paper-to-code systems, with the Paper2Code framework establishing methodological standards and quantitative benchmarks for automated research reproduction.