
AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage (2505.20662v2)

Published 27 May 2025 in cs.AI

Abstract: Efficient experiment reproduction is critical to accelerating progress in artificial intelligence. However, the inherent complexity of method design and training procedures presents substantial challenges for automation. Notably, reproducing experiments often requires implicit domain-specific knowledge not explicitly documented in the original papers. To address this, we introduce the paper lineage algorithm, which identifies and extracts implicit knowledge from the relevant references cited by the target paper. Building on this idea, we propose AutoReproduce, a multi-agent framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. AutoReproduce enhances code executability by generating unit tests alongside the reproduction process. To evaluate the reproduction capability, we construct ReproduceBench, a benchmark annotated with verified implementations, and introduce novel evaluation metrics to assess both the reproduction and execution fidelity. Experimental results demonstrate that AutoReproduce outperforms the existing strong agent baselines on all five evaluation metrics by a peak margin of over $70\%$. In particular, compared to the official implementations, AutoReproduce achieves an average performance gap of $22.1\%$ on $89.74\%$ of the executable experiment runs. The code will be available at https://github.com/AI9Stars/AutoReproduce.

Summary

  • The paper introduces AutoReproduce, automating AI experiment reproduction through a structured workflow of literature review, paper lineage, and iterative code development.
  • The framework uses multi-agent collaboration to extract citation-based domain knowledge; on ReproduceBench it reaches an average performance gap of 22.1% relative to official implementations on 89.74% of executable runs.
  • The paper emphasizes reduced manual effort and open science, while outlining future enhancements in execution fidelity and handling complex data preprocessing.

AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

The "AutoReproduce" framework aims to automate the reproduction of AI experiments, a crucial aspect for verifying research outputs and accelerating novel developments. The system proposes a structured workflow that extracts implicit domain knowledge from paper citations and generates executable reproduction code through a multi-agent framework.

Architecture and Workflow

AutoReproduce consists of three main subphases: Literature Review, Paper Lineage, and Code Development. The literature review phase involves comprehensive paper analysis to identify critical aspects of the proposed methods and experiments. The paper lineage phase extracts implicit domain knowledge by analyzing citation relationships and associated repositories. Finally, the code development phase constructs executable code through iterative collaboration between research and code agents (Figure 1).

Figure 1: The workflow of AutoReproduce, decomposed into three subphases. The paper content, instructions, and data processing code (if necessary) are provided for each reproduction task.
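
As a rough, non-authoritative sketch of how the three subphases could be orchestrated, the loop below assumes generic `research_agent` and `code_agent` wrappers plus a `run_unit_tests` callable. The method names, the unit-test gate, and the stopping criterion are assumptions layered on the description above, not the released implementation.

```python
# Minimal sketch of the three-subphase workflow, assuming generic agent
# wrappers. The unit-test gate reflects the paper's emphasis on code
# executability; all names and the stopping rule are assumptions.
def autoreproduce(paper_text: str, instructions: str,
                  research_agent, code_agent, run_unit_tests,
                  max_rounds: int = 5) -> str:
    # 1. Literature review: distill the method and experiment setup.
    review = research_agent.summarize(paper_text, instructions)

    # 2. Paper lineage: gather implicit knowledge from cited works/repos.
    lineage_context = research_agent.collect_lineage(review)

    # 3. Code development: iterative research/code agent collaboration.
    code = code_agent.draft(review, lineage_context)
    for _ in range(max_rounds):
        report = run_unit_tests(code)          # executability check
        if report.all_passed:
            break
        feedback = research_agent.critique(code, report)
        code = code_agent.revise(code, feedback)
    return code
```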

Implementation and Evaluation

To evaluate its efficacy, the authors introduce ReproduceBench, a benchmark of 13 AI papers spanning several sub-domains, each annotated with a verified reference implementation. AutoReproduce is assessed with five metrics covering both alignment with the source paper and execution fidelity, and it outperforms strong agent baselines on all five, by a peak margin of over 70%. Relative to official implementations, it achieves an average performance gap of 22.1% on 89.74% of executable experiment runs.
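
The precise metric definitions are not given in this summary; one plausible reading of the reported performance gap is the mean relative deviation from the official implementation's score over runs that executed successfully, sketched below purely for illustration.

```python
# Hedged sketch of one way to compute an "average performance gap":
# the mean relative difference between the reproduced code's metric and
# the official implementation's metric, over executable runs only.
# The paper's exact definition may differ; this is illustrative.
def average_performance_gap(pairs: list[tuple[float, float]]) -> float:
    """`pairs` holds (reproduced_score, official_score) for executable runs."""
    gaps = [abs(rep - off) / abs(off) for rep, off in pairs if off != 0]
    return 100.0 * sum(gaps) / len(gaps)


# Example with three hypothetical executable runs.
print(average_performance_gap([(0.81, 0.85), (0.72, 0.70), (64.0, 70.0)]))
```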

Discussion and Future Work

The results demonstrate that AutoReproduce effectively automates experiment replication, reducing manual effort while producing executable, faithful reproduction code. However, challenges remain, such as improving execution fidelity and handling complex data preprocessing. Future work may focus on enhancing robustness across diverse AI methodologies and automating additional stages such as data preparation.

Conclusion

AutoReproduce represents a significant development in automating AI experiment reproduction, promoting open science and efficient research validation. This framework not only facilitates reproducibility but also supports the broader adoption of AI advancements across scientific domains. The introduction of ReproduceBench further complements this objective by providing a standard for evaluating future reproduction tools.
