Reliability of LLM-based agents for end-to-end reproduction of scientific papers
Determine whether AI agents powered by large language models can reliably reproduce, end to end, the computational results of published scientific papers when given only the paper content. Doing so requires the agent to autonomously extract the methodology, implement the described algorithms from scratch, execute the full pipeline, and generate quantitative outputs that match the original publication.
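To make "quantitative outputs that match the original publication" concrete, the following is a minimal sketch of one possible matching check. It is an illustration only, not the scoring method of any benchmark: the function names `matches_published` and `reproduction_succeeded`, the dictionary layout, and the 5% relative tolerance are all assumptions introduced here.

```python
import math


def matches_published(reproduced, published, rel_tol=0.05):
    """Compare each published quantity with the agent's reproduced value.

    Both arguments are hypothetical dicts mapping a quantity name to a float.
    The 5% default tolerance is an arbitrary illustrative choice.
    Returns a dict mapping each published quantity name to True or False.
    """
    report = {}
    for name, target in published.items():
        value = reproduced.get(name)
        # A quantity the agent never produced counts as a failure.
        report[name] = value is not None and math.isclose(
            value, target, rel_tol=rel_tol
        )
    return report


def reproduction_succeeded(reproduced, published, rel_tol=0.05):
    """Succeed only if every published quantity is matched within tolerance."""
    return all(matches_published(reproduced, published, rel_tol).values())
```

Under this sketch a run counts as a successful reproduction only when every published quantity is present and within tolerance. A purely relative tolerance rejects any nonzero deviation from a published value of exactly zero, so a real evaluation would likely also need an absolute tolerance and per-quantity criteria.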
References
However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question.
— PRBench: End-to-end Paper Reproduction in Physics Research
(2603.27646 - Qiu et al., 29 Mar 2026) in Abstract (page 1)