Reliability of LLM-based agents for end-to-end reproduction of scientific papers

Determine whether AI agents powered by large language models can reliably reproduce computational results from published scientific papers end to end when given only the paper content: autonomously extracting the methodology, implementing the described algorithms from scratch, executing the full pipeline, and generating quantitative outputs that match the original publication.
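
As a concrete illustration, here is a minimal Python sketch of that four-stage loop. Every name in it (reproduce, the injected stage callables, the rel_tol comparison) is a hypothetical assumption for illustration; PRBench does not prescribe this interface.

    # Illustrative skeleton of the four-stage loop described above. All names
    # here are assumptions, not part of PRBench; the stage functions are
    # injected so the control flow itself is self-contained and runnable.
    import math
    from typing import Callable

    def reproduce(paper_text: str,
                  reported: dict[str, float],
                  extract: Callable[[str], str],               # 1. extract the methodology
                  implement: Callable[[str], str],             # 2. implement the algorithms
                  execute: Callable[[str], dict[str, float]],  # 3. run the full pipeline
                  rel_tol: float = 0.05) -> bool:
        plan = extract(paper_text)
        code = implement(plan)
        metrics = execute(code)
        # 4. Compare each reported quantity against the paper's reported value.
        return all(
            k in metrics and math.isclose(metrics[k], v, rel_tol=rel_tol)
            for k, v in reported.items()
        )

The injected callables stand in for whatever LLM-driven components an agent actually uses; only the control flow and the final quantitative comparison are shown.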

Background

The paper introduces PRBench, a benchmark designed to evaluate whether AI agents can read real physics papers, implement the described methods, and reproduce the reported numerical results in a sandboxed environment. While LLM-based agents demonstrate capabilities in tasks such as formula derivation and code generation, it remains unresolved whether these capabilities suffice for reliable end-to-end scientific reproduction.

PRBench provides 30 expert-validated tasks across diverse physics subfields and assesses agents on four dimensions: methodology understanding, code correctness, data reproduction accuracy, and task completeness. This open question motivates both the benchmark's design and the empirical evaluation of current systems.
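
A hedged sketch of how those four assessment dimensions might be combined into a single per-task score follows. The 0-to-1 scale and the unweighted mean are assumptions for illustration; the paper's actual scoring rubric is not given here.

    # Hypothetical per-task score over PRBench's four assessment dimensions.
    # The [0, 1] scale and the unweighted mean are assumed, not from the paper.
    from dataclasses import dataclass

    @dataclass
    class TaskScores:
        methodology_understanding: float   # in [0, 1]
        code_correctness: float            # in [0, 1]
        data_reproduction_accuracy: float  # in [0, 1]
        task_completeness: float           # in [0, 1]

    def aggregate(s: TaskScores) -> float:
        """Unweighted mean of the four dimensions (an assumed aggregation)."""
        parts = (s.methodology_understanding, s.code_correctness,
                 s.data_reproduction_accuracy, s.task_completeness)
        return sum(parts) / len(parts)

    # Example: a run that understood the method and finished the task but
    # only partially reproduced the numbers.
    print(aggregate(TaskScores(0.9, 0.7, 0.5, 1.0)))  # prints approximately 0.775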

References

"However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question."

PRBench: End-to-end Paper Reproduction in Physics Research (Qiu et al., arXiv:2603.27646, 29 Mar 2026), Abstract, page 1.