REPRO-Bench: Agentic AI Benchmark
- REPRO-Bench is a publicly available benchmark that evaluates AI systems for end-to-end reproducibility assessment in social science research.
- It requires agents to parse heterogeneous documents, execute provided analyses, and compare outputs with reported findings.
- The benchmark features a structured four-phase workflow and a four-level scoring system that mirrors real-world complexity.
REPRO-Bench is a publicly available benchmark designed to evaluate and accelerate the development of agentic AI systems capable of end-to-end reproducibility assessment in social science research. Unlike prior tasks that focus only on running code with provided data, REPRO-Bench requires agents to parse heterogeneous documents, reproduce results, and verify their consistency with the reported findings under realistic, multi-format conditions. The benchmark comprises 112 task instances—each corresponding to an individual social science paper with an associated reproduction report—and is intended to mirror the complexity and diversity of real-world reproducibility evaluations (Hu et al., 25 Jul 2025).
1. Motivations and Distinctive Challenges
The impetus for REPRO-Bench is rooted in the high labor and time costs of manual reproducibility assessments in the social sciences. Existing benchmarks mostly test the ability to re-execute code on data provided by authors, stopping short of evaluating concordance between the outputs and the reported findings. They also sidestep real-world complexity by delivering pre-processed or filtered contexts and by supporting only a limited set of data formats and programming languages (typically Python, R, or, less frequently, Stata). In contrast, REPRO-Bench supplies diverse inputs, including full papers in PDF and reproduction packages spanning multiple code and data formats, and requires agents to autonomously parse and reason over the complete scientific workflow.
Benchmarks that address only code re-execution fail to identify discrepancies between the original publication and its reproduction, potentially missing errors in analysis or reporting that are undetectable without a full end-to-end review. REPRO-Bench addresses these gaps by requiring agents to assess the degree of consistency between regenerated outputs and key findings, thereby establishing a higher standard for AI-assisted reproducibility evaluation.
2. Benchmark Structure and Task Formulation
Each REPRO-Bench instance consists of three core components:
- The full scientific paper in PDF format.
- A reproduction package, encapsulating all data files, analysis scripts/code, and supplementary documentation.
- An explicit list of the paper’s main findings, such as tables, figures, and textual result statements.
The agent’s task is to review the material, reproduce the analysis using the provided code and data, and compare the resulting outputs to the claims and results in the original paper. The agent then assigns a reproducibility score according to a four-level ordinal scale: 1 (“major findings are irreproducible”), 2 (“minor code inconsistencies”), 3 (“minor reporting issues”), and 4 (“fully reproducible”).
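To make the task formulation concrete, the sketch below represents one instance together with the four-level scale in Python; the field names and example findings are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical representation of a single REPRO-Bench instance; the field
# names and example findings below are illustrative, not the benchmark's
# actual schema.
@dataclass
class ReproTask:
    paper_pdf: str                 # full paper in PDF format
    reproduction_package: str      # directory with data, code, and documentation
    main_findings: list = field(default_factory=list)  # claims to verify

# The four-level ordinal scale onto which the agent maps its judgment.
SCORE_RUBRIC = {
    1: "major findings are irreproducible",
    2: "minor code inconsistencies",
    3: "minor reporting issues",
    4: "fully reproducible",
}

task = ReproTask(
    paper_pdf="paper.pdf",
    reproduction_package="reproduction_package/",
    main_findings=["Table 2: main treatment effect", "Figure 3: trend over time"],
)
```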
This design ensures that the benchmark captures the practical workflow used by expert evaluators, which involves inspection of all files, execution of code, and systematic result comparison. The input diversity (e.g., multiple programming languages and file formats) and the necessity to extract nuanced claims from full PDFs together present realistic and substantial technical challenges.
3. Agent Evaluation and Existing Baselines
REPRO-Bench facilitates the comparative assessment of agentic AI systems in the following manner:
- Agents are required to parse the task’s PDF and associated reproduction package, execute the analysis pipeline, and score reproducibility.
- Outputs are encoded in a standardized JSON file (“reproducibility_score.json”) containing a single integer reflecting the assigned reproducibility level.
- Performance is measured by accuracy, i.e., the fraction of agent-generated scores matching the ground-truth values contained in crowdsourced or expert reproduction reports.
The benchmark also tracks technical compliance (e.g., output-format validity and adherence to the required file structure) and records “applicability” failures, such as producing an invalid output file or failing to complete the task; a minimal sketch of the expected output and the accuracy metric follows.
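The sketch below illustrates the output artifact and the accuracy metric, assuming the JSON file stores the integer under a key named reproducibility_score (the key name is an assumption made for illustration):

```python
import json
from pathlib import Path

def write_score(output_dir: str, score: int) -> None:
    """Write the agent's final judgment to reproducibility_score.json."""
    assert score in {1, 2, 3, 4}, "score must lie on the four-level ordinal scale"
    payload = {"reproducibility_score": score}  # key name assumed for illustration
    Path(output_dir, "reproducibility_score.json").write_text(json.dumps(payload))

def accuracy(predicted: list, ground_truth: list) -> float:
    """Fraction of instances whose predicted score matches the reproduction report."""
    assert predicted and len(predicted) == len(ground_truth)
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(ground_truth)

# Example: accuracy([4, 1, 2, 3], [4, 1, 3, 3]) == 0.75
```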
Baseline evaluations of three representative agents—AutoGPT, CORE-Agent, and SWE-Agent—reveal low overall performance. The top-performing existing agent (CORE-Agent) achieves a maximum accuracy of 21.4%. This is only marginally below the ∼25% rate expected by random guessing on a four-class task, evidencing the substantial challenge even for advanced AI agents.
4. Advances in Agent Design: REPRO-Agent
Drawing on error analysis and empirical findings from earlier baselines, the authors introduce REPRO-Agent, an advanced system that incorporates several strategic improvements:
- A four-phase structured workflow template that organizes agent reasoning: initial inspection, code/data parsing, execution, and results comparison (a minimal sketch follows this list).
- A “dummy score prediction” fallback mechanism to ensure outputs are always returned, mitigating failure modes arising from ambiguous or incomplete outcomes.
- Incorporation of few-shot in-context demonstrations that highlight common error types, such as handling of Stata-specific log file errors or misinterpretations of directory structure.
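The sketch below shows how the four phases and the dummy-score fallback might fit together; the helper functions are placeholders rather than the published REPRO-Agent implementation, and the fallback value is an assumption.

```python
# Hedged sketch of the four-phase workflow with the dummy-score fallback.
# Helper functions are stand-ins, not the published REPRO-Agent code, and
# DEFAULT_SCORE is an assumed fallback value.
DEFAULT_SCORE = 1

def inspect_package(task: dict) -> dict:
    """Phase 1: initial inspection of the paper and reproduction package (stub)."""
    return {"files": task.get("files", [])}

def parse_code_and_data(task: dict, notes: dict) -> dict:
    """Phase 2: parse code/data and plan the reproduction (stub)."""
    return {"entry_point": notes["files"][0] if notes["files"] else None}

def execute_analysis(plan: dict) -> dict:
    """Phase 3: run the analysis pipeline (stub)."""
    if plan["entry_point"] is None:
        raise RuntimeError("no runnable script found")
    return {"tables": {}}

def compare_with_findings(outputs: dict, findings: list) -> int:
    """Phase 4: compare regenerated outputs against the reported findings (stub)."""
    return 4 if outputs["tables"] else 2

def assess_reproducibility(task: dict) -> int:
    try:
        notes = inspect_package(task)
        plan = parse_code_and_data(task, notes)
        outputs = execute_analysis(plan)
        return compare_with_findings(outputs, task.get("findings", []))
    except Exception:
        # Dummy-score fallback: always emit a valid score rather than no answer.
        return DEFAULT_SCORE

print(assess_reproducibility({"files": []}))              # hits the fallback -> 1
print(assess_reproducibility({"files": ["analysis.do"]})) # runs the stubs   -> 2
```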
These additions collectively drive REPRO-Agent’s accuracy to 36.6%, reflecting a 71% relative improvement over the previously best baseline. Nonetheless, this remains well below practical thresholds for reliable automation, indicating the need for agentic systems with more advanced reasoning, context extraction, and verification capabilities.
5. Technical Complexity and Heterogeneity
A major feature of REPRO-Bench is its deliberate inclusion of heterogeneous data and analysis pipelines. Source papers may use R, Python, Stata, or combinations thereof, and input files can span spreadsheets, plain text, proprietary data, or structured data in compressed archives. Reproduction packages often reflect real-world messiness, including missing documentation, ambiguous file hierarchies, or partial code availability.
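As an illustration of the triage an agent must perform before running anything, the sketch below inventories a reproduction package by file type; the extension-to-ecosystem mapping is an assumption made for illustration, not part of the benchmark.

```python
from collections import Counter
from pathlib import Path

# Illustrative mapping from file extensions to analysis ecosystems; the
# extension list is an assumption, not taken from the benchmark.
EXT_TO_KIND = {
    ".r": "R script", ".rmd": "R markdown",
    ".py": "Python script", ".ipynb": "Jupyter notebook",
    ".do": "Stata do-file", ".dta": "Stata dataset",
    ".csv": "delimited data", ".xlsx": "spreadsheet",
    ".zip": "compressed archive", ".pdf": "document",
}

def inventory(package_dir: str) -> Counter:
    """Count file kinds in a reproduction package to gauge its heterogeneity."""
    kinds = Counter()
    for path in Path(package_dir).rglob("*"):
        if path.is_file():
            kinds[EXT_TO_KIND.get(path.suffix.lower(), "other")] += 1
    return kinds

# e.g., inventory("reproduction_package/") revealing mostly .do and .dta files
# signals a Stata-based pipeline before any code is executed.
```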
The scoring scheme’s explicit four-point criteria are grounded in practical reproducibility standards:
- Score 1: Major findings are irreproducible.
- Score 2: Minor code inconsistencies.
- Score 3: Minor reporting issues.
- Score 4: Fully reproducible.
Statistical analyses show that superficial attributes, such as file size or the paper’s page count, have negligible correlation with reproducibility outcomes (Spearman |ρ| < 0.1), confirming that reproducibility is a complex phenomenon not easily predicted by dataset or paper metadata.
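For reference, such a check can be reproduced in spirit with a Spearman rank correlation; the numbers below are made up purely to illustrate the computation.

```python
import numpy as np
from scipy.stats import spearmanr

# Made-up metadata (paper page counts) and ground-truth scores, purely for
# illustration of the reported correlation check.
page_counts = np.array([18, 42, 27, 35, 51, 22, 30, 44])
repro_scores = np.array([4, 1, 3, 2, 4, 1, 3, 2])

rho, p_value = spearmanr(page_counts, repro_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
# The benchmark's analysis reports |rho| < 0.1 for such superficial attributes,
# i.e., essentially no monotonic association with reproducibility.
```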
6. Impact, Limitations, and Future Directions
By establishing a benchmark with full-document context, code, and data for over 100 real-world studies, REPRO-Bench provides a rigorous foundation for developing, testing, and comparing agentic AI systems for scientific reproducibility assessment. It directly enables research on end-to-end automation of the reproducibility review process, encouraging the design of AI agents capable of complex reasoning, robust document parsing, and programmatic execution across multiple technical environments.
Despite the performance improvements realized with REPRO-Agent, the best current accuracy (36.6%) remains insufficient for practical deployment. This suggests further gains must come from enhanced context management, improved multi-step reasoning, and possibly tighter coupling between information extraction and programmatic validation. Proposed research trajectories include expanding the benchmark to support perturbed or intentionally faulty packages, evaluating generalization on domains such as biology, and partially automating annotation using OCR and LLM-based claim extraction.
7. Access, Community Involvement, and Reproducibility
REPRO-Bench is publicly released at https://github.com/uiuc-kang-lab/REPRO-Bench, with guidelines specifying task format, input structure, and output requirements. The benchmark enables other researchers to contribute to the evaluation and development of AI agents for scientific reproducibility. The inclusion of diverse languages, document types, and analysis paradigms reflects the scope and heterogeneity of empirical research in the social sciences today. By adopting a clear scoring protocol and releasing both data and baseline agents, the benchmark fosters transparency and provides a reproducible testbed for ongoing advancements in AI-driven reproducibility assessment.
REPRO-Bench thus sets a new standard for systematic, automated benchmarking in reproducibility evaluation, providing the first end-to-end, realistic evaluation protocol targeted at the real-world complexities of social science research (Hu et al., 25 Jul 2025).