REPRO-Agent
- REPRO-Agent is an AI-driven system that automates reproducibility assessment in social science by handling complex multi-format research artifacts.
- It employs a four-phase workflow—inspecting files, examining code, iteratively debugging, and comparing results—to deliver nuanced fidelity scores.
- Empirical results show REPRO-Agent achieves 36.6% accuracy and 92.9% applicability, significantly outperforming previous general-purpose agents.
REPRO-Agent is an AI agentic system designed for holistic, automated assessment of research reproducibility in social science, as defined and evaluated in the context of the REPRO-Bench benchmark (Hu et al., 25 Jul 2025). It operates over full scientific papers and reproduction packages, addressing the challenges of real-world, end-to-end reproducibility verification where code, data, and documentation are interlinked, multi-format, and often require nuanced reasoning for rigorous assessment. Unlike prior reproduction automation approaches, which typically limit themselves to running code or comparing outputs in isolation, REPRO-Agent explicitly targets a broad spectrum of fidelity and consistency, incorporating structured planning, fallback mechanisms, and informed error handling to deliver reproducibility scores aligned with expert human judgment.
1. Motivation and Problem Setting
Reliable reproducibility verification is fundamental in social science, where manual efforts are prohibitively costly and labor-intensive, often involving hundreds of experts and multi-year time horizons. Existing agentic AI systems, such as AutoGPT, CORE-Agent, and SWE-Agent, have been shown to perform poorly in this domain, with best-case accuracies under 22% on REPRO-Bench’s multi-phase, real-world reproduction tasks. These tasks challenge agents to integrate information from lengthy papers (average 29 pages), large uncurated reproduction packages (mean 4.2 GB, ≈142 files), and heterogeneous configurations spanning multiple programming languages (e.g., R, Stata, Python).
REPRO-Agent is introduced to address the observed failure modes of general-purpose agents: poor handling of error messages, oversimplification of reproducibility into binary outcomes, and limited ability to reason over cross-language, multi-modal research artifacts. The agent implements a workflow that mirrors expert review practice and makes distinctions that go beyond mere code execution.
2. Benchmark Overview: REPRO-Bench
REPRO-Bench functions both as the testbed and the task specification for REPRO-Agent. It contains 112 instances, each corresponding to a social science paper with a public reproduction report, the original PDF, and a complex reproduction package.
The benchmark is distinct in three respects:
- Complexity. Real-world variability is retained, including multi-format code packages (R, Stata, Python), scattered and nested directory structures, large datasets, and non-binary error classes such as inconsequential code bugs (score 2), minor output differences (score 3), and irreproducibility (score 1), in alignment with actual assessment protocols.
- Task Breadth. Agents are required to (i) parse and understand complete papers, (ii) investigate data and code integrity, (iii) execute scripts across languages and tooling, and (iv) produce an expert-grade reproducibility score (integer 1–4), written as JSON to the package's root directory for formal evaluation (a minimal sketch of this output step follows this list).
- Ground-Truth Reference. Human expert reports for each instance define the scoring baseline, reflecting not just raw computational output but interpretive issues around partial or flawed reproduction.
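To make the required output concrete, the following is a minimal Python sketch of the score-emission step consistent with the task description above. The file name (report.json), the JSON key, and the label for score 4 are illustrative assumptions; the benchmark only specifies that an integer score from 1 to 4 is written as JSON to the package's root directory.

```python
import json
from pathlib import Path

# Score semantics as described in the benchmark overview above; the label
# for 4 (full reproduction) is inferred from the 1-4 scale, not stated verbatim.
SCORE_LABELS = {
    1: "irreproducible",
    2: "inconsequential code bugs",
    3: "minor output differences",
    4: "full reproduction (assumed)",
}

def emit_score(package_root: str, score: int) -> None:
    """Write the final reproducibility score as JSON to the package root."""
    assert score in SCORE_LABELS
    out = {"reproducibility_score": score, "label": SCORE_LABELS[score]}
    # "report.json" and the key names are hypothetical placeholders.
    Path(package_root, "report.json").write_text(json.dumps(out, indent=2))
```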
These design elements dramatically increase the difficulty and fidelity compared to prior reproduction benchmarks.
3. REPRO-Agent: Workflow and Mechanisms
REPRO-Agent distinguishes itself through integrated planning, robust error handling, and critical reasoning mechanisms optimized for the REPRO-Bench protocol.
Structured Planning (Success-Case Template):
- The agent follows a four-phase workflow:
  1. Initial inspection of directory structure and file presence;
  2. Code examination to detect language, dependencies, and error-prone segments;
  3. Script execution and iterative editing to resolve errors or uncover hidden issues;
  4. Comprehensive result comparison, emphasizing not only binary success but nuanced divergence (e.g., output format differences, minor rounding errors).
This template is operationalized with in-context few-shot examples illustrating common error types, such as hidden Stata logs and misplaced files, drawing directly from the error taxonomy described in REPRO-Bench.
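The skeleton below illustrates this four-phase control flow in Python, with stubbed phases so it runs end-to-end. All helper behavior here is assumed for illustration; in REPRO-Agent the phases are driven by an LLM with in-context examples rather than fixed procedures.

```python
from pathlib import Path

def inspect_files(root: str) -> list:
    """Phase 1: map the directory structure and list every file."""
    return [p for p in Path(root).rglob("*") if p.is_file()]

def examine_code(files: list) -> list:
    """Phase 2: detect candidate scripts by language (.R, .do, .py)."""
    return [f for f in files if f.suffix in {".R", ".do", ".py"}]

def execute_and_debug(scripts: list) -> dict:
    """Phase 3: run scripts, iteratively editing on errors (stubbed here)."""
    return {"ran": [s.name for s in scripts], "errors": []}

def compare_results(run_log: dict, paper_pdf: str) -> int:
    """Phase 4: compare outputs against the paper and return a 1-4 score (stubbed)."""
    return 4 if not run_log["errors"] else 2

def assess_reproducibility(root: str, paper_pdf: str) -> int:
    return compare_results(execute_and_debug(examine_code(inspect_files(root))), paper_pdf)
```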
Fallback Dummy Score and Applicability:
- REPRO-Agent incorporates a “dummy score” emission strategy. When the agent cannot classify an outcome due to insurmountable ambiguity, it still outputs a valid JSON result, boosting applicability rates and ensuring the process can be evaluated systematically.
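A minimal sketch of this fallback follows, assuming a hypothetical output file name and key and an arbitrarily chosen default score; the source states only that a valid JSON result is always produced so every run remains evaluable.

```python
import json
from pathlib import Path

def assess_with_fallback(package_root: str, assess_fn) -> int:
    """Always emit a valid JSON score, even when assessment fails."""
    try:
        score = assess_fn(package_root)   # full four-phase assessment
    except Exception:
        score = 1                         # assumed conservative "dummy" default
    # File name and key are hypothetical placeholders.
    Path(package_root, "report.json").write_text(
        json.dumps({"reproducibility_score": score}))
    return score
```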
Multi-Context Reasoning:
- The workflow supports sequential, multi-modal context: full PDF interpretation, file and log traversal, code and data validation, and the composite synthesis necessary for accurate reproducibility judgment.
These design choices enable REPRO-Agent to address the key failure points in prior agentic systems: understanding and acting on log files, robust directory traversal, and accurately parsing error message relevance across languages and formats.
4. Empirical Results and Comparative Performance
The performance of REPRO-Agent was evaluated comprehensively on REPRO-Bench against established agentic systems.
- Accuracy (percentage of reproducibility scores matching expert ground truth) and applicability (percentage of runs producing a valid, evaluable output) on REPRO-Bench:
| Agent       | Accuracy (%) | Applicability (%) |
|-------------|--------------|-------------------|
| REPRO-Agent | 36.6         | 92.9              |
| CORE-Agent  | 21.4         | 60.7              |
| AutoGPT     | 20.5         | 60.7              |
| SWE-Agent   | 10.7         | 53.6              |
REPRO-Agent demonstrates a 71% relative improvement in accuracy over the strongest baseline (CORE-Agent): (36.6 − 21.4) / 21.4 ≈ 0.71.
- Error Analysis:
- The largest gains accrue from improved handling of output and error logs (especially common in Stata-based/heterogeneous packages) and from the capacity to discriminate among intermediate error types (scores 2 and 3), whereas most baselines collapse to a binary success/failure regime.
- Applicability is notably higher due to the fallback dummy score and valid output strategies, which avoid common format and timeout failures.
- Cost: Per-task cost is tracked via API usage metrics and remains within an order of magnitude of other advanced agentic systems, keeping the approach tractable within practical inference budgets.
5. Technical Challenges and Addressed Pitfalls
REPRO-Agent is engineered specifically to overcome major sources of practical failure in agentic reproducibility assessment:
- File and Directory Structure Ambiguity. Complex, deeply nested, or unusual layouts commonly cause errors both in file detection and log extraction. REPRO-Agent’s structured template explicitly includes recursive inspection and file mapping.
- Error Message Interpretation. Many errors surface only in log files (notably in Stata) or may signal transient or inconsequential issues. The agent is tuned to cross-reference error logs and distinguish fatal from minor discrepancies (a minimal log-scanning sketch follows this list).
- Multi-Language and Multi-Modal Scenarios. Cross-language package dependencies and divergent documentation conventions are addressed through workflow phases that validate across both R and Stata, as well as direct comparison to outputs specified in the reproduction report.
- Non-Binary Scoring Requirement. Human expert practice—requiring discrimination among subtle error types beyond binary outcomes—is operationalized in the template and training, improving alignment to ground-truth expectations.
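The sketch below illustrates the kind of recursive log scanning implied by the first two items, assuming the common Stata convention of reporting errors as "r(<code>);" lines in .log files. Which codes count as fatal versus minor is a contextual judgment the agent makes, not a fixed rule.

```python
import re
from pathlib import Path

# Stata conventionally reports errors as "r(<code>);" lines in its logs.
ERROR_LINE = re.compile(r"^r\((\d+)\);")

def scan_stata_logs(package_root: str) -> dict:
    """Recursively collect Stata error codes found in *.log files."""
    findings = {}
    for log in Path(package_root).rglob("*.log"):
        codes = []
        for line in log.read_text(errors="ignore").splitlines():
            m = ERROR_LINE.match(line.strip())
            if m:
                codes.append(m.group(1))
        if codes:
            findings[str(log)] = codes
    return findings
```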
6. Significance and Broader Implications
Automation of reproducibility assessment via REPRO-Agent points to several key implications and future directions:
- Research Integrity. With applicability roughly 1.5× that of the strongest baseline (92.9% vs. 60.7%) and substantial accuracy gains, REPRO-Agent demonstrates that critical, expert-level reproducibility assessment is feasible for large-scale, automated workflows.
- Standardization. The system’s success on a benchmark encompassing real-world complexity suggests that standardized, agentic reproducibility audits could be adopted as part of research publication pipelines, especially in data- or code-intensive fields.
- Generalizability. Although initially built for REPRO-Bench and social science, the agent’s modular planning, multi-modal reasoning, and robust error handling suggest applications for reproducibility assessment in other empirical domains, particularly those typified by multi-language, multi-format research artifacts and subtle error taxonomies.
- Resource Efficiency. Increased automation can cut down the years and personnel overhead previously necessary for comprehensive reproducibility validation, promoting more consistent and widespread adoption of reproducibility best practices.
7. Limitations and Prospects for Advancement
Despite marked improvements, accuracy remains below human expert parity, with much of the gap attributable to ongoing challenges in deep error interpretation, handling highly unconventional code/data structures, and nuanced causal reasoning. Further research into advanced workflow decomposition, specialized subagents for language/tool-specific error correction, and dynamic planning under high context variance is suggested. Moreover, expanding the set of in-context error types and integrating active learning or interactive verification may help approach or exceed expert benchmark accuracy.
In sum, REPRO-Agent establishes a data-driven, empirically validated standard for automated reproducibility scoring, offering a robust baseline for ongoing research and future enhancements in agentic scientific assessment.