Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Published 1 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.01128v1)

Abstract: This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces PaperRecon, a framework that quantifies both presentation quality and hallucination in AI-generated scientific writing.
It employs a section-wise rubric and agentic verification to isolate factual inaccuracies and improve evaluation fidelity.
Findings reveal a trade-off where models excelling in presentation tend to generate more hallucinations, underlining key design challenges.

Paper Reconstruction Evaluation for AI-Generated Scientific Writing

Motivation and Problem Context

Systematic and high-fidelity evaluation of AI-driven scientific writing, particularly by coding agents, presents critical challenges for research rigor and academic integrity. The proliferation of agent-authored submissions in top conferences heightens the urgency for reliable metrics that can discern not only presentational adequacy but also ground factual correctness and minimize hallucination risk. Existing approaches predominantly focus on review-based evaluations, which inadequately penalize severe fabrications and exhibit a dangerous tendency for inflated assessment of hallucinated or unsound content. The lack of an orthogonal, fine-grained evaluation protocol leaves a major gap for both empirical tracking of agent progress and risk abatement in the integration of automated writing into the research publication pipeline.

PaperRecon: Evaluation Framework and Methodology

PaperRecon is introduced as a principled, dual-axis evaluation framework for quantifying the strengths and deficiencies of AI scientific writing agents. The core protocol involves compressing a published paper into a succinct, structured research overview—augmented only with minimalistic resources (figures, tables, code, and references)—and tasking an agent to reconstruct the target manuscript from these constraints.

Figure 1: The paper reconstruction process provides a compressed paper summary and minimal resources as input for an agent, which then generates a full write-up; outputs are compared along axes of presentation quality and hallucination.

The approach isolates the writing generation capability from upstream tasks such as experimental design or reference retrieval, thus controlling for confounding factors. The reconstructed output is compared section-wise with the ground-truth manuscript to produce orthogonal measurements along two primary axes:

Presentation Quality — Assessed via rubric-based evaluation, decomposed by paper section, which scores the accuracy, clarity, and faithfulness of the reconstructed narrative to the original contributions and argumentation structure.
Hallucination — Rigorous claim-by-claim inspection is performed using LLM-based classification and agentic verification phases, isolating both major and minor factual contradictions and fabrications with reference to the full source.
Figure 2: The PaperRecon pipeline decomposes evaluation into rubric-based presentation assessment and agentic hallucination analysis, leveraging direct comparison to ground-truth papers.

The evaluation is further enhanced by citation-level analysis, detecting citation hallucinations, omissions, and erroneous attributions. All metrics are operationalized within PaperWrite-Bench, a benchmark suite curated from 51 post-2025 top-venue papers spanning ML, CV, NLP, and multimedia, supporting statistically robust agent assessment.

Experimental Results and Quantitative Findings

Comprehensive experiments evaluate Claude Code (Sonnet 4/4.6), Claude Code Agent Teams, and Codex (GPT-5/5.4). Key results demonstrate:

Claude Code achieves superior presentation quality, consistently attaining higher rubric scores across all paper sections relative to Codex. The best configuration (Sonnet 4.6) achieves a mean section score of 3.86/5, indicating effective structural and narrative preservation.
Codex exhibits lower hallucination rates. Codex (GPT-5.4) restricts major hallucinations to an average of 3 per paper, compared with over 10 for Claude Code agents, evidencing a clear trade-off: models that better emulate human writing form frequently incur greater factual risk.
Citation accuracy trends mirror hallucination trade-offs: while Claude-based agents have higher F1 on citation coverage, Codex produces fewer hallucinated attributions.
All agent families show monotonic improvement with backbone advancement (e.g., GPT-5 to GPT-5.4), evidencing the value of PaperRecon-derived metrics for longitudinal benchmarking.

These results are validated through human blind-review studies and manual hallucination verification, where rubric-based assessments exhibit a strong Kendall's $\tau_b = 0.578$ with human preference ranking, and over 96% of detected hallucinations correspond to genuine contradictions.

Analysis, Ablations, and Domain Insights

Further analysis explores the influence of research overview granularity and disciplinary context:

Increasing the detail in provided overviews systematically reduces hallucination rates and enhances presentation scores for all agents, demonstrating that input specificity is tightly coupled with achievable output fidelity.
Domain-specific evaluation reveals NLP-focused papers score highest on both axes, likely attributable to reduced mathematical and architectural complexity in this subset; conversely, multimedia and CV sections expose more structural and factual reconstruction errors.

The framework explicitly considers agent error typologies, such as error propagation in multi-step section reconstruction, and identifies current LLM/agent weaknesses in handling deep domain-specific reasoning and in grounding factually consistent numerical evidence.

Implications and Future Trajectory

PaperRecon establishes the validity and necessity of section-level, claim-grounded comparative evaluation for agentic scientific writing. The dual-axis decomposition of metrics offers clear utility for triaging agent design trade-offs, model selection, and future research on hallucination-elimination. Results reveal that advances in language agent capabilities currently incur nontrivial, and often subtle, risks for factual distortion—a contradiction to the intuition that increased model size and fluency necessarily improve reliability.

Practical implications are direct: a shift is mandated from acceptance-focused or "reviewer fooling" protocols to granular, source-grounded measurement for review-automation, publishing acceptance, and real-world agent deployment. The approach creates a testbed for the future integration of agent teams, iterative retrieval, and human-in-the-loop protocols. The limitations of fixed resource provision and the challenges of style-diversity and cross-field generalization further motivate investigations into more robust, adaptive evaluation mechanisms and the integration of retrieval/verification augmentation—both to reduce hallucination rates and to extend to settings with greater diversity of writing style and structure.

Conclusion

PaperRecon provides a rigorous, orthogonally decomposed framework for the systematic evaluation of AI-driven scientific paper generation, with strong empirical evidence establishing its discriminative validity and actionable insights into the trade-off between presentation quality and factual reliability (2604.01128). The findings call for continuous methodological refinement as agent systems become increasingly embedded in science automation workflows and highlight the enduring necessity for robust, transparent, source-grounded benchmarks in AI scientific writing.

Markdown Report Issue