PageBench: Paper-to-Page Benchmark
- PageBench is a benchmark designed to evaluate automated systems that transform academic papers into interactive webpages by providing standardized datasets, metrics, and methodologies.
- It incorporates multifaceted evaluation metrics such as content quality, semantic fidelity, and visual rendering accuracy to ensure both factual precision and visual appeal.
- The benchmark underpins multi-agent systems such as AutoPage, which demonstrates transformation of a paper into a polished page in under 15 minutes, at low compute cost and with rigorous verification protocols.
PageBench is a benchmark designed specifically to evaluate automated systems that transform academic papers into interactive webpages. As the first benchmark for the "paper-to-page crafting" task, PageBench aims to advance both research communication and automated web content generation by establishing standardized datasets, metrics, and methodologies. Closely associated with the AutoPage multi-agent system, PageBench enables fine-grained assessment of content fidelity, factual correctness, visual rendering, and user-perceived quality of generated webpages.
1. Scope and Dataset Composition
PageBench is constructed as a rigorously curated dataset and evaluation protocol for the task of converting academic papers into project webpages. The dataset includes over 1,500 academic papers, each paired with a human-authored project page. These pairs cover a broad spectrum of visual layouts, narrative styles, and domain topics. The benchmark incorporates both the source PDFs and their structured markdown conversions, alongside extracted assets such as figures, tables, equations, and relevant metadata.
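The dataset's exact schema is not spelled out here; the following sketch, with illustrative (not official) field names, shows how one paper-page pair described above might be represented:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class PageBenchSample:
    # One paper-page pair; all field names are illustrative assumptions.
    paper_pdf: Path                                      # source manuscript
    paper_markdown: Path                                 # structured Markdown conversion
    reference_page: Path                                 # human-authored project page
    figures: list[Path] = field(default_factory=list)
    tables: list[Path] = field(default_factory=list)
    equations: list[str] = field(default_factory=list)   # extracted LaTeX strings
    metadata: dict = field(default_factory=dict)         # title, domain, layout tags, ...
```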
Contextually, this dataset fills a gap: previous benchmarks focused on slides, posters, or highly templated web content, lacking coverage of interactive and visually rich web pages mapped directly from scientific manuscripts.
2. Evaluation Metrics and Methodologies
PageBench defines multifaceted metrics for content and visual quality to support systematic benchmarking:
- Content Quality:
  - Readability: Measured using perplexity (PPL) over generated webpage sections.
  - Semantic Fidelity: Assessed as the cosine similarity between vector embeddings of corresponding segments from the source paper and the generated webpage.
  - Compression-aware Information Accuracy: Combines a QA pipeline (automatic generation of factuality-checking questions and answer extraction) with a text compression ratio, rewarding systems for both factual precision and concise, informative output (see the sketch after this list).
- Visual Quality:
  - Visual Content Accuracy: Validates rendering fidelity of essential assets such as equations, tables, and high-impact figures.
  - Layout and Cohesion: Evaluates structural integrity and spatial organization for readability and aesthetic balance.
  - Aesthetic Score: Determined by a Vision-LLM acting as a strict visual judge.
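To make the content-quality side concrete, here is a minimal sketch. The bag-of-words `embed` stand-in and the exact way QA accuracy is combined with the compression ratio are assumptions for illustration; the benchmark itself would use dense sentence embeddings and its own weighting. Readability (PPL) is omitted because it requires scoring text under a language model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a dense sentence embedding: bag-of-words term counts.
    return Counter(text.lower().split())

def semantic_fidelity(paper_segment: str, page_segment: str) -> float:
    # Cosine similarity between embeddings of paired segments.
    a, b = embed(paper_segment), embed(page_segment)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def compression_aware_accuracy(qa_accuracy: float,
                               source_tokens: int,
                               page_tokens: int) -> float:
    # Hypothetical combination: QA-based factual accuracy scaled by how
    # strongly the page condenses the source (compression ratio).
    ratio = source_tokens / max(page_tokens, 1)
    return qa_accuracy * math.log1p(ratio)
```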
Multiple ablation studies, in which individual checking components are disabled, reveal that each step of the verification pipeline is essential for maximizing factual and visual accuracy.
3. AutoPage System Architecture
AutoPage is a hierarchical, multi-agent system designed to operate with PageBench. The system decomposes the paper-to-page transformation into discrete, verifiable stages (a control-flow sketch follows below):
- Narrative Planning and Structuring
  - Paper Content Parser: Converts the PDF to Markdown and extracts structured assets.
  - Page Content Planner: Organizes the Markdown and assets into a narrative blueprint suited for web presentation.
- Multimodal Content Generation
  - Page Content Generator: Refines text for accessibility and engagement.
  - Visual Content Generator: Selects and positions figures and tables to support the text contextually.
  - Checker Agents: After each step, LLM/VLM-based "Checker" agents compare generated content (text and images) against the original paper.
- Interactive Page Rendering
  - Page Template Matcher: Selects an appropriate HTML/CSS/JS template, filtered by user-specified tags (e.g., color scheme, navigation style).
  - HTML Generator: Integrates content into a responsive, production-ready webpage, subject to optional human-in-the-loop revision.
This multi-agent orchestration aims to maximize both automation and content reliability, while permitting human correction where needed.
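The staging can be read as straightforward control flow. In the sketch below, every agent function is a hypothetical stub returning toy data, so only the sequencing, the checking step, and the optional human pass carry meaning:

```python
def parse_paper(pdf_path):                    # Paper Content Parser (stub)
    return "# Paper\nBody text.", {"figures": ["fig1.png"]}

def plan_narrative(markdown, assets):         # Page Content Planner (stub)
    return {"sections": ["Overview", "Method", "Results"], "assets": assets}

def generate_text(blueprint):                 # Page Content Generator (stub)
    return {s: f"Readable summary of {s}." for s in blueprint["sections"]}

def check_against_source(content, markdown):  # Checker agent (stub)
    return content  # a real checker would query an LLM/VLM judge and revise

def match_template(tags):                     # Page Template Matcher (stub)
    return {"name": "minimal-light", "tags": tags}

def render_html(template, sections):          # HTML Generator (stub)
    body = "\n".join(f"<section><h2>{k}</h2><p>{v}</p></section>"
                     for k, v in sections.items())
    return f"<!-- template: {template['name']} -->\n<main>\n{body}\n</main>"

def autopage_pipeline(pdf_path, user_tags, human_review=None):
    markdown, assets = parse_paper(pdf_path)
    blueprint = plan_narrative(markdown, assets)
    sections = check_against_source(generate_text(blueprint), markdown)
    html = render_html(match_template(user_tags), sections)
    return human_review(html) if human_review else html

print(autopage_pipeline("paper.pdf", ["light", "sidebar-nav"]))
```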
4. Efficiency and Cost Benchmarks
Empirical results using state-of-the-art LLMs (e.g., Gemini-2.5-Flash) demonstrate that the full AutoPage pipeline can produce a polished, interactive webpage from a research paper in under 15 minutes, at a compute cost below $0.10. These figures are validated on the PageBench dataset. Design features such as coarse-to-fine narrative assembly and content–visual alignment are decisive for maintaining both fast turnaround and low resource consumption.
A plausible implication is that widespread adoption of AutoPage-like systems could substantially reduce the burden of project webpage creation, reallocating time and cost toward research and dissemination.
5. AI Hallucination Safeguards
To mitigate risks of hallucination and factual drift common in automated summarization and conversion, PageBench and AutoPage incorporate:
- Checker Agents: At each generation phase, outputs (text, images, tables) are systematically cross-validated by automated agents against the source document, enforcing factual consistency.
- Human-in-the-Loop Checkpoints: Researchers or authors may optionally intervene prior to final rendering, editing or confirming both textual and visual output.
This layered verification structure ensures that resulting webpages maintain semantic alignment, factual correctness, and authorial intent.
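The generate-check-revise loop implied by this design can be sketched minimally; the `generate` and `check` callables here are assumed, and only the control flow is shown:

```python
def checked_generate(generate, check, max_rounds=3):
    # Regenerate until the checker accepts or the round budget is spent;
    # unresolved drafts are flagged for the human-in-the-loop checkpoint.
    draft = generate(feedback=None)
    for _ in range(max_rounds):
        ok, feedback = check(draft)
        if ok:
            return draft, True
        draft = generate(feedback=feedback)
    return draft, False  # route to human review

# Example with trivial stand-ins:
gen = lambda feedback: "Accurate section text." if feedback else "Draft text."
chk = lambda d: (d.startswith("Accurate"), "cite the source figure")
print(checked_generate(gen, chk))  # -> ('Accurate section text.', True)
```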
6. Experimental Validation and Impact
Experimental validation on the PageBench dataset highlights the critical roles of both automated and human verification. Ablation studies show marked degradation in content and visual scores when checkpoints are removed, underscoring the necessity of every pipeline component, especially the full-content and HTML checkers.
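Such ablations can be driven mechanically. The sketch below assumes a hypothetical `evaluate` function that runs the pipeline with a given set of checkers enabled and returns the resulting PageBench scores; the checker names are likewise illustrative:

```python
CHECKERS = ("text_checker", "visual_checker", "html_checker")

def run_checker_ablation(evaluate, checkers=CHECKERS):
    # Disable one verification component at a time and compare each run
    # against the full pipeline; marked score drops flag essential components.
    results = {"full pipeline": evaluate(set(checkers))}
    for disabled in checkers:
        results[f"without {disabled}"] = evaluate(set(checkers) - {disabled})
    return results
```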
PageBench is intended as a standardized evaluation protocol, facilitating comparisons across future automated systems beyond AutoPage. By enabling fine-grained multi-dimensional analysis of content fidelity, visual rendering, and layout conventions, PageBench establishes a rigorous foundation for research into automated dissemination platforms.
7. Prospective Applications and Field Advancement
The introduction of PageBench marks an advance in both research communication and AI-driven webcraft. For scientists and practitioners, automated paper-to-page conversion provides a practical means for enhancing public accessibility and engagement with complex research outputs. As the field evolves, PageBench offers a benchmark for:
- Comparing alternative paper-to-page systems and model architectures.
- Driving improvements in content summarization, visual asset management, and web rendering techniques.
- Standardizing quality metrics for scientific content presented on the web.
This suggests an ongoing shift toward collaborative AI-human systems for trustworthy, efficient, and aesthetically compelling scientific communication, as envisioned by Ma et al. (22 Oct 2025).