Paper2Video Benchmark

Updated 13 October 2025
  • The paper introduces a new benchmark and dataset of 101 paper–video pairs, enabling supervised learning for multi-modal academic video synthesis.
  • It defines four novel evaluation metrics that measure slide fidelity, audience engagement, information transfer, and audience recall of the presenter's identity.
  • The multi-agent PaperTalker framework orchestrates slide, subtitle, cursor, and talker synthesis for superior, synchronized presentation videos.

The Paper2Video Benchmark defines a comprehensive evaluation and dataset resource specifically targeting the task of automatic academic presentation video generation from scientific papers. This benchmark addresses the unique challenges presented by research-to-video pipelines, including multi-modal parsing of papers (text, figures, tables), layout coordination, and multi-channel video synthesis (slides, subtitles, speech, and presenter rendering).

1. Benchmark Definition and Dataset Structure

The Paper2Video benchmark consists of a curated set of 101 research paper–video pairs. Each instance combines:

  • Full LaTeX source of the research paper,
  • The corresponding author-recorded presentation video (with aligned slide and talking-head streams),
  • Original slide files (when available), and
  • Speaker metadata (audio samples and portrait images).

This corpus enables instance-level studies and supervised learning targeting the entire pipeline from research paper to presentation-grade video, capturing the complexity of multi-modal academic communication.
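
To make the instance structure concrete, the sketch below shows one way such a paper–video pair could be represented in code. The directory layout and field names are illustrative assumptions, not the benchmark's actual schema; the released repository (linked in Section 5) defines the real format.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

@dataclass
class PaperVideoPair:
    """One benchmark instance; field names are illustrative, not the official schema."""
    latex_dir: Path                  # full LaTeX source of the paper
    presentation_video: Path         # author-recorded video (slides + talking head)
    slide_file: Optional[Path]       # original slides, when available
    voice_sample: Path               # reference audio clip of the presenter
    portrait_image: Path             # portrait image of the presenter

def load_corpus(root: Path) -> List[PaperVideoPair]:
    """Walk a hypothetical corpus layout with one subdirectory per paper–video pair."""
    pairs = []
    for pair_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        slides = pair_dir / "slides.pdf"
        pairs.append(PaperVideoPair(
            latex_dir=pair_dir / "latex",
            presentation_video=pair_dir / "presentation.mp4",
            slide_file=slides if slides.exists() else None,
            voice_sample=pair_dir / "speaker" / "voice.wav",
            portrait_image=pair_dir / "speaker" / "portrait.png",
        ))
    return pairs
```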

The benchmark is designed to reflect the dual goals of presentation videos: (i) effective transfer of research knowledge to an audience and (ii) enhancement of the author’s visibility (e.g., via presenter likeness and speech).

2. Evaluation Metrics

Paper2Video introduces four new, task-specific evaluation metrics tailored to academic presentation video generation:

  • Meta Similarity: Measures how closely the generated slides, subtitles, and speech match their human-created counterparts. Slide and subtitle similarity is scored on a five-point scale by a vision–language model (VLM); speech similarity is quantified via cosine similarity in a speaker-embedding space.
  • PresentArena: Uses VideoLLMs (video-capable large language models) as surrogate audience judges. The system-generated and human-reference videos are compared pairwise in both possible orders; the generated video's win rate measures how well it delivers content and engagement (a sketch of this protocol follows the list).
  • PresentQuiz: Assesses information transfer by generating multiple-choice questions from the source paper and having a VideoLLM answer them after "watching" the generated video. Quiz accuracy quantifies how well key ideas are conveyed, probing both local details and global understanding (a scoring sketch appears at the end of this section).
  • IP Memory: Inspired by real-world conference scenarios, this metric quantifies whether an audience can recall the author's identity and main contribution after viewing—if a strong author–content association forms, the video receives a higher IP Memory score.
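
As an illustration of the PresentArena protocol, the sketch below computes an order-debiased win rate. The `judge` callable stands in for the VideoLLM comparison and is an assumption for illustration, not the released evaluation code.

```python
from typing import Callable, Iterable, Tuple

def present_arena_win_rate(
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str, str], str],
) -> float:
    """Order-debiased pairwise evaluation.

    Each (generated, reference) video pair is shown to the judge in both orders
    to cancel position bias; judge(video_a, video_b) is a placeholder for a
    VideoLLM call that returns "A" or "B" for the better presentation.
    """
    wins, trials = 0, 0
    for generated, reference in pairs:
        if judge(generated, reference) == "A":   # generated video shown first
            wins += 1
        if judge(reference, generated) == "B":   # generated video shown second
            wins += 1
        trials += 2
    return wins / trials if trials else 0.0
```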

These metrics jointly capture content fidelity, engagement, educational effectiveness, and author-audience association in a multi-modal context.
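
A similarly minimal sketch of the PresentQuiz scoring step, assuming the multiple-choice questions have already been generated from the paper; `answer_after_watching` is a placeholder for the VideoLLM that answers after viewing the generated video.

```python
from typing import Callable, List, Sequence, Tuple

# (question text, answer options, correct option) -- an assumed representation
Question = Tuple[str, Sequence[str], str]

def present_quiz_accuracy(
    questions: List[Question],
    answer_after_watching: Callable[[str, Sequence[str]], str],
) -> float:
    """Fraction of paper-derived multiple-choice questions the VideoLLM answers
    correctly after watching the generated video (a proxy for information transfer)."""
    if not questions:
        return 0.0
    correct = sum(
        1 for text, options, gold in questions
        if answer_after_watching(text, options) == gold
    )
    return correct / len(questions)
```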

3. Multi-Agent System Architecture

The framework utilizes a modular, multi-agent architecture ("PaperTalker") to address the intricacies of academic presentation video generation:

  • Slide Builder: Generates slides directly from LaTeX (Beamer), iteratively refining the code through compilation-feedback loops and optimizing layout parameters (e.g., font size, figure scale) via a "Tree Search Visual Choice": candidate renders are scored by a VLM to resolve issues such as content overflow (a layout-selection sketch follows this list).
  • Subtitle Builder: Converts slides to natural language subtitles using a VLM tuned for summarization. Each subtitle is paired with "visual-focus" prompts, which are later used to guide cursor movement and focus.
  • Cursor Builder: Grounds the visual-focus prompts spatially, leveraging a UI interaction model (UI-TARS) to determine on-slide coordinates. Word-level timestamps are extracted with WhisperX, aligning cursor movement temporally with the spoken narration and slide transitions (a scheduling sketch follows below).
  • Talker Builder: Synthesizes speech using F5-TTS, adapted to the presenter's provided voice sample. For the presenter video, talking-head rendering is achieved via Hallo2 or FantasyTalking, allowing either portrait-only or upper-body animation while preserving speaker identity and ensuring natural visual performance. Slide-wise video generation is parallelized to accelerate the overall pipeline.
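
The layout-selection step of the Slide Builder can be pictured as follows: sweep a few layout parameters, compile each Beamer candidate, and let a VLM judge pick the cleanest render. This is a simplified flat parameter sweep rather than a full tree search, and the `<FONT>`/`<SCALE>` template markers and `vlm_score_render` callable are assumptions for illustration.

```python
import itertools
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical search space; the actual parameters and ranges are not specified here.
FONT_SIZES = ["\\small", "\\normalsize", "\\large"]
FIGURE_SCALES = [0.6, 0.8, 1.0]

def render_candidates(beamer_template: str) -> List[Tuple[Dict[str, str], Path]]:
    """Compile one Beamer candidate per layout setting; failed compiles are dropped.

    beamer_template is assumed to contain literal <FONT> and <SCALE> markers."""
    candidates = []
    for font, scale in itertools.product(FONT_SIZES, FIGURE_SCALES):
        tex = beamer_template.replace("<FONT>", font).replace("<SCALE>", str(scale))
        workdir = Path(tempfile.mkdtemp())
        (workdir / "slide.tex").write_text(tex)
        proc = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "slide.tex"],
            cwd=workdir, capture_output=True,
        )
        if proc.returncode == 0:
            candidates.append(({"font": font, "scale": str(scale)}, workdir / "slide.pdf"))
    return candidates

def pick_best(
    candidates: List[Tuple[Dict[str, str], Path]],
    vlm_score_render: Callable[[Path], float],
) -> Path:
    """Return the render the VLM judge scores highest (e.g., penalizing overflow)."""
    _, best_pdf = max(candidates, key=lambda c: vlm_score_render(c[1]))
    return best_pdf
```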

A key technical aspect is the tight alignment and synchronization of multimodal streams (slides, cursor, subtitles, talker) for a cohesive presentation effect.
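
A minimal sketch of the temporal side of the Cursor Builder, assuming word-level timestamps (e.g., produced by WhisperX) and per-sentence focus coordinates (e.g., produced by UI-TARS) are already available; both upstream models are treated as black boxes here.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CursorEvent:
    time_s: float   # when the cursor should arrive at the target
    x: float        # normalized on-slide x coordinate from the grounding model
    y: float        # normalized on-slide y coordinate from the grounding model

def schedule_cursor(
    word_timestamps: List[Tuple[str, float, float]],   # (word, start_s, end_s)
    sentence_spans: List[Tuple[int, int]],             # word-index range per subtitle sentence
    focus_points: List[Tuple[float, float]],           # (x, y) per subtitle sentence
) -> List[CursorEvent]:
    """Move the cursor to each sentence's visual-focus target at the moment its
    first word is spoken, so pointing stays in sync with the narration."""
    events = []
    for (first_word, _last_word), (x, y) in zip(sentence_spans, focus_points):
        start_s = word_timestamps[first_word][1]   # start time of the sentence's first word
        events.append(CursorEvent(time_s=start_s, x=x, y=y))
    return events
```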

4. Experimental Results and Ablations

Empirical results using the Paper2Video benchmark demonstrate that PaperTalker significantly outperforms baseline systems across all metrics:

  • Achieves the highest Meta Similarity (for slides, subtitles, and synthesized speech) scores among comparative approaches.
  • Yields superior pairwise winning rates in PresentArena when judged by VideoLLMs.
  • Outperforms human-made presentations by approximately 10% in PresentQuiz accuracy, while also receiving comparable or better ratings from human evaluators for subjective video quality.

Ablation studies show that omitting personalized talker rendering or cursor grounding substantially degrades both information deliverability and perceived video quality, supporting the necessity of coordinated multimodal synthesis.

5. Technical and Resource Details

Notable technical features include:

  • Iterative slide code refinement via compilation/error feedback for robust LaTeX–to–visual rendering.
  • Tree search layout optimization: parameter sweeps (e.g., font size, figure scaling), candidate rendering, and VLM-based candidate selection.
  • Speech synthesis formalized as:

$$\widetilde{A}_i = \mathrm{TTS}\big(\{T_i^j\}_{j=1}^{m_i} \,\big|\, \mathcal{A}\big), \quad i = 1, \ldots, n$$

where $T_i^j$ are the subtitle sentences, $m_i$ is their number, and $\mathcal{A}$ is the input voice sample.

  • Engineering efficiency: Large-scale parallelization (e.g., eight NVIDIA RTX A6000 GPUs) enables more than 6× speedup over prior systems for full video pipeline generation.
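
The slide-wise parallelization can be sketched as below, assuming an independent per-slide rendering function (the actual TTS and talking-head calls are abstracted behind `render_fn`) and one GPU per concurrent worker; this scheduling scheme is an assumption for illustration, not the released implementation.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Sequence, Tuple

NUM_GPUS = 8  # e.g., eight RTX A6000s as in the reported setup; adjust to available hardware

def _render_one(job: Tuple[Callable[[str, str], str], int, str, str]) -> str:
    """Render one slide's video segment, selecting a GPU by slide index.
    (In practice CUDA_VISIBLE_DEVICES must be set before the GPU framework
    initializes in the worker process.)"""
    render_fn, slide_idx, slide_path, audio_path = job
    os.environ["CUDA_VISIBLE_DEVICES"] = str(slide_idx % NUM_GPUS)
    return render_fn(slide_path, audio_path)

def render_all(
    render_fn: Callable[[str, str], str],   # placeholder per-slide talker/slide renderer
    slides: Sequence[str],
    audios: Sequence[str],
) -> List[str]:
    """Slides are mutually independent, so their segments can be rendered in parallel
    and concatenated afterwards, which is where the reported speedup comes from."""
    jobs = [(render_fn, i, s, a) for i, (s, a) in enumerate(zip(slides, audios))]
    with ProcessPoolExecutor(max_workers=NUM_GPUS) as pool:
        return list(pool.map(_render_one, jobs))
```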

The codebase, dataset (101 paper–video pairs), and all modules (agents, trained models, benchmarks) are open-sourced at https://github.com/showlab/Paper2Video.

6. Research Impact and Future Directions

The Paper2Video benchmark establishes a rigorous, multi-modal, and long-context evaluation environment for the emergent task of automated academic presentation video generation. By addressing alignment across slides, narrative, timing, and speaker identity, and introducing metrics that go substantially beyond visual fidelity, the benchmark sets a new standard for research-paper-to-video synthesis.

Future research directions suggested by these findings include:

  • Improving personalized presenter rendering, notably in spontaneous gestures and gaze.
  • Enhanced multimodal semantic alignment, especially for complex tables and mathematical notation beyond current VLM capabilities.
  • Expanding to longer presentations and cross-lingual synthesis scenarios, as well as refining audience-adaptive delivery.
  • Further benchmarking using larger and more diverse paper–video corpora.

The availability of open datasets, metrics, and code lowers the entry barrier for future developments in both automated academic communication tools and generalizable multi-modal video synthesis pipelines.
