PaperTalker Multi-Agent Framework

Updated 13 October 2025
  • The paper introduces an end-to-end system that decomposes video presentation generation into specialized agent tasks, ensuring high fidelity and efficiency.
  • It employs advanced tree search layout refinement and parallel processing, achieving over 6× speedup while optimizing visual and informational quality.
  • The framework outperforms state-of-the-art methods on the Paper2Video benchmark with superior meta similarity, engagement, quiz accuracy, and identity retention.

The PaperTalker Multi-Agent Framework is an end-to-end system for generating academic presentation videos directly from scientific papers. It orchestrates a collaborative pipeline of specialized agents, each responsible for a key presentation modality such as slides, visual layout, subtitles, speech, cursor grounding, or talking-head rendering. The framework is evaluated on a purpose-built benchmark (Paper2Video) with multiple information-fidelity and presentability metrics, and it targets both quality and computational efficiency through agent-level parallelization and layout optimization.

1. System Architecture and Agent Specialization

PaperTalker decomposes the presentation video generation process into distinct roles realized by modular “builders.” Starting from the input LaTeX source (𝒟), speaker portrait (ℐ), and author audio (𝒜), PaperTalker launches the following agents:

  • Slide Builder: Transforms paper content into LaTeX Beamer code, compiles the slides, and iteratively refines the layout to resolve formatting and content-positioning issues. A tree-search visual-choice mechanism systematically explores a neighborhood of layout variants, adjusting parameters such as font size and figure scaling, scores each variant with a vision-language model (VLM), and keeps the variant with the best visual aesthetics.
  • Subtitle Builder: Extracts sentence-level subtitles and generates visual focus prompts from rendered slide images using VLMs. This ensures subtitle regions reflect the most salient slide information.
  • Cursor Builder: Combines a GUI-grounded model (UI-TARS) with WhisperX transcription to produce spatial–temporal alignment, generating cursor trajectories that shadow the narrated content precisely.
  • Talker Builder: Synthesizes speech from the subtitles with text-to-speech (TTS) models conditioned on the author’s sample audio (𝒜), and drives a talking-head renderer (e.g., Hallo2, FantasyTalking) to generate lifelike presenter animations aligned to the speech track.

All these builders operate on a per-slide basis, and their outputs are synchronized for unified rendering, forming a multi-modal academic video.
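
As a rough illustration of this decomposition (not the authors' implementation), the per-slide orchestration can be sketched as follows; the builder callables, the `SlideAssets` fields, and the function signatures are all hypothetical stand-ins for the real agents:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class SlideAssets:
    """Per-slide artifacts; field names are illustrative, not the authors' schema."""
    slide_image: str           # rendered Beamer slide (image path)
    subtitles: List[str]       # sentence-level subtitle segments
    speech_wav: str            # synthesized narration (audio path)
    cursor_trace: List[Tuple]  # timed cursor targets on the slide
    talker_clip: str           # talking-head video segment (path)

def build_one_slide(
    slide_idx: int,
    paper_tex: str,                 # LaTeX source D
    portrait: str,                  # speaker portrait I
    ref_audio: str,                 # author audio A
    builders: Dict[str, Callable],  # maps builder name -> callable; stands in for the agents
) -> SlideAssets:
    """Sequential per-slide pipeline; in PaperTalker the slides themselves run in parallel (Section 3)."""
    slide_image = builders["slide"](paper_tex, slide_idx)       # Beamer generation + layout search
    subtitles = builders["subtitle"](slide_image)               # VLM-driven subtitle extraction
    speech_wav = builders["tts"](subtitles, ref_audio)          # TTS conditioned on the author's voice
    cursor_trace = builders["cursor"](slide_image, speech_wav)  # UI-TARS + WhisperX alignment
    talker_clip = builders["talker"](portrait, speech_wav)      # talking-head synthesis
    return SlideAssets(slide_image, subtitles, speech_wav, cursor_trace, talker_clip)
```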

2. Evaluation Metrics and Benchmarking

PaperTalker is benchmarked using four tailored metrics designed to assess not only information fidelity but also presentation and identity retention:

  • Meta Similarity: Measures alignment between automatically generated output and human-produced presentations, using VLM-derived slide–subtitle scores and audio-embedding cosine similarities for both visual and speech channels (the speech channel is sketched at the end of this section).
  • PresentArena: Engages VideoLLMs as proxy audience agents for pairwise preference testing between PaperTalker and human reference videos, quantifying audience-perceived engagement and clarity by “winning rates.”
  • PresentQuiz: Poses multiple-choice quizzes (automatically generated from the source papers) to VideoLLMs, which must answer using only the generated video as context. Higher quiz accuracy indicates superior informational coverage.
  • IP Memory: Estimates the capacity of the video to reinforce the correct association between author identity and research content, simulating conference-style “memory” tasks as an evaluative proxy for academic impact.

This suite offers fine-grained evaluation that goes beyond traditional pixel-level or ASR-based metrics, accurately profiling both informational transfer and presentability.
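
For illustration, the speech channel of Meta Similarity described above reduces to an embedding-similarity computation. The following sketch assumes per-utterance speaker embeddings from an unspecified audio encoder and a simple mean aggregation, neither of which is taken from the benchmark's exact recipe:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def speech_meta_similarity(gen_embeds: list, ref_embeds: list) -> float:
    """Mean cosine similarity between generated and human-reference speech embeddings.

    The pairing (per slide or per utterance), the audio encoder, and the mean
    aggregation are assumptions made for this sketch.
    """
    scores = [cosine_similarity(g, r) for g, r in zip(gen_embeds, ref_embeds)]
    return float(np.mean(scores)) if scores else 0.0
```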

3. Slide Generation, Layout Refinement, and Parallelization

The Slide Builder employs an LLM to generate an initial Beamer (LaTeX) draft, compiling and debugging iteratively in response to errors (e.g., overfull warnings). For layout refinement, a rule-based local search produces a set of candidate slide variants, each rendered as an image and ranked using a VLM evaluating for readability, information hierarchy, and visual balance.

This “tree search visual choice” algorithm addresses the highly constrained requirements of research presentations—where content density, equation placement, and figure alignment are critical—by leveraging model-based perceptual metrics rather than hand-crafted heuristics.
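
A minimal sketch of this search loop, assuming a hypothetical `render_variant` function (Beamer compile plus rasterization) and a VLM scorer exposed as `score_layout`, might look like:

```python
import itertools
from typing import Callable, Dict, Sequence

def tree_search_layout(
    base_params: Dict[str, float],
    render_variant: Callable[[Dict[str, float]], str],  # params -> rendered slide image path
    score_layout: Callable[[str], float],               # VLM score for a rendered slide
    font_scales: Sequence[float] = (0.9, 1.0, 1.1),
    figure_scales: Sequence[float] = (0.8, 1.0, 1.2),
) -> Dict[str, float]:
    """Enumerate a local neighborhood of layout variants and keep the best-scoring one.

    A rule-based local search over discrete layout parameters; the parameter
    names and candidate grids are illustrative, not PaperTalker's exact choices.
    """
    best_params, best_score = dict(base_params), float("-inf")
    for fs, gs in itertools.product(font_scales, figure_scales):
        candidate = {**base_params, "font_scale": fs, "figure_scale": gs}
        image = render_variant(candidate)   # compile Beamer and render to an image
        score = score_layout(image)         # VLM judges readability, hierarchy, balance
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params
```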

PaperTalker parallelizes slide-agent activities, running the slide code generation, layout search, TTS, and talking-head rendering for each slide in a separate thread or process. This architectural decision yields a greater than 6× speedup over sequential processing, mitigating the scalability bottlenecks endemic to high-resource multi-modal video generation.
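
A minimal sketch of this per-slide parallelization, assuming a `per_slide_fn(slide_idx)` closure over the fixed paper inputs (for example, `functools.partial` around the `build_one_slide` sketch in Section 1), might be:

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Sequence

def build_all_slides(
    slide_indices: Sequence[int],
    per_slide_fn: Callable[[int], object],  # runs layout search, TTS, cursor, and talking-head for one slide
    max_workers: int = 8,                   # illustrative worker count, not PaperTalker's setting
) -> List[object]:
    """Run the independent per-slide pipelines concurrently, preserving slide order."""
    # ProcessPoolExecutor is one choice; a thread pool works equally well when the
    # heavy lifting happens in external model calls rather than Python code.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(per_slide_fn, slide_indices))
```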

4. Cursor Grounding and Multimodal Synchronization

The Cursor Builder integrates spatial grounding through a GUI-aware model (UI-TARS), mapping narrated entities to their corresponding slide elements. Word-aligned timestamps (via WhisperX) enable precise spatial–temporal traces for cursor movement during narration. This innovation is significant for scientific presentations, where verbal references to equations, plots, or definitions benefit from explicit visual guidance. The resulting cursor trajectory is rendered in the final video, augmenting information salience and mimicking expert presenter behavior.
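
The sketch below shows one way to fuse word-level timestamps with grounded bounding boxes into a cursor trajectory; the `ground` callable is a hypothetical stand-in for UI-TARS, and the alignment dictionaries mirror the shape of WhisperX word alignments:

```python
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, width, height) in slide coordinates

def build_cursor_trace(
    word_alignments: List[Dict],         # e.g. [{"word": "equation", "start": 1.2, "end": 1.6}, ...]
    ground: Callable[[str, str], BBox],  # (slide_image, phrase) -> bounding box; stand-in for UI-TARS
    slide_image: str,
) -> List[Dict]:
    """Map each narrated word or phrase to a timed cursor target on the slide."""
    trace = []
    for item in word_alignments:
        x, y, w, h = ground(slide_image, item["word"])  # spatial grounding of the narrated entity
        trace.append({
            "start": item["start"],
            "end": item["end"],
            "point": (x + w / 2.0, y + h / 2.0),        # cursor aims at the box center
        })
    return trace
```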

Temporal and semantic coherence across slides, subtitles, cursor, and talker are maintained through modular synchronization primitives, ensuring seamless alignment of all modalities.

5. Quantitative and Qualitative Results

On the Paper2Video benchmark of 101 papers (each with a corresponding human-produced presentation video, slide deck, and speaker metadata), PaperTalker surpasses previous systems (including PresentAgent and Veo3) in all substantive metrics:

  • Highest Meta Similarity in both speech and slides.
  • Leading “winning rates” in PresentArena versus human and other AI-generated presentations.
  • Superior PresentQuiz accuracy, with some cases exceeding human presentations by 10%.
  • Ablation studies reveal that disabling individual agents (e.g., dropping the cursor or the talking head) causes a marked performance regression in both machine and human evaluation.

A qualitative summary of the experimental results (only high-level outcomes are reproduced here):

| Metric | PaperTalker | SOTA Baseline | Human Ref. |
|---|---|---|---|
| Meta Similarity | Highest | Lower | Reference |
| PresentArena | Top win rate | Lower | Reference |
| PresentQuiz | Highest | Lower | Reference |
| IP Memory | Best | Lower | Reference |

These results substantiate PaperTalker’s superiority in faithfulness, informativeness, and engagement for academic video generation.

6. Technical Implementation and Mathematical Details

Key operations are formalized as follows:

  • Subtitle-to-speech conversion:

$$\widetilde{\mathcal{A}}_i = \operatorname{TTS}\big(\{T_i^j\}_{j=1}^{m_i} \mid \mathcal{A}\big) \quad \text{for } i = 1, 2, \dots, n$$

where $T_i^j$ denotes the $j$-th subtitle segment of slide $i$, $m_i$ is the number of subtitle segments for that slide, and $\mathcal{A}$ is the reference author audio (a code sketch follows this list).

  • The cursor trajectory function takes the spatial–temporal mappings produced by UI-TARS and WhisperX as input and yields per-word bounding-box traces that are rendered as overlay animations synchronized to the audio and visual streams.
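
Mirroring the subtitle-to-speech equation above, a per-slide TTS loop could be sketched as follows; `tts_synthesize` is a hypothetical wrapper around any voice-cloning TTS model, not a named PaperTalker API:

```python
from typing import Callable, List, Sequence

def synthesize_slide_narration(
    subtitles_per_slide: Sequence[Sequence[str]],   # {T_i^j} grouped by slide i
    ref_audio: str,                                 # path to the author's reference audio A
    tts_synthesize: Callable[[str, str], object],   # (text, ref_audio) -> waveform; stand-in for the TTS model
) -> List[object]:
    """Apply TTS slide by slide, conditioning every segment on the author's voice sample."""
    narrations = []
    for subtitles in subtitles_per_slide:                # i = 1, ..., n
        text = " ".join(subtitles)                       # concatenate the m_i subtitle segments
        narrations.append(tts_synthesize(text, ref_audio))  # yields A~_i
    return narrations
```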

Tree search layout refinement is explicitly formulated as a discrete search over layout parameterizations with maximization guided by VLM scoring. All multimedia outputs are finally fused using synchronized pipelines to produce a unified video.
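
One way to write the layout-search objective mentioned above, using notation introduced here for clarity rather than taken verbatim from the paper, is:

$$\theta_i^{*} = \arg\max_{\theta \in \Theta_i} S_{\mathrm{VLM}}\big(\mathrm{Render}(\theta)\big),$$

where $\Theta_i$ is the discrete set of candidate layout parameterizations for slide $i$ (font size, figure scale, and similar knobs) and $S_{\mathrm{VLM}}$ is the vision-language-model score of the rendered slide.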

7. Dataset, Code Distribution, and Research Implications

The Paper2Video dataset (annotated research papers, peer videos, slide files, and speaker metadata), along with the implementation code and agent models, is available open source at https://github.com/showlab/Paper2Video. This release lowers the barrier to scalable academic video content generation and establishes a reproducible benchmark in the literature.

As evidenced by the empirical results, the composable architecture of agent-specialized builders, domain-aligned metrics, and scalable computation defines a new paradigm for scientific multimedia communication. The framework’s modular, agent-oriented design, as well as its open evaluation system, enables systematic research and future improvements in both human and AI-mediated science communication.
