
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides (2501.03936v3)

Published 7 Jan 2025 in cs.AI and cs.CL

Abstract: Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.


Summary

  • The paper introduces an edit-based paradigm that decomposes presentation generation into iterative, code-guided modifications anchored by reference slides.
  • The multi-dimensional PPTEval framework quantitatively and qualitatively assesses content clarity, design consistency, and narrative coherence using advanced LLMs as judges.
  • Experimental results on the new Zenodo10K dataset show that PPTAgent outperforms traditional methods, with over a 95% success rate in slide generation tasks.

The paper presents a comprehensive framework for automatically generating slide presentations by integrating an iterative, edit-based workflow with a multi-dimensional evaluation framework. The proposed method, PPTAgent, reframes presentation generation as a two-stage process that leverages reference presentations and code-based slide modifications to overcome the limitations of traditional end-to-end text-to-slide generation. The work is accompanied by PPTEval, a novel evaluation framework designed to assess slide quality across content, design, and coherence dimensions.

Workflow and Problem Formulation

The method reformulates presentation generation into distinct stages:

  • Stage I – Presentation Analysis:
    • Slide Clustering: The system initially analyzes reference presentations by clustering slides into groups based on their functionalities, such as structural (e.g., opening slides) versus content-specific slides. Clustering is performed using both textual features and image similarity, with hierarchical clustering techniques for visual grouping.
    • Schema Extraction: After clustering, the approach extracts content schemas using LLMs. Each slide element is characterized by its category, modality, and content. This structured schema informs subsequent editing decisions.
  • Stage II – Presentation Generation:

    • Outline Generation: The framework generates a structured outline by mapping document sections and semantic information to reference slides. Each outline entry provides guidance on which reference slide to edit and specifies the new slide’s title, description, and associated content sections.
    • Slide Generation via Editable Actions: Instead of generating slides ab initio, the system applies a sequence of executable editing actions to modify a reference slide. The formulation shifts from the conventional method

    $$\boldsymbol{S} = \sum_{i=1}^{n} e_i = f(C)$$

    where each slide is a sum of elements $e_i$ generated from the source content $C$, to an edit-based approach:

    $$\boldsymbol{A} = \sum_{i=1}^{m} a_i = f(C \mid R_j)$$

    Here, each action $a_i$ (representing a snippet of executable code) is conditioned on both the input document $C$ and a reference slide $R_j$. The method leverages specialized editing APIs that allow precise modification of text and visual elements, with slide contents represented in an HTML-based format to improve interpretability by LLMs.
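The edit-based formulation above can be sketched in code. The following is a minimal illustration, not the paper's actual implementation: the HTML slide representation is as described, but the `replace_span`/`delete_span` helpers, element ids, and the action list are all hypothetical stand-ins for PPTAgent's editing APIs.

```python
import re

# Hypothetical editing API: the paper exposes similar text/visual editing
# operations, but these function names and signatures are illustrative only.
def replace_span(html: str, span_id: str, new_text: str) -> str:
    """Replace the inner text of the element tagged with span_id."""
    pattern = rf'(<span id="{span_id}">).*?(</span>)'
    return re.sub(pattern, rf"\g<1>{new_text}\g<2>", html)

def delete_span(html: str, span_id: str) -> str:
    """Remove the element tagged with span_id entirely."""
    return re.sub(rf'<span id="{span_id}">.*?</span>', "", html)

# Reference slide R_j, rendered as HTML so the LLM can read and edit it.
reference_slide = (
    '<div class="slide">'
    '<span id="title">Reference Title</span>'
    '<span id="body">Reference body text.</span>'
    '<span id="footer">Confidential</span>'
    '</div>'
)

# A sequence of actions a_1..a_m that an LLM might emit given document C.
actions = [
    ("replace_span", "title", "PPTAgent: Edit-Based Slide Generation"),
    ("replace_span", "body", "New slides are produced by editing a reference."),
    ("delete_span", "footer", None),
]

# Apply the actions in order: the new slide S is the reference R_j after
# executing A = a_1, ..., a_m, rather than a slide built from scratch.
slide = reference_slide
for op, span_id, text in actions:
    if op == "replace_span":
        slide = replace_span(slide, span_id, text)
    elif op == "delete_span":
        slide = delete_span(slide, span_id)

print(slide)
```

Because each action is executable code against a structured representation, failed edits can be detected (e.g., a missing element id) and retried, which is what makes the iterative self-correction loop possible.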

Evaluation Framework – PPTEval

To address shortcomings in traditional metrics such as perplexity and ROUGE, the paper introduces PPTEval, a multi-dimensional evaluation framework that assesses generated presentations on:

  • Content: Evaluating the clarity, informativeness, and visual-textual integration of slide content.
  • Design: Measuring visual consistency, color schemes, and adherence to design principles.
  • Coherence: Assessing the logical progression and narrative flow of the presentation.

PPTEval utilizes an LLM as a judge to provide both quantitative scores (on a 1-to-5 scale) and qualitative feedback. Extensive human evaluation studies show that the Pearson correlation between human evaluators' ratings and the LLM judge's ratings exceeds 0.70, with particularly strong agreement observed in the design dimension.
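The reported human-judge agreement is an ordinary Pearson correlation over paired 1-to-5 ratings, which can be sketched as follows. The scores below are made-up illustrative values, not the paper's data.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Paired 1-5 ratings of the same presentations (illustrative only):
human_scores = [3, 4, 5, 2, 4, 3]
judge_scores = [3, 4, 4, 2, 5, 3]

r = pearson(human_scores, judge_scores)
print(round(r, 3))
```

A coefficient above 0.70, as reported per dimension in the paper, indicates that the LLM judge ranks presentations largely the way human evaluators do.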

Experimental Validation

The experimental section demonstrates that PPTAgent substantially outperforms traditional presentation generation methods. Key numerical results include:

  • Success Rate (SR): PPTAgent is robust, achieving success rates above 95% on slide generation tasks (e.g., with the Qwen2.5$_{LM}$ + Qwen2-VL$_{VM}$ model combination).
  • Evaluation Metrics: Across the dimensions measured by PPTEval, the approach achieves an average score of 3.67, with notable improvements in coherence (from a baseline of 3.28 to 4.48) and design (from 2.33 up to 3.24) when using GPT-4o.
  • Ablation Studies: Removing components such as outline generation, slide schema guidance, or the code-rendering module results in marked performance degradation. For example, removing the code-rendering module reduces the success rate from 95.0% to 74.6%, highlighting the crucial role of a structured code representation for slide edits.

The approach also benefits from the new Zenodo10K dataset, comprising 10,448 presentations spanning multiple domains (culture, education, science, society, and technology), which helps demonstrate the method's scalability and versatility.

Technical Contributions and Implications

  • Edit-Based Paradigm: Decomposing presentation generation into iterative, code-guided edits facilitates better handling of pre-existing layout rules, visual styling, and text-image integration compared to end-to-end generation methods.
  • Multimodal Processing: The use of specialized LLMs and vision models (e.g., GPT-4o for language tasks and Qwen2-VL for vision tasks) illustrates that properly integrated multimodal capabilities can rival or even surpass traditional monomodal approaches.
  • Evaluation Rigor: The introduction of PPTEval provides a more nuanced evaluation of presentation quality than traditional automated metrics, incorporating aspects like narrative structure and visual appeal that are crucial for real-world presentation effectiveness.

In summary, the paper provides a detailed technical methodology that leverages structured reference analysis and an interactive, code-driven editing process to generate high-quality presentations. The proposed framework not only improves the generation fidelity across multiple dimensions but also offers a robust, scalable evaluation mechanism to guide future research in automated presentation generation.
