- The paper introduces an edit-based paradigm that decomposes presentation generation into iterative, code-guided modifications anchored by reference slides.
- The multi-dimensional PPTEval framework quantitatively and qualitatively assesses content clarity, design consistency, and narrative coherence using advanced LLMs.
- Experimental results on the Zenodo10K dataset show that PPTAgent outperforms traditional methods, achieving over a 95% success rate in slide generation tasks.
The paper presents a comprehensive framework for automatically generating slide presentations by combining an iterative, edit-based workflow with a multi-dimensional evaluation framework. The proposed method, PPTAgent, reframes presentation generation as a two-stage process that leverages reference presentations and code-based slide modifications to overcome the limitations of traditional end-to-end text-to-slide generation. The work is accompanied by PPTEval, a novel framework designed to assess slide quality across content, design, and coherence dimensions.
Workflow and Problem Formulation
The method reformulates presentation generation into distinct stages:
- Stage I – Presentation Analysis:
- Slide Clustering: The system first analyzes reference presentations by clustering slides into functional groups, distinguishing structural slides (e.g., opening slides) from content-specific ones. Clustering combines textual features with image similarity, using hierarchical clustering for the visual grouping (see the clustering sketch after this list).
- Schema Extraction: After clustering, the approach uses LLMs to extract a content schema in which each slide element is characterized by its category, modality, and content (an illustrative shape is shown after this list). This structured schema informs subsequent editing decisions.
- Stage II – Presentation Generation:
- Outline Generation: The framework generates a structured outline by mapping document sections and semantic information to reference slides. Each outline entry specifies which reference slide to edit, along with the new slide's title, description, and associated content sections (see the example entry after this list).
- Slide Generation via Editable Actions: Instead of generating slides ab initio, the system applies a sequence of executable editing actions to modify a reference slide. The formulation shifts from the conventional method

$$S = \sum_{i=1}^{n} e_i = f(C),$$

where each slide $S$ is a sum of elements $e_i$ generated from the source content $C$, to an edit-based approach:

$$A = \sum_{i=1}^{m} a_i = f(C \mid R_j).$$

Here, each action $a_i$ (a snippet of executable code) is conditioned on both the input document $C$ and a reference slide $R_j$. The method relies on specialized editing APIs that allow precise modification of textual and visual elements, with slide contents represented in an HTML-based format to improve interpretability for LLMs; a sketch of this action-execution loop appears below.
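To make the visual-grouping step concrete, the following is a minimal sketch of hierarchical clustering over slide-image embeddings. The embedding source, cosine metric, average linkage, and distance threshold are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_slides(embeddings: np.ndarray, threshold: float = 0.35) -> np.ndarray:
    """Group slides by visual similarity via agglomerative clustering.

    embeddings: (n_slides, dim) array of slide-image features.
    Returns one cluster label per slide.
    """
    dists = pdist(embeddings, metric="cosine")      # pairwise cosine distances
    tree = linkage(dists, method="average")         # build the dendrogram
    return fcluster(tree, t=threshold, criterion="distance")  # cut at threshold

# Example: 8 slides with 512-dim features (random stand-ins for real embeddings).
labels = cluster_slides(np.random.rand(8, 512))
```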
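The extracted schema and the outline entries it supports might look roughly as follows; the field names are hypothetical stand-ins, since the paper does not prescribe a public data format beyond the category/modality/content characterization.

```python
# Hypothetical schema for one reference slide: each element is described by
# its category, modality, and content, following the paper's characterization.
slide_schema = {
    "elements": [
        {"category": "title",   "modality": "text",  "content": "Method Overview"},
        {"category": "bullets", "modality": "text",  "content": ["Stage I ...", "Stage II ..."]},
        {"category": "figure",  "modality": "image", "content": "architecture.png"},
    ]
}

# Hypothetical outline entry: which reference slide to edit, the new slide's
# title and description, and the document sections that supply its content.
outline_entry = {
    "reference_slide": 4,
    "new_title": "Proposed Workflow",
    "description": "High-level view of the two-stage pipeline",
    "source_sections": ["3.1 Presentation Analysis", "3.2 Presentation Generation"],
}
```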
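The action-execution loop can be sketched as follows. The API names (replace_text, replace_image, delete_element) are hypothetical stand-ins for the paper's editing APIs; the key idea is that the LLM emits a short program of calls $a_1, \dots, a_m$ that runs against the HTML rendering of the reference slide in a restricted namespace.

```python
class SlideEditor:
    """Holds the HTML rendering of a reference slide R_j and exposes edits."""

    def __init__(self, slide_html: str):
        self.html = slide_html

    def replace_text(self, element_id: str, new_text: str) -> None:
        ...  # locate the element in the HTML tree and swap its text

    def replace_image(self, element_id: str, image_path: str) -> None:
        ...  # swap the source of a visual element

    def delete_element(self, element_id: str) -> None:
        ...  # drop an element the new slide does not need

def apply_actions(editor: SlideEditor, generated_code: str) -> None:
    """Execute the LLM-generated action sequence against the slide.

    Only the editing API is exposed, so the generated code can modify
    the slide but cannot reach arbitrary Python functionality.
    """
    allowed = {
        "replace_text": editor.replace_text,
        "replace_image": editor.replace_image,
        "delete_element": editor.delete_element,
    }
    exec(generated_code, {"__builtins__": {}}, allowed)
```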
Evaluation Framework – PPTEval
To address the shortcomings of traditional metrics such as perplexity and ROUGE, the paper introduces PPTEval, a multi-dimensional evaluation framework that assesses generated presentations on:
- Content: Evaluating the clarity, informativeness, and visual-textual integration of slide content.
- Design: Measuring visual consistency, color schemes, and adherence to design principles.
- Coherence: Assessing the logical progression and narrative flow of the presentation.
PPTEval uses an LLM as a judge to provide both quantitative scores (on a 1-to-5 scale) and qualitative feedback; a minimal sketch of this setup follows. Extensive human evaluation studies show that the Pearson correlation between human evaluators' ratings and the LLM judge's exceeds 0.70, with particularly strong agreement in the design dimension.
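As a rough illustration of the judging setup, the sketch below asks a single LLM for 1-to-5 scores plus feedback in one call. The prompt wording, JSON schema, and model choice are assumptions for illustration; the paper's actual PPTEval prompts and per-dimension protocol differ in detail.

```python
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the presentation on a 1-5 scale for each dimension and reply in "
    'JSON: {"content": int, "design": int, "coherence": int, "feedback": str}.'
)

def judge_presentation(presentation_summary: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM judge for quantitative scores and qualitative feedback."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": presentation_summary},
        ],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(resp.choices[0].message.content)
```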
Experimental Validation
The experimental section demonstrates that PPTAgent substantially outperforms traditional presentation generation methods. Key numerical results include:
- Success Rate (SR): PPTAgent is robust, achieving success rates above 95% on slide generation tasks (e.g., with model combinations such as Qwen2.5 as the language model and Qwen2-VL as the vision model).
- Evaluation Metrics: Across the dimensions measured by PPTEval, the approach achieves an average score of 3.67, with notable improvements in coherence (from a baseline of 3.28 to 4.48) and design (from 2.33 to 3.24) when using GPT-4o.
- Ablation Studies: Removing components such as outline generation, slide schema guidance, or the code-rendering module causes marked performance degradation; for example, removing the code renderer reduces the success rate from 95.0% to 74.6%, highlighting the crucial role of a structured code representation for slide edits.
The approach also benefits from a new Zenodo10K dataset—comprising 10,448 presentations spanning multiple domains (culture, education, science, society, and technology)—which aids in demonstrating the method’s scalability and versatility.
Technical Contributions and Implications
- Edit-Based Paradigm: Decomposing presentation generation into iterative, code-guided edits facilitates better handling of pre-existing layout rules, visual styling, and text-image integration compared to end-to-end generation methods.
- Multimodal Processing: The use of specialized language and vision models (e.g., GPT-4o for language tasks and Qwen2-VL for vision tasks) shows that a properly integrated multimodal pipeline can rival or even surpass traditional single-modality approaches.
- Evaluation Rigor: PPTEval provides a more nuanced assessment of presentation quality than traditional automated metrics, incorporating aspects such as narrative structure and visual appeal that are crucial for real-world presentation effectiveness.
In summary, the paper presents a detailed technical methodology that leverages structured reference analysis and an iterative, code-driven editing process to generate high-quality presentations. The proposed framework not only improves generation fidelity across multiple dimensions but also offers a robust, scalable evaluation mechanism to guide future research in automated presentation generation.