SlidesGen-Bench: Unified Slide Benchmark
- SlidesGen-Bench is a unified benchmark for automated slide generation that evaluates systems on content, aesthetics, and editability using image-based analysis.
- It employs reproducible, closed-form metrics alongside OCR, layout detection, and semantic extraction to replace subjective assessments.
- Experimental results show strong human correlation, establishing a robust, reference-free framework for comparing diverse slide generation methodologies.
SlidesGen-Bench is a unified, computational benchmark for evaluating automated slide generation systems. It operationalizes three core principles—universality, quantification, and reliability—by grounding all evaluation in the image domain, employing reproducible closed-form metrics, and calibrating these metrics against human preference data. SlidesGen-Bench provides a standardized, reference-free testbed for comparing diverse paradigms of slide generation, encompassing code-driven layouts, image-centric synthesis, and template-based workflows (Yang et al., 14 Jan 2026).
1. Foundational Principles and Design Philosophy
SlidesGen-Bench is structured around three guiding principles:
- Universality: All slide generation outputs, regardless of their underlying representation (e.g., PowerPoint files, HTML/CSS slides, or static image sets), are rendered to raster images ("frames"). This image-domain evaluation ensures agnosticism to the generation pipeline and supports cross-system comparability.
- Quantification: The benchmark defines three rigorous evaluation axes—Content, Aesthetics, and Editability—each computed through closed-form, algorithmic pipelines. This replaces subjective heuristics and LLM judgments with reproducible metrics.
- Reliability: Each metric is validated for alignment with human preference through the Slides-Align1.5k dataset, which comprises extensive crowd-annotated rankings collected for a diverse set of generated decks. Statistical correlation between metric and human rankings is reported as the principal reliability indicator (Yang et al., 14 Jan 2026).
2. Unified Image-based Evaluation Framework
SlidesGen-Bench enforces strict input agnosticism by converting all outputs into slide images, enabling uniform downstream processing:
- Layout Detection: PaddleOCR's layout model localizes text blocks, charts, and graphical elements.
- Text and Content Extraction: Optical Character Recognition (OCR) is used for parsing text; semantic prompts to Vision-LLMs (VLMs) facilitate “open-book” content quizzing.
- Aesthetic Analysis: Extraction of quantitative descriptors including color histograms, steerable pyramid-based entropy, and luminance profiles.
- Editability Assessment: Evaluation of editable structures via inspection of underlying file formats (e.g., Office-XML) or the Document Object Model (for web-based decks).
This unified, image-centric approach guarantees that every system—regardless of internal pipeline, file format, or rendering engine—can be evaluated on strictly equal footing (Yang et al., 14 Jan 2026).
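A minimal sketch of this rendering-and-parsing step is shown below. It assumes a deck is first rasterized to per-slide images (here via LibreOffice and pdf2image, which are assumptions about tooling rather than the benchmark's exact stack) and then passed through PaddleOCR's PP-Structure layout/OCR pipeline; all helper names other than the library calls are hypothetical.

```python
# Sketch: render a deck to frames, then run layout detection + OCR with
# PaddleOCR's PP-Structure pipeline. The rendering route (LibreOffice -> PDF ->
# images) and all helper names are illustrative assumptions, not the
# benchmark's exact stack.
import subprocess
from pathlib import Path

import cv2                                 # pip install opencv-python
from pdf2image import convert_from_path    # pip install pdf2image
from paddleocr import PPStructure          # pip install paddleocr

def render_deck_to_frames(deck_path: str, out_dir: str) -> list[str]:
    """Rasterize a deck (e.g. .pptx) to one PNG per slide."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Headless conversion to PDF, then one image per page.
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out), deck_path],
        check=True,
    )
    pdf_path = out / (Path(deck_path).stem + ".pdf")
    frames = []
    for i, page in enumerate(convert_from_path(str(pdf_path), dpi=150)):
        frame_path = out / f"slide_{i:03d}.png"
        page.save(frame_path)
        frames.append(str(frame_path))
    return frames

def analyze_frames(frames: list[str]) -> list[dict]:
    """Localize text blocks, figures, tables, etc. on each rendered slide."""
    engine = PPStructure()  # layout analysis + OCR
    per_slide = []
    for frame in frames:
        regions = engine(cv2.imread(frame))
        per_slide.append({
            "frame": frame,
            # Each detected region carries a type (text, figure, table, ...) and bbox.
            "regions": [{"type": r["type"], "bbox": r["bbox"]} for r in regions],
        })
    return per_slide
```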
3. Quantitative Metrics: Content, Aesthetics, and Editability
SlidesGen-Bench formalizes three distinct assessment axes, each with domain-specific computation:
Content Quality
Content measures the fidelity of concept and fact transfer from the source document to the generated slides. The "QuizBank" protocol is implemented:
- For each evaluation, multiple-choice questions (5 concept, 5 data) are crafted from the source.
- Slide content is converted to Markdown and presented, along with the questions, to an LLM acting as an examiner.
- The metric is the examiner's accuracy over the $N = 10$ questions:

  $$\text{Content} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{a}_i = a_i^{*}\right],$$

  where $\hat{a}_i$ is the system's answer to question $i$ and $a_i^{*}$ is the corresponding human gold answer.
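The scoring step itself is straightforward; the sketch below assumes the examiner LLM's answers are collected through a placeholder call (`ask_examiner` is hypothetical, since the benchmark does not pin down a specific model API here).

```python
# Sketch of the QuizBank scoring step: compare the examiner LLM's answers
# against the human gold answers. `ask_examiner` is a hypothetical stand-in
# for whatever LLM call presents the slide Markdown plus one question.
from dataclasses import dataclass

@dataclass
class QuizItem:
    question: str
    options: list[str]
    gold: str           # e.g. "B"
    kind: str           # "concept" or "data"

def ask_examiner(slide_markdown: str, item: QuizItem) -> str:
    """Placeholder: send slides + question to an LLM examiner, return its choice."""
    raise NotImplementedError

def quizbank_score(slide_markdown: str, items: list[QuizItem]) -> dict:
    """Content score = fraction of quiz questions answered correctly."""
    correct = {"concept": 0, "data": 0}
    totals = {"concept": 0, "data": 0}
    for item in items:
        totals[item.kind] += 1
        answer = ask_examiner(slide_markdown, item)
        if answer.strip().upper() == item.gold.upper():
            correct[item.kind] += 1
    total = sum(totals.values())
    return {
        "content": sum(correct.values()) / total if total else 0.0,
        "concept_acc": correct["concept"] / max(totals["concept"], 1),
        "data_acc": correct["data"] / max(totals["data"], 1),
    }
```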
Aesthetics
A composite score derived from four components:
- Harmony: Color-template fit in HSV space, penalized by inter-slide hue variance.
- Engagement: Composite of deck colorfulness and pacing, with colorfulness derived from color-channel variances and pacing quantified via the standard deviation of per-slide color metrics.
- Usability: Luminance contrast of text versus background, normalized by natural logarithms.
- Visual Rhythm (VHRV): This involves subband entropy (in CIE-Lab) and the root mean square of entropy change across the slide sequence.
- The overall aesthetic score is the composite of these four component scores (Harmony, Engagement, Usability, and Visual Rhythm).
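As one concrete illustration, the sketch below combines four component scores, with the standard Hasler–Süsstrunk colorfulness measure standing in for the "channel variances" ingredient and a WCAG-style contrast ratio for Usability; the equal-weight aggregation and the component interfaces are assumptions, not the benchmark's published formula.

```python
# Sketch: one plausible way to assemble the aesthetic composite. The
# Hasler-Suesstrunk colorfulness measure is a standard choice consistent with
# "channel variances"; the equal-weight aggregation is an assumption.
import numpy as np

def colorfulness(rgb: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness from opponent-channel statistics."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    rg = r - g
    yb = 0.5 * (r + g) - b
    return float(np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                 + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))

def luminance_contrast(fg_lum: float, bg_lum: float) -> float:
    """WCAG-style contrast ratio between text and background luminance (inputs in [0, 1])."""
    hi, lo = max(fg_lum, bg_lum), min(fg_lum, bg_lum)
    return (hi + 0.05) / (lo + 0.05)

def aesthetic_composite(harmony: float, engagement: float,
                        usability: float, visual_rhythm: float) -> float:
    """Assumed equal-weight composite of the four normalized component scores."""
    return float(np.mean([harmony, engagement, usability, visual_rhythm]))
```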
Editability (PEI Taxonomy)
Editability is measured by the highest level achieved in the Presentation Editability Index (PEI), assessed by a "knockout" test:
| Level | Characterization (abbreviated) |
|---|---|
| L₀ | Static raster/dead text |
| L₁ | Editable text, but ungrouped |
| L₂ | Vector shapes, lacking hierarchy |
| L₃ | Structural grouping/master present |
| L₄ | Proper data-bound charts |
| L₅ | Cinematic/animated content |
The score is computed as the highest level passed in the knockout test:

$$\text{PEI} = \max\left\{\ell \in \{0, 1, \dots, 5\} \,:\, \text{the deck satisfies every requirement up to level } \ell\right\}.$$
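A compact way to express the knockout test is a ladder of per-level predicates evaluated in order, as in the sketch below; the individual checks are hypothetical placeholders for the Office-XML or DOM inspections described above.

```python
# Sketch of the PEI knockout test: climb the level ladder and stop at the
# first level whose requirement the deck fails. The predicate functions are
# hypothetical placeholders for real Office-XML / DOM inspections.
from typing import Callable

def has_editable_text(deck) -> bool: ...        # L1 check (placeholder)
def has_vector_shapes(deck) -> bool: ...        # L2 check (placeholder)
def has_grouping_or_master(deck) -> bool: ...   # L3 check (placeholder)
def has_data_bound_charts(deck) -> bool: ...    # L4 check (placeholder)
def has_animation(deck) -> bool: ...            # L5 check (placeholder)

PEI_LADDER: list[Callable] = [
    has_editable_text,
    has_vector_shapes,
    has_grouping_or_master,
    has_data_bound_charts,
    has_animation,
]

def pei_level(deck) -> int:
    """Return the highest consecutive PEI level the deck passes (0-5)."""
    level = 0
    for check in PEI_LADDER:
        if not check(deck):   # placeholders return None, so real checks must be supplied
            break
        level += 1
    return level
```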
4. Slides-Align1.5k: Preference-Aligned Human Evaluation
To calibrate algorithmic metrics, the Slides-Align1.5k dataset was constructed:
- It comprises over 1,500 slide decks generated from nine major systems, each evaluated on nine to ten scenarios, across seven broad real-world content domains.
- Human annotators compared all nine decks for each scenario using a web UI, producing reference quality rankings.
- Inter-human Spearman correlation is reported as ≈ 0.85 (std 0.12), with annotator pairs producing identical rankings about 45.3% of the time (Yang et al., 14 Jan 2026).
This resource provides a critical upper bound for benchmark alignment and facilitates rigorous validation of automated metrics.
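The agreement statistics used throughout (average Spearman ρ, its standard deviation, and the identical-ranking rate) can be computed along the lines of the sketch below; the per-scenario data layout is an assumption about how results are stored.

```python
# Sketch: compare metric-induced rankings against human reference rankings.
# `metric_scores` and `human_ranks` map scenario -> per-system values; the
# exact data layout is an assumption.
import numpy as np
from scipy.stats import spearmanr

def alignment_stats(metric_scores: dict[str, list[float]],
                    human_ranks: dict[str, list[int]]) -> dict:
    rhos, identical = [], 0
    for scenario, scores in metric_scores.items():
        gold = human_ranks[scenario]
        # Rank systems by metric score (rank 1 = highest score).
        order = np.argsort(np.argsort(-np.asarray(scores))) + 1
        rho, _ = spearmanr(order, gold)
        rhos.append(rho)
        if list(order) == list(gold):
            identical += 1
    return {
        "spearman_avg": float(np.mean(rhos)),
        "spearman_std": float(np.std(rhos)),
        "identical_rank_pct": 100.0 * identical / len(metric_scores),
    }
```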
5. Experimental Results and Benchmark Baselines
SlidesGen-Bench demonstrates significantly improved alignment with human assessments relative to LLM-based and heuristic approaches:
| Method | Spearman ρ (avg) | std(ρ) | Identical Rank (%) |
|---|---|---|---|
| SlidesGen-Bench | 0.71 | 0.16 | 32.6 |
| LLM-Judge (rating) | 0.57 | 0.23 | 20.7 |
| LLM-Judge (arena/Elo) | 0.52 | 0.27 | 17.3 |
| PPTAgent | 0.53 | 0.26 | 17.8 |
| Human upper bound | 0.85 | 0.12 | 45.3 |
Ablation on aesthetic metric components finds that all four features are synergistic; the full method achieves the highest empirical human alignment. In content transfer, the Zhipu system achieves the highest QuizBank accuracy at 88.3%, and editability results cluster most systems at PEI L₂, with Quark reaching L₃ (Yang et al., 14 Jan 2026).
6. Relationship to REFLEX and Broader Benchmarking Ecosystem
REFLEX and RefSlides provide the content evaluation backbone for SlidesGen-Bench:
- RefSlides: Contains 8,111 human-made decks across non-academic domains, enforcing tight curation on topical breadth, structure, and image/text diversity (Muppidi et al., 23 May 2025).
- REFLEX Metrics: Coverage, Redundancy, Text-Image Alignment, and Flow serve as core, reference-free axes for content assessment. REFLEX demonstrates higher correlation with human preference than heuristic or chain-of-thought LLM ratings.
- The combination of SlidesGen-Bench, RefSlides, and REFLEX enables a comprehensive, multimodal, reference-free paradigm for benchmarking, supporting both automated and qualitative evaluation dimensions (Muppidi et al., 23 May 2025).
7. Future Directions and Open Challenges
SlidesGen-Bench's framework suggests several avenues for extension:
- Animated and Multimodal Slide Evaluation: Extension beyond static slides to include build-in/build-out animations or embedded multimedia.
- Domain and Language Generalization: Incorporation of domain-specific and multilingual benchmarks to address specialized and globalized use cases.
- New Metrics: Application of the negative sampling and perturbation paradigm for new axes such as visual design, clarity, engagement, and domain-specific relevance.
- Systematic Human Preference Modeling: Deepening analyses of human-machine agreement and exploring alternative ranking and explanation protocols (Yang et al., 14 Jan 2026, Muppidi et al., 23 May 2025).
SlidesGen-Bench, anchored by compositional and human-aligned numeric metrics, establishes a framework for reproducible, scalable, and reliable evaluation in the rapidly advancing field of AI-driven slide generation.