
SlidesGen-Bench: Unified Slide Benchmark

Updated 21 January 2026
  • SlidesGen-Bench is a unified benchmark for automated slide generation that evaluates systems on content, aesthetics, and editability using image-based analysis.
  • It employs reproducible, closed-form metrics alongside OCR, layout detection, and semantic extraction to replace subjective assessments.
  • Experimental results show strong human correlation, establishing a robust, reference-free framework for comparing diverse slide generation methodologies.

SlidesGen-Bench is a unified, computational benchmark for evaluating automated slide generation systems. It operationalizes three core principles—universality, quantification, and reliability—by grounding all evaluation in the image domain, employing reproducible closed-form metrics, and calibrating these metrics against human preference data. SlidesGen-Bench provides a standardized, reference-free testbed for comparing diverse paradigms of slide generation, encompassing code-driven layouts, image-centric synthesis, and template-based workflows (Yang et al., 14 Jan 2026).

1. Foundational Principles and Design Philosophy

SlidesGen-Bench is structured around three guiding principles:

  • Universality: All slide generation outputs, regardless of their underlying representation (e.g., PowerPoint files, HTML/CSS slides, or static image sets), are rendered to raster images ("frames"). This image-domain evaluation ensures agnosticism to the generation pipeline and supports cross-system comparability.
  • Quantification: The benchmark defines three rigorous evaluation axes—Content, Aesthetics, and Editability—each computed through closed-form, algorithmic pipelines. This replaces subjective heuristics and LLM judgments with reproducible metrics.
  • Reliability: Each metric is validated for alignment with human preference through the Slides-Align1.5k dataset, which comprises extensive crowd-annotated rankings collected for a diverse set of generated decks. Statistical correlation between metric and human rankings is reported as the principal reliability indicator (Yang et al., 14 Jan 2026).

2. Unified Image-based Evaluation Framework

SlidesGen-Bench enforces strict input agnosticism by converting all outputs into slide images, enabling uniform downstream processing:

  • Layout Detection: PaddleOCR's layout model localizes text blocks, charts, and graphical elements.
  • Text and Content Extraction: Optical Character Recognition (OCR) is used for parsing text; semantic prompts to Vision-LLMs (VLMs) facilitate “open-book” content quizzing.
  • Aesthetic Analysis: Extraction of quantitative descriptors including color histograms, steerable pyramid-based entropy, and luminance profiles.
  • Editability Assessment: Evaluation of editable structures via inspection of underlying file formats (e.g., Office-XML) or the Document Object Model (for web-based decks).

This unified, image-centric approach guarantees that every system—regardless of internal pipeline, file format, or rendering engine—can be evaluated on strictly equal footing (Yang et al., 14 Jan 2026).
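
The following is a structural sketch of such an image-first pipeline, under the assumption that the OCR, layout-detection, and aesthetic-feature stages are supplied as callables (e.g., thin wrappers around PaddleOCR); all names here are illustrative rather than SlidesGen-Bench's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

@dataclass
class SlideFrame:
    """One rasterized slide plus whatever the analysis stages extract from it."""
    image_path: str
    text_blocks: List[str] = field(default_factory=list)                 # OCR output
    layout_boxes: List[Tuple[float, ...]] = field(default_factory=list)  # layout detection
    aesthetic_features: Dict[str, float] = field(default_factory=dict)   # histograms, entropy, luminance

def analyze_frames(
    frame_paths: Sequence[str],
    run_ocr: Callable[[str], List[str]],
    detect_layout: Callable[[str], List[Tuple[float, ...]]],
    extract_aesthetics: Callable[[str], Dict[str, float]],
) -> List[SlideFrame]:
    """Image-first evaluation: every deck, whatever its source format, is
    reduced to rasterized frames before any metric is computed, so the same
    analysis stages apply to PPTX, HTML/CSS, and image-only pipelines."""
    return [
        SlideFrame(
            image_path=path,
            text_blocks=run_ocr(path),
            layout_boxes=detect_layout(path),
            aesthetic_features=extract_aesthetics(path),
        )
        for path in frame_paths
    ]
```

Because every deck is reduced to frames before any metric runs, swapping in a different OCR or layout model changes only the injected callables, not the scoring code.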

3. Quantitative Metrics: Content, Aesthetics, and Editability

SlidesGen-Bench formalizes three distinct assessment axes, each with domain-specific computation:

Content Quality

Content measures the fidelity of concept and fact transfer from the source document to the generated slides. The "QuizBank" protocol is implemented:

  • For each evaluation, N = 10 multiple-choice questions (5 concept, 5 data) are crafted from the source.
  • Slide content is converted to Markdown and presented, along with the questions, to an LLM acting as an examiner.
  • The metric is defined as:

$$S_{\mathrm{content}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left(\hat{y}_i = y_i^{\mathrm{gt}}\right)$$

where $\hat{y}_i$ is the system's answer to question $i$ and $y_i^{\mathrm{gt}}$ is the human gold answer.
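
A minimal sketch of the closed-form part of this metric, assuming the examiner LLM's answers have already been collected; the function and variable names are illustrative, not SlidesGen-Bench's actual API.

```python
def quizbank_score(predicted: list[str], gold: list[str]) -> float:
    """Content score: fraction of quiz questions the examiner LLM answers
    correctly when shown only the generated slides (converted to Markdown).
    `predicted` holds the LLM's chosen options, `gold` the human answer key."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold answer lists must be non-empty and aligned")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Example with N = 10 questions (5 concept + 5 data):
# quizbank_score(["B", "A", "C", ...], ["B", "C", "C", ...])  -> value in [0, 1]
```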

Aesthetics

A composite score derived from four components:

  • Harmony: Color-template fit in HSV space, penalized by inter-slide hue variance.
  • Engagement: Composite of deck colorfulness and pacing, with colorfulness derived from the variances of the opponent color channels ($rg$, $yb$) and pacing quantified via the standard deviation of per-slide color metrics.
  • Usability: Luminance contrast of text versus background, normalized by natural logarithms to $[0,1]$.
  • Visual Rhythm (VHRV): Subband entropy (computed in CIE-Lab) and the root mean square of entropy change across the slide sequence.

The overall aesthetic score $S_{\mathrm{aesth}}$ is the sum of the four components:

$$S_{\mathrm{aesth}} = S_{\mathrm{harmony}} + S_{\mathrm{engagement}} + S_{\mathrm{usability}} + S_{\mathrm{VHRV}}$$
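
As an illustration of the Engagement component's colorfulness term and the additive composite, here is a minimal sketch assuming the widely used Hasler-Süsstrunk colorfulness statistic over the $rg$/$yb$ channels; SlidesGen-Bench's exact normalization and weighting may differ.

```python
import numpy as np

def colorfulness(rgb: np.ndarray) -> float:
    """Hasler-Susstrunk colorfulness of one HxWx3 RGB frame in [0, 255].
    Built on the rg / yb opponent channels referenced by the Engagement
    component; the exact variant used by SlidesGen-Bench is not specified here."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    rg = r - g
    yb = 0.5 * (r + g) - b
    std_term = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_term = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return std_term + 0.3 * mean_term

def aesthetic_score(harmony: float, engagement: float,
                    usability: float, vhrv: float) -> float:
    """S_aesth as the plain sum of the four (already normalized) components."""
    return harmony + engagement + usability + vhrv
```

Deck-level pacing, per the Engagement definition above, would then be the standard deviation of `colorfulness` across all frames of a deck.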

Editability (PEI Taxonomy)

Editability is measured by the highest level achieved in the Presentation Editability Index (PEI), assessed by a "knockout" test:

| Level | Characterization (abbreviated) |
|-------|--------------------------------|
| L₀ | Static raster / dead text |
| L₁ | Editable text, but ungrouped |
| L₂ | Vector shapes, lacking hierarchy |
| L₃ | Structural grouping / master present |
| L₄ | Proper data-bound charts |
| L₅ | Cinematic / animated content |

The score is computed as:

$$\mathrm{Level} = \max\{\ell \in \{0, \dots, 5\} : C_1, \dots, C_\ell \text{ all pass}\}$$

where $C_\ell$ denotes the knockout check for level $\ell$.
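
A minimal sketch of how such a knockout evaluation might be coded, assuming the per-level checks are supplied as boolean predicates; the predicate names in the usage comment are hypothetical.

```python
from typing import Callable, Sequence

def pei_level(checks: Sequence[Callable[[], bool]]) -> int:
    """Knockout test for the Presentation Editability Index (PEI).
    `checks[i]` is the predicate C_{i+1} guarding level i+1. The result is the
    largest L such that C_1 ... C_L all pass; the first failing check stops the
    climb, so a deck with dead text can never score on chart editability."""
    level = 0
    for check in checks:
        if not check():
            break
        level += 1
    return level

# Hypothetical usage for an Office-XML deck:
# pei_level([has_editable_text, has_vector_shapes, has_group_or_master,
#            has_databound_charts, has_animations])   # -> 0..5
```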

4. Slides-Align1.5k: Preference-Aligned Human Evaluation

To calibrate algorithmic metrics, the Slides-Align1.5k dataset was constructed:

  • It comprises over 1,500 slide decks generated from nine major systems, each evaluated on nine to ten scenarios, across seven broad real-world content domains.
  • Human annotators compared all nine decks for each scenario using a web UI, producing reference quality rankings.
  • Inter-human Spearman correlation is reported as ≈ 0.85 (std 0.12), with decks achieving identical ranking about 45.3% of the time (Yang et al., 14 Jan 2026).

This resource provides a critical upper bound for benchmark alignment and facilitates rigorous validation of automated metrics.
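
For concreteness, here is a sketch of how metric-human alignment can be computed per scenario with SciPy's `spearmanr`; the negation convention and the aggregation over scenarios are assumptions, not the paper's exact protocol.

```python
from scipy.stats import spearmanr

def alignment(metric_scores: list[float], human_ranks: list[int]) -> float:
    """Spearman correlation between one metric's scores for the decks of a
    scenario and the human reference ranking (rank 1 = best deck). Scores are
    negated so that higher metric values align with better (lower) ranks."""
    rho, _ = spearmanr([-s for s in metric_scores], human_ranks)
    return float(rho)

# Averaging `alignment` over all Slides-Align1.5k scenarios yields figures
# comparable to the Spearman rho (avg) column in Section 5.
```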

5. Experimental Results and Benchmark Baselines

SlidesGen-Bench demonstrates significantly improved alignment with human assessments relative to LLM-based and heuristic approaches:

| Method | Spearman ρ (avg) | std(ρ) | Identical Rank (%) |
|--------|------------------|--------|--------------------|
| SlidesGen-Bench | 0.71 | 0.16 | 32.6 |
| LLM-Judge (rating) | 0.57 | 0.23 | 20.7 |
| LLM-Judge (arena/Elo) | 0.52 | 0.27 | 17.3 |
| PPTAgent | 0.53 | 0.26 | 17.8 |
| Human upper bound | 0.85 | 0.12 | 45.3 |

Ablation on aesthetic metric components finds that all four features are synergistic; the full method achieves the highest empirical human alignment. In content transfer, the Zhipu system achieves the highest QuizBank accuracy at 88.3%, and editability results cluster most systems at PEI L₂, with Quark reaching L₃ (Yang et al., 14 Jan 2026).

6. Relationship to REFLEX and Broader Benchmarking Ecosystem

REFLEX and RefSlides provide the content evaluation backbone for SlidesGen-Bench:

  • RefSlides: Contains 8,111 human-made decks across non-academic domains, enforcing tight curation on topical breadth, structure, and image/text diversity (Muppidi et al., 23 May 2025).
  • REFLEX Metrics: Coverage, Redundancy, Text-Image Alignment, and Flow serve as core, reference-free axes for content assessment. REFLEX demonstrates higher correlation with human preference than heuristic or chain-of-thought LLM ratings.
  • The combination of SlidesGen-Bench, RefSlides, and REFLEX enables a comprehensive, multimodal, reference-free paradigm for benchmarking, supporting both automated and qualitative evaluation dimensions (Muppidi et al., 23 May 2025).

7. Future Directions and Open Challenges

SlidesGen-Bench's framework suggests several avenues for extension:

  • Animated and Multimodal Slide Evaluation: Extension beyond static slides to include build-in/build-out animations or embedded multimedia.
  • Domain and Language Generalization: Incorporation of domain-specific and multilingual benchmarks to address specialized and globalized use cases.
  • New Metrics: Application of the negative sampling and perturbation paradigm for new axes such as visual design, clarity, engagement, and domain-specific relevance.
  • Systematic Human Preference Modeling: Deeper analysis of human-machine agreement and exploration of alternative ranking and explanation protocols (Yang et al., 14 Jan 2026, Muppidi et al., 23 May 2025).

SlidesGen-Bench, anchored by compositional and human-aligned numeric metrics, establishes a framework for reproducible, scalable, and reliable evaluation in the rapidly advancing field of AI-driven slide generation.
