
SlidesGen-Bench Evaluation Protocol

Updated 2 March 2026
  • The paper introduces a rigorous evaluation protocol using headless rendering and computational metrics, achieving alignment with human judgment through quantitative aesthetics and content fidelity measures.
  • SlidesGen-Bench evaluates diverse slide generation paradigms—template-based, code-driven, and image-centric—based solely on their rendered PNG outputs for unbiased comparison.
  • The protocol leverages a large-scale human-annotated dataset and interpretable metrics such as QuizBank accuracy, aesthetic submetrics, and editability scales to ensure reproducibility and practical insights.

SlidesGen-Bench Evaluation Protocol defines a unified, computationally grounded, and reference-free benchmark for the assessment of automated slide generation systems. Developed to address the challenges of evaluating heterogeneous generation paradigms—template-based, code-driven, and image-centric—SlidesGen-Bench enforces comparability, reproducibility, and alignment with human preference through visual-domain metrics and a large-scale human-annotated dataset. By grounding evaluation on rendered outputs and eschewing intermediate format dependencies or uncalibrated LLM judgments, it operationalizes a rigorous protocol for the next generation of slide synthesis research and deployment (Yang et al., 14 Jan 2026).

1. Core Principles: Universality, Quantification, Reliability

SlidesGen-Bench is built on three foundational tenets:

  • Universality: All systems, regardless of implementation (PPTX, HTML/CSS/JS, image generation), are evaluated solely on their visual outputs. Intermediate representations are ignored; each output, whether a native presentation, web export, or synthesized bitmap, is rendered headlessly—using python-pptx, LibreOffice, or Chromium screenshotting—into a canonical PNG stack. This guarantees architectural agnosticism and direct metric comparability (Yang et al., 14 Jan 2026).
  • Quantification: The framework dispenses with subjective scoring and reference bias by introducing computational, statistically normalized metrics spanning content, aesthetics, and editability, each defined so that re-executing the released code reproduces identical scores.
  • Reliability: Alignment with human judgment is enforced using the Slides-Align1.5k corpus—a dataset with crowd-annotated system preference rankings across multiple scenarios and generator architectures. Protocol reliability is quantified by the Spearman ρ and identical-ratio metrics, with the combined aesthetic ranking achieving ρ=0.71 and identical-ratio=32.6%, outperforming LLM-based judging approaches (ρ≤0.57, identical ≤21%) (Yang et al., 14 Jan 2026).

2. Metric Taxonomy and Definitions

Content Fidelity: QuizBank Accuracy

Content assessment leverages a “QuizBank” open-book exam-based approach. For each source document, a Gold Standard QuizBank is constructed with 10 multiple-choice questions (5 Conceptual, 5 Data-driven) tied to canonical slide content. Generated decks are parsed via OCR and JSON layout detectors into Markdown capturing extracted headlines, bullets, data, and key visuals. An LLM (GPT-4, temperature=0, fixed seed) answers the QuizBank using only the parsed content, and the primary metric is accuracy:

\mathrm{Accuracy} = \frac{\#\text{correct answers}}{\#\text{total questions}}

This approach normalizes content evaluation independent of layout or superficial style (Yang et al., 14 Jan 2026).
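A minimal sketch of this scoring loop, assuming a hypothetical `ask_llm` helper that submits the parsed Markdown plus one question to the judge model (temperature 0, fixed seed) and returns a letter choice; the prompt wording and data schema below are illustrative, not the released implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QuizItem:
    question: str
    options: List[str]   # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str          # gold letter, e.g. "B"
    kind: str            # "conceptual" or "data"

def quizbank_accuracy(parsed_markdown: str,
                      quizbank: List[QuizItem],
                      ask_llm: Callable[[str], str]) -> float:
    """Open-book accuracy: the judge LLM sees only the parsed deck content."""
    correct = 0
    for item in quizbank:
        prompt = (
            "Answer using ONLY the slide content below.\n\n"
            f"SLIDES:\n{parsed_markdown}\n\n"
            f"QUESTION: {item.question}\n" + "\n".join(item.options) +
            "\nReply with a single letter."
        )
        if ask_llm(prompt).strip().upper().startswith(item.answer):
            correct += 1
    return correct / len(quizbank)
```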

Aesthetics: Four-Factor Image Metrics

Aesthetics are evaluated using four orthogonal, image-based metrics:

  1. Harmony: Quantifies color palette adherence using Cohen-Or template matching in HSV space:

D_i = \min_{T,\alpha} \frac{\sum_{p \in \text{pixels}} S_p\,\mathrm{dist}(H_p, T_\alpha)}{\sum_{p} S_p}

Deviations are mapped to normalized slide scores, penalized at the deck level by standard deviation across slides.

  2. Engagement: Based on Hasler–Süsstrunk colorfulness and pacing smoothness, engagement is calculated through per-slide chromatic variance and deck-wise colorfulness fluctuation penalties:

M_i = \sqrt{\sigma_{rg}^2 + \sigma_{yb}^2} + 0.3\,\sqrt{\mu_{rg}^2 + \mu_{yb}^2}

Aggregate engagement is normalized to [0,10].

  3. Usability: Reflects figure-ground contrast essential for visual legibility. Contrast ratio for detected text regions is calculated and scaled:

c = \frac{L_{\max} + 0.05}{L_{\min} + 0.05}

S_{\mathrm{contrast}} = \frac{\ln(c)}{\ln(21)}

  4. Visual Rhythm (VisualHRV): Assessed through subband entropy and RMSSD of CIE-Lab pyramid features across slide sequences. Over-complexity penalization ensures rhythm scores reflect both regularity and design variation.

The total aesthetics score is the sum Usability + Engagement + Harmony + VisualHRV, with each submetric scaled to [0, 10]; a combined code sketch of the four submetrics follows below.
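The following sketch illustrates, under simplifying assumptions, how the four submetrics can be approximated from rendered PNGs with NumPy/OpenCV: Hasler–Süsstrunk colorfulness for Engagement, WCAG-style luminance contrast for Usability, RMSSD over a per-slide complexity series for VisualHRV, and a crude single-sector stand-in for the Cohen-Or harmony templates. The exact templates, normalization constants, and deck-level penalties of the released code will differ; function names and the text mask input are assumptions.

```python
import numpy as np
import cv2  # OpenCV, assumed available for colour-space handling

def engagement_colorfulness(img_bgr: np.ndarray) -> float:
    """Hasler-Suesstrunk colourfulness M_i of one rendered slide."""
    b, g, r = cv2.split(img_bgr.astype(np.float32))
    rg, yb = r - g, 0.5 * (r + g) - b
    return float(np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                 + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))

def usability_contrast(img_rgb: np.ndarray, text_mask: np.ndarray) -> float:
    """S_contrast = ln(c)/ln(21) over detected text regions (mask from the layout parser)."""
    c_lin = img_rgb.astype(np.float32) / 255.0
    c_lin = np.where(c_lin <= 0.04045, c_lin / 12.92, ((c_lin + 0.055) / 1.055) ** 2.4)
    lum = 0.2126 * c_lin[..., 0] + 0.7152 * c_lin[..., 1] + 0.0722 * c_lin[..., 2]
    region = lum[text_mask]
    c = (region.max() + 0.05) / (region.min() + 0.05)
    return float(np.log(c) / np.log(21.0))

def harmony_deviation(img_bgr: np.ndarray) -> float:
    """Saturation-weighted hue deviation D_i, with one rotating hue sector
    standing in for the full Cohen-Or template set (simplification)."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hue_deg, sat = hsv[..., 0] * 2.0, hsv[..., 1] / 255.0  # OpenCV hue is in [0, 180)
    best = np.inf
    for alpha in range(0, 360, 15):  # rotate the template and keep the best fit
        d = np.minimum(np.abs(hue_deg - alpha), 360.0 - np.abs(hue_deg - alpha))
        best = min(best, float((sat * d).sum() / (sat.sum() + 1e-8)))
    return best

def visual_hrv(per_slide_complexity: list) -> float:
    """RMSSD of a per-slide complexity series (e.g. subband entropy)."""
    diffs = np.diff(np.asarray(per_slide_complexity, dtype=float))
    return float(np.sqrt(np.mean(diffs ** 2))) if diffs.size else 0.0

def total_aesthetics(usability: float, engagement: float,
                     harmony: float, visual_rhythm: float) -> float:
    """Sum of the four submetrics, each assumed pre-scaled to [0, 10]."""
    return usability + engagement + harmony + visual_rhythm
```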

Editability: Presentation Editability Intelligence (PEI)

Editability is measured using a deterministic taxonomy:

  • L₀ (Static Image): No editable text; pure bitmap output
  • L₁ (Patchwork): OCR-fragmented text; minimal editability
  • L₂ (Vector): Vector primitives, lacking master slides or structure
  • L₃ (Structural): Uses native PPT features (<p:sldMaster>, grouping)
  • L₄ (Parametric): Chart/table data bindings, editable SmartArt
  • L₅ (Cinematic): Animations, multimedia, embedded controls

Classification starts at L₅ and steps down level by level until a deck's criteria are matched; the assigned PEI is the highest level for which all subordinate requirements are also satisfied (Yang et al., 14 Jan 2026).
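A schematic of this top-down, knock-down procedure, assuming boolean deck features already extracted from the PPTX or rendered output (the feature names are hypothetical placeholders, not the released detector outputs):

```python
from typing import Dict

# Requirements per level, from most to least capable; names are illustrative.
PEI_REQUIREMENTS = {
    5: ["has_animations_or_media"],        # Cinematic
    4: ["has_data_bindings"],              # Parametric
    3: ["uses_slide_master_and_groups"],   # Structural
    2: ["has_vector_primitives"],          # Vector
    1: ["has_any_editable_text"],          # Patchwork
    0: [],                                 # Static Image (always satisfiable)
}

def assign_pei(features: Dict[str, bool]) -> int:
    """Start at L5 and step down; a level is assigned only when its own
    requirements and those of every lower level are all met."""
    for level in sorted(PEI_REQUIREMENTS, reverse=True):
        needed = [req for lv in range(level, -1, -1) for req in PEI_REQUIREMENTS[lv]]
        if all(features.get(req, False) for req in needed):
            return level
    return 0
```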

3. Unified Visual-Domain Evaluation Pipeline

All systems are evaluated via a headless rendering and parsing pipeline:

  • Rendering: Native PPTX, HTML, or image outputs are normalized into PNGs. PPTX is exported via headless Office stack; HTML via Chromium full-page screenshotting.
  • Layout Parsing and Feature Extraction: PP-DocLayout_plus-L is used for robust cross-system detection of textual and graphical regions.
  • Metric Computation: Image metrics, PEI extraction, and OCR-LLM content parsing are run directly on rendered visuals or raw PPTX. All metric code is released with fixed random seeds for strict reproducibility.

This pipeline guarantees metric consistency and allows rigorous cross-paradigm comparisons (Yang et al., 14 Jan 2026).
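One possible realization of the rendering stage described above, assuming a local LibreOffice (`soffice`) install for PPTX, the pdf2image/poppler toolchain for rasterization, and a headless Chromium binary for HTML; the exact converters, resolutions, and filters used by the released pipeline may differ:

```python
import subprocess
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler on the system

def render_pptx_to_pngs(pptx: Path, out_dir: Path) -> list:
    """PPTX -> PDF via headless LibreOffice, then one PNG per slide."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["soffice", "--headless", "--convert-to", "pdf",
                    "--outdir", str(out_dir), str(pptx)], check=True)
    pdf = out_dir / (pptx.stem + ".pdf")
    paths = []
    for i, page in enumerate(convert_from_path(str(pdf), dpi=150)):
        p = out_dir / f"slide_{i:03d}.png"
        page.save(p)
        paths.append(p)
    return paths

def render_html_to_png(html: Path, out_png: Path, width=1280, height=720) -> None:
    """Full-page screenshot of an HTML slide with headless Chromium
    (binary name may be chromium, chromium-browser, or google-chrome)."""
    subprocess.run(["chromium", "--headless", f"--screenshot={out_png}",
                    f"--window-size={width},{height}", html.resolve().as_uri()],
                   check=True)
```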

4. Slides-Align1.5k: Human Alignment Dataset

Slides-Align1.5k provides a benchmark for preference alignment:

  • Dataset Composition: 1,500+ decks generated from 189 source instructions spanning seven domains (Brand, Business, Product, Work, Course, Topic, Personal), each passed through nine generation pipelines.
  • Annotation Protocol: Web-based interface with 3–5 annotators per instance; preferences aggregated into gold rankings.
  • Reliability Measurement: Spearman ρ computed between automated (SlidesGen-Bench) and human rankings, with the identical-ratio metric reporting the proportion of perfect algorithm–human agreement.

In corpus-level evaluation, SlidesGen-Bench metrics achieve unmatched performance for human alignment, with a plausible implication that computational, factorized metrics—when tuned using such datasets—outperform monolithic LLM-based scoring (Yang et al., 14 Jan 2026).
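A minimal sketch of the two reliability statistics, assuming per-instruction system rankings are available as lists of system identifiers ordered best to worst; scipy.stats.spearmanr supplies the correlation, and the aggregation details here are illustrative rather than the released procedure:

```python
import numpy as np
from scipy.stats import spearmanr

def reliability(auto_rankings, human_rankings):
    """Mean Spearman rho and identical-ratio between automated and human rankings."""
    rhos, identical = [], 0
    for auto, human in zip(auto_rankings, human_rankings):
        # Convert each ranking (best -> worst) into rank positions per system.
        systems = sorted(auto)
        auto_pos = [auto.index(s) for s in systems]
        human_pos = [human.index(s) for s in systems]
        rho, _ = spearmanr(auto_pos, human_pos)
        rhos.append(rho)
        identical += int(auto == human)
    return float(np.mean(rhos)), identical / len(auto_rankings)
```

A perfectly aligned protocol would yield a mean rho of 1.0 and an identical-ratio of 100%; the reported combined aesthetic ranking reaches rho = 0.71 and identical-ratio = 32.6%.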

5. Empirical Results and Ablation Analyses

Experimental benchmarking on SlidesGen-Bench reveals:

  • Content Fidelity: Code-driven systems (Zhipu PPT) achieve top-1 QuizBank accuracy (88.3% overall), with domain-specific lows in business topics (61.6%).
  • Aesthetics: Image-based systems (Skywork-Banana) lead in total aesthetic score (27.28), outperforming template and code-centric architectures.
  • Editability: Only programmatic/native PPTX generators are assigned PEI ≥ L₃; image-based systems saturate at L₀/L₁.
  • Ablation Findings: Among aesthetic factors, VisualHRV correlates most strongly with human judgment (ρ=0.618), but combining all four metrics yields maximal alignment (ρ=0.71).

A plausible implication is that modular, interpretable aesthetic submetrics capture dimensions of human taste that are washed out by single-metric or black-box judgers (Yang et al., 14 Jan 2026).

6. Implementation and Reproducibility

All rendering, parsing, and evaluation code, as well as the Slides-Align1.5k dataset and parameter settings, are publicly available (https://github.com/YunqiaoYang/SlidesGen-Bench). The protocol employs fixed seeds for all stochastic processes (rendering, LLM sampling, scoring), and precise versioning of all external dependencies. The evaluation encompasses nine generation architectures, supporting robust statistical comparison and fostering reproducibility in model validation, system competition, and future extension (Yang et al., 14 Jan 2026).

7. Significance and Future Directions

SlidesGen-Bench establishes a reference-free, computationally verifiable, and visually grounded protocol for automated slide generation assessment. By anchoring itself in universal rendered output, the benchmark removes format bias and enables reliable system comparison across the rapidly evolving slide generation ecosystem. Its factorized metric design, human-alignment validation, and public releases provide a robust foundation for future developments in both model-based design tools and agentic pipeline evaluators. A plausible implication is that comparable design and evaluation protocols could generalize to other structured visual communication domains, such as document synthesis, chart generation, or educational content creation (Yang et al., 14 Jan 2026).
