SlidesGen-Bench: Automated Slide Evaluation Protocol
- SlidesGen-Bench Protocol is a unified, visually grounded methodology that quantifies automated slide generation using deterministic, mathematically specified metrics.
- It evaluates slide quality across content fidelity, aesthetics, and editability by converting diverse outputs into fixed-resolution bitmaps for consistent analysis.
- The protocol leverages human-aligned datasets and open-source tools to ensure reliable, reproducible benchmarking of varied generative slide approaches.
SlidesGen-Bench Protocol defines a unified, visually grounded, and computationally rigorous evaluation methodology for automated presentation slide generation. It addresses the substantial heterogeneity of generative pipelines—including code-driven, template-based, and image-centric approaches—by anchoring all evaluation in the rendered visual output. The protocol replaces subjective or opaque quality measures with deterministic, mathematically specified metrics for content fidelity, aesthetics, and editability, and establishes benchmarking reliability through human-aligned datasets and criteria. SlidesGen-Bench is grounded in open-source implementations and transparent data, enabling reproducible and extensible assessment of diverse slide generation systems (Yang et al., 14 Jan 2026).
1. Foundational Principles
SlidesGen-Bench is governed by three core principles: universality, quantification, and reliability.
- Universality: All slide generation systems, regardless of modality or architecture, are reduced to their terminal visual output. Every model’s product (PPTX, HTML, web viewer, or raster image) is rendered into a fixed-resolution bitmap and subjected to the same downstream analytic pipeline (Yang et al., 14 Jan 2026). This precludes reliance on system-dependent intermediate representations such as XML trees, CSS, or API hooks, ensuring protocol agnosticism.
- Quantification: Slide quality is operationalized as a multi-dimensional construct comprising content, aesthetics, and editability. Each dimension is mapped to one or more formal metrics, calculated deterministically on the final bitmap or derived text/region elements. All metrics are normalized, with deterministic aggregation across slides for reliable benchmarking.
- Reliability: The protocol mandates correlation of all computational metrics with human preference. To this end, it leverages Slides-Align1.5k—a dataset of human-annotated preferences for slides generated by nine state-of-the-art systems, spanning seven application scenarios. The alignment of protocol metrics with human rankings is quantified via Spearman’s ρ and “identical” ratio (exact agreement) (Yang et al., 14 Jan 2026).
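As a concrete illustration of the reliability check, the sketch below computes Spearman's ρ and the identical (exact-agreement) ratio between a metric-induced ranking and a human ranking of candidate systems for one scenario. The function names, the dense-ranking convention, and the reading of "identical" as per-system rank agreement are illustrative assumptions, not the protocol's exact implementation.

```python
from scipy.stats import spearmanr

def dense_rank(scores):
    """Rank systems by score, highest first (rank 0 is best)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def alignment(metric_scores, human_scores):
    """Spearman's rho and exact-agreement ratio between two score lists.

    metric_scores, human_scores: parallel per-system scores for one
    evaluation scenario (names and granularity are illustrative).
    """
    rho, _ = spearmanr(metric_scores, human_scores)
    m_rank, h_rank = dense_rank(metric_scores), dense_rank(human_scores)
    identical = sum(m == h for m, h in zip(m_rank, h_rank)) / len(m_rank)
    return rho, identical
```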
2. Data Curation and Ground-Truth Schema
The protocol specifies standardized data handling to ensure reproducibility and accuracy:
- Sourcing and Filtering: PPTX decks are sampled from Zenodo10K (CC-BY 4.0), typically N=100 decks yielding 1,948 slides, excluding legacy formats and decks with less than 70% English content (Kang et al., 24 Oct 2025).
- Ground-Truth Extraction: Slides are parsed via PowerPoint XML to extract element geometry and style, followed by COM-API pass to determine exact font metrics and canonical text-box bounds. Each slide is rasterized at 960×540 px for visual debugging (Kang et al., 24 Oct 2025).
- Unified Schema: Each slide’s ground truth is exported as a JSON object specifying size, background color, and lists of texts, rects, lines, images, and tables. Coordinates are given as integers in pixels, fonts and strokes in points, and colors as hexadecimal strings (Kang et al., 24 Oct 2025); an example object is sketched after the table below.
| Field | Type/Unit | Example Value |
|---|---|---|
| size | w, h (int, px) | 960, 540 |
| background | color (#RRGGBB) | "#FBFBFB" |
| texts | x, y, w, h (px); font (pt); color | array (per element) |
| rects/lines/images | geometry (px); color | array (per element) |
| tables | x, y, w, h (px); cells | array (per element) |
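A minimal example of one slide's ground-truth object, written here as a Python literal rather than raw JSON for readability; the top-level field names follow the schema table above, while all concrete values, nested key names, and the media path are illustrative assumptions.

```python
# Illustrative ground-truth record for a single slide (values are invented;
# top-level field names and units follow the schema table above).
slide_gt = {
    "size": {"w": 960, "h": 540},              # pixels
    "background": "#FBFBFB",                   # hex color string
    "texts": [
        {"x": 64, "y": 48, "w": 832, "h": 72,  # pixels
         "text": "Quarterly Results",
         "font": {"family": "Calibri", "size_pt": 32, "bold": True},  # points
         "color": "#1A1A1A"},
    ],
    "rects": [
        {"x": 64, "y": 140, "w": 400, "h": 300, "fill": "#DCE6F1"},
    ],
    "lines": [],
    "images": [
        {"x": 500, "y": 140, "w": 396, "h": 300, "path": "media/chart.png"},
    ],
    "tables": [],
}
```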
3. Multi-Axis Evaluation Procedures
Three complementary evaluation axes are employed:
3.1 Content Fidelity
- QuizBank Accuracy Metric: For each instruction set, 10 multiple-choice questions (5 concept-based, 5 data-based) are generated by an LLM cascade. OCR and layout detection extract raw text claims, chart values, and visual descriptions from the rendered slide stack; an LLM evaluator answers these questions “open-book” using only the slide-derived context (Yang et al., 14 Jan 2026).
- Scoring Formula: per deck $d$, accuracy is the fraction of quiz questions answered correctly from the slide-derived context, $\mathrm{Acc}_d = \frac{1}{|Q_d|}\sum_{q \in Q_d}\mathbf{1}[\hat{a}_q = a_q^{*}]$, where $\hat{a}_q$ is the evaluator's answer and $a_q^{*}$ the reference key. Aggregated over decks, the content score is the mean deck accuracy, $\mathrm{Content} = \frac{1}{D}\sum_{d=1}^{D}\mathrm{Acc}_d$.
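A sketch of the scoring step described above; the dictionary keys and the equal-weight aggregation over decks are illustrative assumptions.

```python
def quiz_accuracy(questions, evaluator_answers):
    """Fraction of a deck's quiz questions answered correctly.

    questions: list of dicts with an "answer_key" field (assumed name).
    evaluator_answers: the open-book LLM evaluator's answers, same order.
    """
    correct = sum(1 for q, a in zip(questions, evaluator_answers)
                  if a == q["answer_key"])
    return correct / len(questions)

def content_score(decks):
    """Mean per-deck quiz accuracy across the benchmark."""
    accs = [quiz_accuracy(d["questions"], d["evaluator_answers"]) for d in decks]
    return sum(accs) / len(accs)
```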
3.2 Aesthetics
- Harmony: Calculated by fitting each slide’s hue distribution (HSV, weighted by saturation) against canonical color-harmony templates using a minimum angular-deviation objective and Gaussian decay scoring (Yang et al., 14 Jan 2026).
- Engagement: Quantified via Hasler–Süsstrunk colorfulness index and pacing based on the standard deviation of colorfulness across the deck (Yang et al., 14 Jan 2026).
- Usability: Measured as figure–ground contrast using the WCAG-derived luminance contrast ratio, mapped to [0,1].
- Visual Rhythm: Computed via multi-scale steerable pyramid entropy (Lab color space) and temporal rhythm metrics (RMSSD of entropy changes across slides).
- Aggregate Score: the normalized harmony, engagement, usability, and visual-rhythm sub-scores are combined into a single deck-level aesthetics score, $\mathrm{Aesthetics} = \tfrac{1}{4}(S_{\mathrm{harm}} + S_{\mathrm{eng}} + S_{\mathrm{usab}} + S_{\mathrm{rhythm}})$ under equal weighting.
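Two of the aesthetic terms rest on standard, fully specified formulas: the Hasler–Süsstrunk colorfulness index (engagement) and the WCAG relative-luminance contrast ratio (usability). The sketch below implements both; the final linear mapping of the contrast ratio onto [0, 1] is an assumption, since the protocol's exact normalization constants are not reproduced here.

```python
import numpy as np

def colorfulness(rgb):
    """Hasler–Süsstrunk colorfulness of an H x W x 3 uint8 image."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    rg = r - g
    yb = 0.5 * (r + g) - b
    sigma = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mu = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return sigma + 0.3 * mu

def relative_luminance(color):
    """WCAG relative luminance of an sRGB color with channels in [0, 255]."""
    def linearize(c):
        c /= 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(float(c)) for c in color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG figure-ground contrast ratio, in [1, 21]."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    return (max(l1, l2) + 0.05) / (min(l1, l2) + 0.05)

def usability_score(fg, bg):
    """Map the contrast ratio onto [0, 1]; the linear map is an assumption."""
    return (contrast_ratio(fg, bg) - 1.0) / 20.0
```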
3.3 Editability
- PEI Taxonomy: Slide outputs are assigned a Presentation Editability Index (PEI) level (L0–L5), where L0=static image, L1="patchwork" OCR text, L2=vector shapes with text, L3=structural objects (grouping/master slide), L4=parametric (chart/SmartArt objects), and L5=cinematic (animation/media). The assigned level is the highest at which all criteria up to that level are met (Yang et al., 14 Jan 2026).
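A sketch of the cumulative assignment rule: level criteria are checked in ascending order and the assigned PEI is the highest level up to which every check passes. The has_* flags are placeholders standing in for the concrete detectors behind each level, not the protocol's actual implementation.

```python
def assign_pei(slide):
    """Return the highest PEI level (0-5) whose criteria, and all lower
    criteria, are satisfied. Flag names are illustrative placeholders."""
    criteria = [
        lambda s: True,                      # L0: a static rendering always exists
        lambda s: s["has_ocr_text_boxes"],   # L1: "patchwork" OCR text layer
        lambda s: s["has_vector_shapes"],    # L2: vector shapes with editable text
        lambda s: s["has_structure"],        # L3: grouping / master-slide structure
        lambda s: s["has_parametric"],       # L4: chart / SmartArt objects
        lambda s: s["has_media"],            # L5: animation / embedded media
    ]
    level = 0
    for lvl, check in enumerate(criteria):
        if check(slide):
            level = lvl
        else:
            break
    return level
```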
4. Element-Level and Structural Analysis
A complementary protocol, directly inspired by VLM-SlideEval, operationalizes more granular structural analysis:
- Element extraction: For each slide PNG, a VLM is prompted to output element lists (texts, rects, lines, etc.) in strict JSON format.
- Alignment via Hungarian Matching: Elements are matched between prediction and ground truth according to a blended cost matrix incorporating geometry (IoU), center deviation, relative size, and content similarity (sequence matcher or CLIPScore). Only matches below a threshold cost are retained. Precision, recall, and F1 are computed as standard; geometry, style, and content errors are aggregated across matches (Kang et al., 24 Oct 2025). A matching sketch follows this list.
- Higher-Level Comprehension: Narrative structure is probed by shuffling slide order and measuring the VLM’s ability to reconstruct the correct permutation; Kendall’s τ, Spearman’s ρ, and exact-match fractions are reported (Kang et al., 24 Oct 2025).
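A sketch of the element-matching step referenced above, assuming equal weights on the geometry, center, size, and content terms, a sequence-matcher content cost, and a 960×540 canvas for normalization; the protocol's actual weights and rejection threshold may differ.

```python
import numpy as np
from difflib import SequenceMatcher
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def match_elements(pred, gt, max_cost=0.5, canvas=(960, 540)):
    """Hungarian matching over a blended cost matrix (weights are assumptions)."""
    w, h = canvas
    cost = np.zeros((len(pred), len(gt)))
    for i, p in enumerate(pred):
        for j, g in enumerate(gt):
            geo = 1.0 - iou(p["box"], g["box"])
            pc = (p["box"][0] + p["box"][2] / 2, p["box"][1] + p["box"][3] / 2)
            gc = (g["box"][0] + g["box"][2] / 2, g["box"][1] + g["box"][3] / 2)
            center = np.hypot(pc[0] - gc[0], pc[1] - gc[1]) / np.hypot(w, h)
            size = abs(p["box"][2] * p["box"][3] - g["box"][2] * g["box"][3]) / (w * h)
            content = 1.0 - SequenceMatcher(None, p.get("text", ""), g.get("text", "")).ratio()
            cost[i, j] = 0.25 * (geo + center + size + content)
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
    precision = len(matches) / max(len(pred), 1)
    recall = len(matches) / max(len(gt), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return matches, precision, recall, f1
```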
5. Robustness to Perturbation
Robustness is assessed via controlled perturbation along geometry, text, and style axes.
- Perturbation Synthesis: From “clean” seeds, perturbations are applied with controlled severity (Gaussian translation/scaling for geometry, character-level or box insertion/removal for text, font/color/style jitter for appearance).
- Manipulation Check: Adjacent-point agreement (POA_adj) and mean absolute calibration error (MACE) validate metric monotonicity with respect to perturbation severity.
- Sensitivity Metrics: Fidelity is quantified via Spearman’s ρ with respect to real perturbation severity; consistency is quantified as the rate at which quality scores strictly decrease with greater perturbation (Kang et al., 24 Oct 2025).
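The two sensitivity statistics reduce to simple computations over a perturbation ladder, sketched below: Spearman's ρ between severity and (negated) quality score for fidelity, and the fraction of adjacent severity steps at which the score strictly drops for consistency. The function name and the negation convention are illustrative.

```python
from scipy.stats import spearmanr

def sensitivity(severities, scores):
    """Fidelity and consistency of one metric along a perturbation ladder.

    severities: increasing perturbation magnitudes applied to a clean seed.
    scores: the metric's quality score at each severity (same order).
    """
    # Fidelity: quality should fall as severity rises, so correlate
    # severity against the negated score.
    rho, _ = spearmanr(severities, [-s for s in scores])
    # Consistency: share of adjacent steps where quality strictly decreases.
    drops = sum(1 for a, b in zip(scores, scores[1:]) if b < a)
    consistency = drops / (len(scores) - 1)
    return rho, consistency
```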
6. Empirical Guidelines and Human Alignment
SlidesGen-Bench and related protocols report extensive empirical findings and provide actionable thresholds:
- Empirical Results: SlidesGen-Bench achieves mean Spearman ρ=0.71 (σ=0.16, identical=32.6%) for human alignment on Slides-Align1.5k, outperforming LLM-as-Judge and PPTAgent (PPT-Eval) pipelines (Yang et al., 14 Jan 2026).
- Thresholds: For dependable pixel-accurate audits, element-extraction F1 ≥ 0.70 and parsed coverage ≥ 0.75 are required, with geometry error (1 − IoU) ≤ 0.50, style agreement ≥ 0.70, and a color-difference tolerance as specified in the source protocol (Kang et al., 24 Oct 2025).
- Practical Use: Scores above threshold trigger automated gating and iterative refinement in agentic pipelines; scores below threshold recommend human review or fallback to simplified templates (Kang et al., 24 Oct 2025).
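A sketch of the resulting gating rule; the numeric thresholds mirror the figures quoted above, while the report field names and the two-way outcome are illustrative assumptions.

```python
def gate(report):
    """Decide the next pipeline action from an evaluation report.

    Expected keys (illustrative names): extraction_f1, coverage,
    geometry_error (1 - IoU), style_score.
    """
    audit_ready = (
        report["extraction_f1"] >= 0.70
        and report["coverage"] >= 0.75
        and report["geometry_error"] <= 0.50
        and report["style_score"] >= 0.70
    )
    if audit_ready:
        return "accept_and_refine"       # continue the agentic refinement loop
    return "human_review_or_fallback"    # escalate or fall back to a simpler template
```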
7. Protocol Implementation and Reproducibility
End-to-end pipelines and artifacts are designed for reproducibility:
- Processing Stack: Input rendering is standardized via headless Office/HTML→PNG or web-viewer rasterization. Subsequent stages include OCR and layout extraction (PP-DocLayout_plus-L), computational metric calculation, and PEI analysis (for PPTX sources) (Yang et al., 14 Jan 2026). A rendering sketch follows this list.
- Software Release: Full codebases—including color harmony fitting, steerable pyramid decomposition, metric parameterizations, and human annotation UIs—are open source, with random seed control for determinism.
- Dataset Accessibility: Slides-Align1.5k, all annotations, and evaluation artifacts are publicly available for benchmarking and extension. Published parameter settings (e.g., for harmony, pacing, entropy) support transparent replication and metric tuning (Yang et al., 14 Jan 2026).
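A minimal sketch of the rendering stage referenced in the processing stack, assuming LibreOffice is available for headless PPTX→PDF conversion and pdf2image (Poppler) for rasterization at a fixed 960×540 target; the actual pipeline may use a different renderer, viewer, or resolution.

```python
import subprocess
from pathlib import Path
from pdf2image import convert_from_path  # requires Poppler

def render_deck(pptx_path, out_dir, width=960, height=540):
    """Render every slide of a PPTX deck to fixed-resolution PNG bitmaps.

    Assumes LibreOffice ('soffice') is on PATH; the renderer and resolution
    here are illustrative, not the protocol's exact stack.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # 1) Headless conversion to PDF yields one page per slide.
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(out), str(pptx_path)],
        check=True,
    )
    pdf_path = out / (Path(pptx_path).stem + ".pdf")
    # 2) Rasterize each page at the benchmark's fixed resolution.
    pages = convert_from_path(str(pdf_path), size=(width, height))
    paths = []
    for i, page in enumerate(pages):
        png = out / f"slide_{i:03d}.png"
        page.save(png)
        paths.append(png)
    return paths
```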
The SlidesGen-Bench Protocol realizes a reproducible, quantitatively rigorous, and visually native benchmarking framework for automated slide generation, underpinned by deterministic, visually grounded metrics and validated by human-alignment studies across diverse paradigms (Yang et al., 14 Jan 2026, Kang et al., 24 Oct 2025).