Slides-Align1.5k Benchmark for Slide Generation
- Slides-Align1.5k is a human preference-aligned benchmark that standardizes slide evaluations by correlating automated metrics with aggregated human judgments.
- It grounds assessments of content fidelity, aesthetic quality, and editability in roughly 1,500 pairwise comparisons spanning nine different generation systems.
- The dataset underpins reliable, reference-free evaluation of automated slide decks, guiding the advancement of computational metrics in slide generation.
Slides-Align1.5k is a human preference-aligned benchmark dataset introduced as part of the SlidesGen-Bench evaluation framework for automated slide generation systems. Designed to calibrate and validate computational, reference-free evaluation metrics across multidimensional axes—specifically Content, Aesthetics, and Editability—Slides-Align1.5k provides a rigorous ground truth based on aggregated human judgments. It encompasses outputs from nine state-of-the-art slide generation systems, evaluated across seven distinct presentation scenarios, with approximately 1,500 pairwise human preference assessments of generated slide decks. The resource serves as a key substrate for correlating and advancing quantitative slide evaluation methodologies (Yang et al., 14 Jan 2026).
1. Purpose and Scope
Slides-Align1.5k addresses a critical need for reliable and system-agnostic benchmarks in the evaluation of automatically generated slide decks. Prior approaches to benchmarking struggled with heterogeneity among generation methods and were often reliant on subjective or uncalibrated human judgments. Slides-Align1.5k implements a standardized, visually-grounded evaluation that supports universal comparison among template-driven, image-centric, and code-based generation paradigms. Its primary role is to underpin and calibrate computational metrics—specifically those for Content, Aesthetics, and Editability—enabling reproducible and generalizable assessment of slide generation systems.
2. Dataset Construction and Composition
2.1 Instruction and Scenario Curation
The dataset construction begins with the curation of instructions based on a pool of over 30,000 human-authored slide decks extracted from public repositories such as SlideShare. Decks were filtered to those containing 5–40 slides. Instructions were then derived by combining python-pptx text extraction with GPT-4o classification, yielding 94 topic-based and 95 purpose-based instructions, for 189 unique prompts in total. Each instruction was associated with a structured source document—comprising an outline and image captions—sourced from Wikipedia and curated articles to standardize input across systems.
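The curation pipeline itself is not released with the dataset description, but the text-extraction and length-filter step can be approximated with python-pptx; the sketch below is illustrative only (file paths and function names are assumptions, and the GPT-4o classification stage is omitted):

```python
from pathlib import Path
from pptx import Presentation  # pip install python-pptx

def extract_deck_text(path: str) -> list[str] | None:
    """Return per-slide text for decks with 5-40 slides, else None."""
    prs = Presentation(path)
    slides = list(prs.slides)
    if not 5 <= len(slides) <= 40:
        return None  # outside the benchmark's length filter
    slide_texts = []
    for slide in slides:
        chunks = [
            shape.text_frame.text
            for shape in slide.shapes
            if shape.has_text_frame and shape.text_frame.text.strip()
        ]
        slide_texts.append("\n".join(chunks))
    return slide_texts

# Illustrative usage over a local corpus of source decks.
corpus = [extract_deck_text(str(p)) for p in Path("decks").glob("*.pptx")]
```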
Seven presentation scenarios are represented:
- Brand Promotion
- Business Plan
- Course Preparation (Knowledge Teaching)
- Personal Statements
- Product Launch
- Work Report
- Topic Introduction
2.2 Slide Generation Systems
For each instruction, nine leading slide generation systems produced decks in .ppt(x) format or as rendered images. These systems span a range of paradigms:
- Gamma.ai (Template-based)
- Kimi-Standard (Template-based)
- Kimi-Smart (Template-based)
- Kimi-Banana (Image-centric)
- NotebookLM (Image-centric)
- Quark (Template-based)
- Skywork (HTML/CSS code-driven)
- Skywork-Banana (Image-centric)
- Zhipu (HTML/CSS code-driven)
2.3 Inclusion Criteria
Decks were retained if (i) generation succeeded with 5–40 slides, (ii) ≥1 image was included when specified, and (iii) no system failure or timeout occurred. This procedure yielded a set of 1,701 candidate decks.
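A minimal sketch of how these retention rules could be applied, assuming a hypothetical `DeckResult` record for each system-instruction run:

```python
from dataclasses import dataclass

@dataclass
class DeckResult:          # hypothetical record for one system x instruction run
    n_slides: int
    n_images: int
    failed: bool           # system error or timeout occurred
    requires_image: bool   # instruction asked for at least one image

def retained(deck: DeckResult) -> bool:
    """Apply the three inclusion criteria from Section 2.3."""
    if deck.failed:
        return False
    if not 5 <= deck.n_slides <= 40:
        return False
    if deck.requires_image and deck.n_images < 1:
        return False
    return True

# Toy usage; in the benchmark this filtering yielded 1,701 candidate decks.
results = [DeckResult(12, 3, False, True), DeckResult(2, 0, False, False)]
candidates = [d for d in results if retained(d)]   # keeps only the first deck
```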
3. Human Annotation Protocol
3.1 Annotation Interface and Task
A dedicated web-based ranking interface enabled annotators to assess all nine generated decks for a given instruction side-by-side in randomized order. Tasks required constructing a total order ranking of the nine systems, with evaluation guidelines emphasizing content fidelity, visual design, layout coherence, and perceived editability.
Annotator Instructions
Annotators were specifically directed: “Rank decks from most to least preferred. Consider whether critical information is preserved, whether slides are visually engaging and legible, and whether the structure feels professional.”
3.2 Quality Controls and Redundancy
Quality controls included:
- Insertion of “gold” control pairs (10% of items, pre-ranked by experts) to monitor annotator attentiveness.
- Exclusion of annotators registering <80% accuracy on control pairs.
- Redundancy with five independent annotators per instruction.
A total of 32 professional annotators performed 34,020 pairwise preference judgments (189 instructions × 36 pairs each, with 5 independent assessments per pair). Slides-Align1.5k is a stratified subset of 1,500 pairwise comparisons used for all alignment analyses.
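The judgment count follows from expanding each nine-way ranking into all C(9,2) = 36 unordered pairs; the sketch below illustrates this expansion (the system order shown is arbitrary, not an observed result):

```python
from itertools import combinations

def ranking_to_pairs(ranking: list[str]) -> list[tuple[str, str]]:
    """Expand a total order (best first) into winner/loser pairs."""
    pairs = []
    for i, j in combinations(range(len(ranking)), 2):
        pairs.append((ranking[i], ranking[j]))  # ranking[i] preferred over ranking[j]
    return pairs

systems = ["Gamma.ai", "Kimi-Standard", "Kimi-Smart", "Kimi-Banana",
           "NotebookLM", "Quark", "Skywork", "Skywork-Banana", "Zhipu"]
pairs = ranking_to_pairs(systems)        # 36 pairwise preferences per annotator
total = 189 * len(pairs) * 5             # 34,020 judgments across the study
```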
3.3 Inter-Annotator Agreement
- Kendall’s W (nine-way rankings): 0.67
- Average pairwise Cohen’s κ: 0.55
These metrics indicate moderate to substantial agreement, supporting the reliability of the aggregated human annotations.
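Kendall's W can be computed directly from the five nine-way rankings collected per instruction; a minimal sketch, assuming a rank matrix with one row per annotator (pairwise Cohen's κ could analogously be computed with scikit-learn's cohen_kappa_score):

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an (m annotators x n systems) rank matrix."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Toy example: 5 annotators each ranking 9 systems (1 = best, 9 = worst).
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(9) + 1 for _ in range(5)])
print(kendalls_w(ranks))
```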
4. Evaluation Metrics and Correlation Analysis
Human-aligned quantitative evaluation leverages the Slides-Align1.5k judgments to assess computational metrics via standard statistical correlations:
4.1 Rank Correlation Metrics
For each instruction $i$, let $\mathbf{s}_i = (s_{i,1}, \dots, s_{i,9})$ denote the vector of automated metric scores across all nine systems, and $\mathbf{r}_i = (r_{i,1}, \dots, r_{i,9})$ the corresponding human-generated ranks (1 = best, 9 = worst).
- Pearson Correlation: $\rho_P(i) = \dfrac{\operatorname{cov}(\mathbf{s}_i, \mathbf{r}_i)}{\sigma_{\mathbf{s}_i}\,\sigma_{\mathbf{r}_i}}$. Alignment is reported as the average $\rho_P$ across instructions.
- Spearman Correlation: $\rho_S(i) = 1 - \dfrac{6 \sum_{j} d_{i,j}^2}{n(n^2 - 1)}$, where $d_{i,j}$ is the difference between the metric-induced and human ranks of system $j$ and $n = 9$. Overall: average $\rho_S$ and standard deviation across instructions.
- Kendall's Tau: $\tau(i) = \dfrac{C_i - D_i}{\binom{n}{2}}$, where $C_i$ and $D_i$ enumerate concordant and discordant pairs.
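In practice these statistics can be computed per instruction with scipy; a sketch under the assumption that higher metric scores indicate better decks (the "Identical" criterion used in the table below is included for completeness):

```python
import numpy as np
from scipy import stats

def alignment(metric_scores: np.ndarray, human_ranks: np.ndarray) -> dict:
    """Correlate automated scores with human ranks for one instruction.

    Higher scores mean better decks while rank 1 is best, so the rank
    vector is negated before computing Pearson/Spearman/Kendall.
    """
    target = -human_ranks.astype(float)
    pearson, _ = stats.pearsonr(metric_scores, target)
    spearman, _ = stats.spearmanr(metric_scores, target)
    kendall, _ = stats.kendalltau(metric_scores, target)
    # "Identical": does the metric-induced ranking match the human ranking exactly?
    induced = stats.rankdata(-metric_scores, method="ordinal")
    identical = bool(np.array_equal(induced, human_ranks))
    return {"pearson": pearson, "spearman": spearman,
            "kendall": kendall, "identical": identical}

# Toy example with nine systems.
scores = np.array([0.9, 0.7, 0.8, 0.4, 0.6, 0.3, 0.5, 0.2, 0.1])
ranks  = np.array([1, 3, 2, 6, 4, 7, 5, 8, 9])
print(alignment(scores, ranks))
```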
4.2 Comparative Results
| Method | Avg Spearman ρ ↑ | Std(ρ) ↓ | Avg Identical ↑ |
|---|---|---|---|
| SlidesGen-Bench (ours) | 0.71 | 0.16 | 32.6% |
| LLM-as-Judge (Rating) | 0.57 | 0.23 | 20.7% |
| PPTAgent (E-lo) | 0.53 | 0.26 | 17.8% |
| LLM-as-Judge (Arena) | 0.52 | 0.27 | 17.3% |
| Humans (upper bound) | 0.85 | 0.12 | 45.3% |
“Identical” denotes the proportion of instructions for which the metric-induced system ranking exactly matches the human ranking.
5. Alignment with Individual and Composite Computational Metrics
5.1 Content Fidelity
Content fidelity was assessed using QuizBank-based LLM “open-book” accuracy, correlating with human rankings at average Spearman ρ ≈ 0.68. This high alignment underscores content preservation as a strong factor in human preference.
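The QuizBank protocol itself is not reproduced here; the sketch below only illustrates the general idea of open-book quiz accuracy, with `ask_llm` as a placeholder for whichever chat-completion client is used and a deliberately naive substring check for answer matching:

```python
def openbook_accuracy(deck_text: str, quiz: list[dict], ask_llm) -> float:
    """Fraction of quiz questions an LLM answers correctly when given the deck.

    `quiz` items are {"question": ..., "answer": ...}; `ask_llm(prompt)` is a
    placeholder callable returning the model's answer as a string.
    """
    correct = 0
    for item in quiz:
        prompt = (
            "Answer using only the slide content below.\n\n"
            f"SLIDES:\n{deck_text}\n\nQUESTION: {item['question']}"
        )
        prediction = ask_llm(prompt).strip().lower()
        if item["answer"].strip().lower() in prediction:
            correct += 1
    return correct / len(quiz)
```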
5.2 Aesthetic Metrics
An ablation study quantified the contribution of individual and composite aesthetic criteria:
| Configuration | Avg ρ ↑ | Std ρ ↓ | Identical ↑ |
|---|---|---|---|
| Only Engagement | 0.224 | 0.349 | 14.8% |
| Only Harmony | 0.312 | 0.414 | 15.6% |
| Only Usability | 0.574 | 0.207 | 21.5% |
| Only VisualHRV | 0.618 | 0.198 | 24.4% |
| Usability+Harmony+Engagement | 0.667 | 0.206 | 24.4% |
| Full Method (All) | 0.710 | 0.160 | 32.6% |
Maximum alignment is achieved with all features (harmony, colorfulness, contrast, visual rhythm) combined.
5.3 Editability
Editability, assessed using PEI (Presentation Editability Index) levels from L0–L5, correlates with human judgments at Kendall τ ≈ 0.41. This moderate alignment suggests that the index only partially captures editability’s role in perceived quality.
6. Significance and Limitations
Slides-Align1.5k provides a comprehensive, multi-system, multi-domain human preference standard for the development and calibration of slide generation evaluation metrics. The dataset demonstrates that integrated computational metrics—encompassing content fidelity, aesthetics, and editability—achieve substantially higher alignment with human preference than prior LLM-only or heuristic evaluators. The gap between editability metrics and human alignment indicates the need for future refinement, as current computational proxies only partially capture the structural usability considered by annotators.
A plausible implication is that further enrichment of editability and higher-order design proxies could increase the predictive value of automated metrics relative to human quality judgments.
7. Impact and Future Directions
Slides-Align1.5k is foundational for research on reliable, reference-free evaluation of heterogeneous slide generation systems. It has provided the empirical basis for the evaluation pipeline of SlidesGen-Bench (Yang et al., 14 Jan 2026), and the observed robust correlation with human preferences validates its role as a gold-standard benchmark. Ongoing and future work will likely refine computational editability metrics, expand scenario coverage, and address remaining alignment gaps through deeper modeling of human layout and usability preferences.