Slides-Align1.5k Benchmark for Slide Generation
- Slides-Align1.5k is a human preference-aligned benchmark that standardizes slide evaluations by correlating automated metrics with aggregated human judgments.
- It grounds assessments of content fidelity, aesthetic quality, and editability in roughly 1,500 pairwise comparisons spanning nine different generation systems.
- The dataset underpins reliable, reference-free evaluation of automated slide decks, guiding the advancement of computational metrics in slide generation.
Slides-Align1.5k is a human preference-aligned benchmark dataset introduced as part of the SlidesGen-Bench evaluation framework for automated slide generation systems. Designed to calibrate and validate computational, reference-free evaluation metrics across multidimensional axes—specifically Content, Aesthetics, and Editability—Slides-Align1.5k provides a rigorous ground truth based on aggregated human judgments. It encompasses outputs from nine state-of-the-art slide generation systems, evaluated across seven distinct presentation scenarios, with approximately 1,500 pairwise human preference assessments of generated slide decks. The resource serves as a key substrate for correlating and advancing quantitative slide evaluation methodologies (Yang et al., 14 Jan 2026).
1. Purpose and Scope
Slides-Align1.5k addresses a critical need for reliable and system-agnostic benchmarks in the evaluation of automatically generated slide decks. Prior approaches to benchmarking struggled with heterogeneity among generation methods and were often reliant on subjective or uncalibrated human judgments. Slides-Align1.5k implements a standardized, visually-grounded evaluation that supports universal comparison among template-driven, image-centric, and code-based generation paradigms. Its primary role is to underpin and calibrate computational metrics—specifically those for Content, Aesthetics, and Editability—enabling reproducible and generalizable assessment of slide generation systems.
2. Dataset Construction and Composition
2.1 Instruction and Scenario Curation
The dataset construction begins with the curation of instructions based on a pool of over 30,000 human-authored slide decks extracted from public repositories such as SlideShare. Decks were filtered to those containing 5–40 slides. Instructions were then derived by combining python-pptx text extraction with GPT-4o classification, yielding 94 topic-based and 95 purpose-based instructions, for 189 unique prompts in total. Each instruction was associated with a structured source document—comprising an outline and image captions—sourced from Wikipedia and curated articles to standardize input across systems.
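The curation pipeline itself is not released with the dataset description, but the text-extraction and length-filter step can be approximated with python-pptx; the sketch below is illustrative only (file paths and function names are assumptions, and the GPT-4o classification stage is omitted):

```python
from pathlib import Path
from pptx import Presentation  # pip install python-pptx

def extract_deck_text(path: str) -> list[str] | None:
    """Return per-slide text for decks with 5-40 slides, else None."""
    prs = Presentation(path)
    slides = list(prs.slides)
    if not 5 <= len(slides) <= 40:
        return None  # outside the benchmark's length filter
    slide_texts = []
    for slide in slides:
        chunks = [
            shape.text_frame.text
            for shape in slide.shapes
            if shape.has_text_frame and shape.text_frame.text.strip()
        ]
        slide_texts.append("\n".join(chunks))
    return slide_texts

# Illustrative usage over a local corpus of source decks.
corpus = [extract_deck_text(str(p)) for p in Path("decks").glob("*.pptx")]
```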
Seven presentation scenarios are represented:
- Brand Promotion
- Business Plan
- Course Preparation (Knowledge Teaching)
- Personal Statements
- Product Launch
- Work Report
- Topic Introduction
2.2 Slide Generation Systems
For each instruction, nine leading slide generation systems produced decks in .ppt(x) format or as rendered images. These systems span a range of paradigms:
- Gamma.ai (Template-based)
- Kimi-Standard (Template-based)
- Kimi-Smart (Template-based)
- Kimi-Banana (Image-centric)
- NotebookLM (Image-centric)
- Quark (Template-based)
- Skywork (HTML/CSS code-driven)
- Skywork-Banana (Image-centric)
- Zhipu (HTML/CSS code-driven)
2.3 Inclusion Criteria
Decks were retained if (i) generation succeeded with 5–40 slides, (ii) ≥1 image was included when specified, and (iii) no system failure or timeout occurred. This procedure yielded a set of 1,701 candidate decks.
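A minimal sketch of how these retention rules could be applied, assuming a hypothetical `DeckResult` record for each system-instruction run:

```python
from dataclasses import dataclass

@dataclass
class DeckResult:          # hypothetical record for one system x instruction run
    n_slides: int
    n_images: int
    failed: bool           # system error or timeout occurred
    requires_image: bool   # instruction asked for at least one image

def retained(deck: DeckResult) -> bool:
    """Apply the three inclusion criteria from Section 2.3."""
    if deck.failed:
        return False
    if not 5 <= deck.n_slides <= 40:
        return False
    if deck.requires_image and deck.n_images < 1:
        return False
    return True

# Toy usage; in the benchmark this filtering yielded 1,701 candidate decks.
results = [DeckResult(12, 3, False, True), DeckResult(2, 0, False, False)]
candidates = [d for d in results if retained(d)]   # keeps only the first deck
```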
3. Human Annotation Protocol
3.1 Annotation Interface and Task
A dedicated web-based ranking interface enabled annotators to assess all nine generated decks for a given instruction side-by-side in randomized order. Tasks required constructing a total order ranking of the nine systems, with evaluation guidelines emphasizing content fidelity, visual design, layout coherence, and perceived editability.
Annotator Instructions
Annotators were specifically directed: “Rank decks from most to least preferred. Consider whether critical information is preserved, whether slides are visually engaging and legible, and whether the structure feels professional.”
3.2 Quality Controls and Redundancy
Quality controls included:
- Insertion of “gold” control pairs (10% of items, pre-ranked by experts) to monitor annotator attentiveness.
- Exclusion of annotators registering <80% accuracy on control pairs.
- Redundancy with five independent annotators per instruction.
A total of 32 professional annotators performed 34,020 pairwise preference judgments (189 instructions × 36 pairs each, with 5 independent assessments per pair). Slides-Align1.5k is a stratified subset of 1,500 pairwise comparisons used for all alignment analyses.
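The judgment count follows from expanding each nine-way ranking into all C(9,2) = 36 unordered pairs; the sketch below illustrates this expansion (the system order shown is arbitrary, not an observed result):

```python
from itertools import combinations

def ranking_to_pairs(ranking: list[str]) -> list[tuple[str, str]]:
    """Expand a total order (best first) into winner/loser pairs."""
    pairs = []
    for i, j in combinations(range(len(ranking)), 2):
        pairs.append((ranking[i], ranking[j]))  # ranking[i] preferred over ranking[j]
    return pairs

systems = ["Gamma.ai", "Kimi-Standard", "Kimi-Smart", "Kimi-Banana",
           "NotebookLM", "Quark", "Skywork", "Skywork-Banana", "Zhipu"]
pairs = ranking_to_pairs(systems)        # 36 pairwise preferences per annotator
total = 189 * len(pairs) * 5             # 34,020 judgments across the study
```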
3.3 Inter-Annotator Agreement
- Kendall’s W (nine-way rankings): 0.67
- Average pairwise Cohen’s κ: 0.55
These metrics indicate moderate to substantial agreement, supporting the reliability of the aggregated human annotations.
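Kendall's W can be computed directly from the five nine-way rankings collected per instruction; a minimal sketch, assuming a rank matrix with one row per annotator (pairwise Cohen's κ could analogously be computed with scikit-learn's cohen_kappa_score):

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an (m annotators x n systems) rank matrix."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Toy example: 5 annotators each ranking 9 systems (1 = best, 9 = worst).
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(9) + 1 for _ in range(5)])
print(kendalls_w(ranks))
```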
4. Evaluation Metrics and Correlation Analysis
Human-aligned quantitative evaluation leverages the Slides-Align1.5k judgments to assess computational metrics via standard statistical correlations:
4.1 Rank Correlation Metrics
For each instruction $i$, let $\mathbf{s}_i = (s_{i,1}, \dots, s_{i,9})$ denote the vector of automated metric scores across all nine systems, and $\mathbf{r}_i = (r_{i,1}, \dots, r_{i,9})$ the corresponding human-generated ranks (1 = best, 9 = worst).
- Pearson Correlation: $\rho_P(i) = \dfrac{\operatorname{cov}(\mathbf{s}_i, \mathbf{r}_i)}{\sigma_{\mathbf{s}_i}\,\sigma_{\mathbf{r}_i}}$. Alignment is reported as the average $\rho_P$ across instructions.
- Spearman Correlation: $\rho_S(i) = 1 - \dfrac{6 \sum_{j} d_{i,j}^2}{n(n^2 - 1)}$, where $d_{i,j}$ is the difference between the metric-induced and human ranks of system $j$ and $n = 9$. Overall: average $\rho_S$ and standard deviation across instructions.
- Kendall's Tau: $\tau(i) = \dfrac{C_i - D_i}{\binom{n}{2}}$, where $C_i$ and $D_i$ enumerate concordant and discordant pairs.
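In practice these statistics can be computed per instruction with scipy; a sketch under the assumption that higher metric scores indicate better decks (the "Identical" criterion used in the table below is included for completeness):

```python
import numpy as np
from scipy import stats

def alignment(metric_scores: np.ndarray, human_ranks: np.ndarray) -> dict:
    """Correlate automated scores with human ranks for one instruction.

    Higher scores mean better decks while rank 1 is best, so the rank
    vector is negated before computing Pearson/Spearman/Kendall.
    """
    target = -human_ranks.astype(float)
    pearson, _ = stats.pearsonr(metric_scores, target)
    spearman, _ = stats.spearmanr(metric_scores, target)
    kendall, _ = stats.kendalltau(metric_scores, target)
    # "Identical": does the metric-induced ranking match the human ranking exactly?
    induced = stats.rankdata(-metric_scores, method="ordinal")
    identical = bool(np.array_equal(induced, human_ranks))
    return {"pearson": pearson, "spearman": spearman,
            "kendall": kendall, "identical": identical}

# Toy example with nine systems.
scores = np.array([0.9, 0.7, 0.8, 0.4, 0.6, 0.3, 0.5, 0.2, 0.1])
ranks  = np.array([1, 3, 2, 6, 4, 7, 5, 8, 9])
print(alignment(scores, ranks))
```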
4.2 Comparative Results
| Method | Avg Spearman ρ ↑ | Std(ρ) ↓ | Avg Identical ↑ |
|---|---|---|---|
| SlidesGen-Bench (ours) | 0.71 | 0.16 | 32.6% |
| LLM-as-Judge (Rating) | 0.57 | 0.23 | 20.7% |
| PPTAgent (E-lo) | 0.53 | 0.26 | 17.8% |
| LLM-as-Judge (Arena) | 0.52 | 0.27 | 17.3% |
| Humans (upper bound) | 0.85 | 0.12 | 45.3% |
“Identical” denotes the proportion of instructions for which the metric-induced system ranking exactly matches the human ranking.
5. Alignment with Individual and Composite Computational Metrics
5.1 Content Fidelity
Content fidelity was assessed using QuizBank-based LLM “open-book” accuracy, correlating with human rankings at average Spearman ρ ≈ 0.68. This high alignment underscores content preservation as a strong factor in human preference.
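The QuizBank protocol itself is not reproduced here; the sketch below only illustrates the general idea of open-book quiz accuracy, with `ask_llm` as a placeholder for whichever chat-completion client is used and a deliberately naive substring check for answer matching:

```python
def openbook_accuracy(deck_text: str, quiz: list[dict], ask_llm) -> float:
    """Fraction of quiz questions an LLM answers correctly when given the deck.

    `quiz` items are {"question": ..., "answer": ...}; `ask_llm(prompt)` is a
    placeholder callable returning the model's answer as a string.
    """
    correct = 0
    for item in quiz:
        prompt = (
            "Answer using only the slide content below.\n\n"
            f"SLIDES:\n{deck_text}\n\nQUESTION: {item['question']}"
        )
        prediction = ask_llm(prompt).strip().lower()
        if item["answer"].strip().lower() in prediction:
            correct += 1
    return correct / len(quiz)
```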
5.2 Aesthetic Metrics
An ablation study quantified the contribution of individual and composite aesthetic criteria:
| Configuration | Avg ρ ↑ | Std ρ ↓ | Identical ↑ |
|---|---|---|---|
| Only Engagement | 0.224 | 0.349 | 14.8% |
| Only Harmony | 0.312 | 0.414 | 15.6% |
| Only Usability | 0.574 | 0.207 | 21.5% |
| Only VisualHRV | 0.618 | 0.198 | 24.4% |
| Usability+Harmony+Engagement | 0.667 | 0.206 | 24.4% |
| Full Method (All) | 0.710 | 0.160 | 32.6% |
Maximum alignment is achieved with all features (harmony, colorfulness, contrast, visual rhythm) combined.
5.3 Editability
Editability, assessed using PEI (Presentation Editability Index) levels from L0–L5, correlates with human judgments at Kendall τ ≈ 0.41. This moderate alignment suggests that the index only partially captures editability’s role in perceived quality.
6. Significance and Limitations
Slides-Align1.5k provides a comprehensive, multi-system, multi-domain human preference standard for the development and calibration of slide generation evaluation metrics. The dataset demonstrates that integrated computational metrics—encompassing content fidelity, aesthetics, and editability—achieve substantially higher alignment with human preference than prior LLM-only or heuristic evaluators. The gap between editability metrics and human alignment indicates the need for future refinement, as current computational proxies only partially capture the structural usability considered by annotators.
A plausible implication is that further enrichment of editability and higher-order design proxies could increase the predictive value of automated metrics relative to human quality judgments.
7. Impact and Future Directions
Slides-Align1.5k is foundational for research on reliable, reference-free evaluation of heterogeneous slide generation systems. It has provided the empirical basis for the evaluation pipeline of SlidesGen-Bench (Yang et al., 14 Jan 2026), and the observed robust correlation with human preferences validates its role as a gold-standard benchmark. Ongoing and future work will likely refine computational editability metrics, expand scenario coverage, and address remaining alignment gaps through deeper modeling of human layout and usability preferences.