
ViStoryBench: Story Visualization Benchmark

Updated 25 January 2026
  • ViStoryBench is a comprehensive evaluation benchmark for story visualization models, assessing visual coherence and narrative alignment via structured shot scripts.
  • It rigorously tests models on character consistency, prompt adherence, and stylistic fidelity using diverse narrative genres and multi-shot story segments.
  • The benchmark supports both full and lite splits, enabling zero-shot evaluations and standardized comparisons across generative storytelling systems.

ViStoryBench is a comprehensive evaluation benchmark for story visualization models. It targets the generation of a sequence of visually coherent images that faithfully align with a provided narrative and, when relevant, with reference images of characters. The benchmark supports a multifaceted evaluation of generative models across narrative genres, visual aesthetics, and structural complexity. ViStoryBench was designed to expose models to a diverse suite of story types and to challenge them on critical dimensions such as character consistency, style adherence, and prompt alignment, providing rigorous quantitative and qualitative assessments of model performance (Zhuang et al., 30 May 2025).

1. Dataset Construction and Scope

ViStoryBench comprises 80 curated story segments sourced from a wide spectrum of narrative forms, including film/TV scripts, novels, picture books, legends, and folktales. The selection encompasses:

  • 13 folktales, 11 fairy tales, 10 love stories, 10 social-life, 10 fantasy, 7 science-fiction, 6 historical, 4 suspense/crime, 3 horror, 3 war, 3 adventure/survival stories.
  • Stories range from 4 to 30 shots each (mean 16.5 shots/story) to cover both concise and complex narrative arcs.
  • Both single-protagonist and multi-protagonist plots are incorporated to stress-test character consistency and model capability under variable narrative load.

Character reference images are provided for 344 entities (190 real humans, 135 virtual/anime, 19 non-humans), associated with 509 images (1–10 images per character, 89 with ≥2). Gender distribution is 210 male, 108 female, and 26 unspecified or non-binary. Story styles are evenly split (39 realistic, 41 stylized), spanning live-action, CG stills, hand-drawn anime, and stylized 3D renderings.

2. Curation Methodology and Design Principles

Each story is provided as a structured shot script. Every shot includes:

  • Plot correspondence (mapping to specific story lines)
  • Setting description (temporal, spatial, and environmental context)
  • On-stage character specifications (0–3 per shot)
  • Static shot descriptions (poses, expressions)
  • Shot perspective parameters (distance, angle, shot type)
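As a concrete illustration, a single shot entry of this form could be encoded as below. This is a hypothetical sketch: the field names are illustrative, not the benchmark's actual schema.

```python
# Hypothetical encoding of one ViStoryBench shot entry.
# Field names are illustrative, not the benchmark's actual schema.
shot = {
    "plot_correspondence": "Story lines 12-15: the heroine enters the forest.",
    "setting": "Dusk; dense pine forest at the village edge; light fog.",
    "on_stage_characters": ["Elena"],  # 0-3 on-stage characters per shot
    "static_description": "Elena pauses mid-step, glancing back anxiously.",
    "perspective": {"distance": "medium", "angle": "low", "type": "tracking"},
}
```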

This structure supports precise evaluation of models on narrative coherence, geographical and visual consistency, and character fidelity. Reference images are manually selected or SDXL-generated to enforce intra-story style uniformity.

To ensure visual and narrative diversity, the dataset alternates between simple interiors and elaborate fantasy/sci-fi environments, a mix of cast sizes, and varied lighting or environmental settings (e.g., day/night, indoor/outdoor). Shot script generation is LLM-assisted (Step-1V) to ensure visual granularity, balance of shot types, and prompt precision.

3. Evaluation Metrics

ViStoryBench defines a suite of automatic metrics, all normalized to a 0–100 scale (GPT-rated 0–4 scores are rescaled to this range). The key metrics are:

| Metric | Purpose | Brief Description |
| --- | --- | --- |
| CIDS | Character identification similarity | Cosine similarity between reference and generated character features (ArcFace or CLIP); both cross-image and self-similarity variants are computed. |
| CSD | Style similarity (CSD-CLIP) | Consistency of style (color, lighting) across generated images and against references, using CLIP-based embeddings. |
| Alignment | Prompt adherence (GPT-4.1-rated) | Consistency of the generated output with the static shot description, perspective, setting, and per-character actions. |
| OCCM | On-stage character count matching | Penalizes extra or missing characters: $\mathrm{OCCM} = 100\,\exp\!\left(-\frac{|D-E|}{\epsilon + E}\right)$, where $D$ is the detected count and $E$ the expected count. |
| Copy-Paste | Detects direct reuse of reference images | Measures the similarity gap between output-vs-reference and output-vs-alternate-reference for single-image methods. |
| IS | Inception Score | Measures image diversity and recognizability. |
| Aesthetic | Predicts subjective image appeal | V2.5 aesthetic predictor score, rescaled to 0–100. |
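The OCCM formula is straightforward to implement. A minimal sketch follows; the value of ε is an assumption here, since the paper's exact choice is not stated in this summary.

```python
import math

def occm(detected: int, expected: int, eps: float = 1e-6) -> float:
    """On-stage Character Count Matching: 100 * exp(-|D - E| / (eps + E)).

    eps guards against division by zero when the expected count E is 0;
    its value here is an assumption, not taken from the paper.
    """
    return 100.0 * math.exp(-abs(detected - expected) / (eps + expected))
```

An exact count match scores 100; one extra character against two expected yields roughly 100·exp(−1/2) ≈ 61, and the penalty grows with the relative mismatch.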

Alignment is the mean (rescaled) of four GPT-4.1-rated subcategories: global character interaction, shooting method, scene description, and per-character action. Metrics are computed with open-source scripts: Grounding DINO for character cropping, CLIP/ArcFace for feature extraction, and API calls to GPT-4.1 for alignment ratings.

Automated metric reliability is validated against human judgments, with significant correlation: Style (Kendall’s τ=0.42, Spearman ρ=0.56), CIDS (τ=0.50, ρ=0.68), Aesthetic (τ=0.26, ρ=0.40), indicating that automated metrics are informative proxies for subjective evaluation.
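For context, Kendall's τ counts concordant versus discordant pairs of rankings. A minimal pure-Python version (τ-a, ignoring ties), applied to made-up illustrative scores, not benchmark data:

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; no tie handling."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Illustrative (made-up) per-model scores: automated metric vs. mean human rating.
auto_scores  = [62, 71, 55, 80, 68]
human_scores = [3.1, 3.8, 2.9, 4.2, 3.5]
tau = kendall_tau(auto_scores, human_scores)  # 1.0 here: the rankings agree perfectly
```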

4. Usage Protocols and Benchmark Execution

ViStoryBench provides both “Full” (80 stories, 1317 shots) and “Lite” (20 stories, style/text-matched) splits. The benchmark is for zero-shot or fine-tuned evaluation—no fixed train/validation split is prescribed. Models may leverage any external training data, but evaluation and comparison are always performed on the provided splits.

The standardized evaluation pipeline consists of:

  1. For each story and shot, provide the model with the structured shot script and relevant character reference images.
  2. Generate one image per shot at 16:9 aspect ratio (e.g., 1920×1080).
  3. Place outputs in a designated folder structure.
  4. Run the provided evaluation scripts, which compute all 12 metrics and collate per-metric and average-rank scores.
  5. Optionally submit results for leaderboard ranking or online comparison.
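The loop implied by steps 1–4 might look as follows. This is a hypothetical sketch: `model.generate`, the story fields, and the folder layout are placeholders, not the benchmark's actual API; consult the repository for the expected structure.

```python
from pathlib import Path

def run_benchmark(model, stories, out_dir="outputs/my_model"):
    """Generate one 16:9 image per shot and write it into a per-story folder.

    `model`, the `stories` fields, and the directory layout are illustrative
    placeholders; see the ViStoryBench repository for the real conventions.
    """
    for story in stories:
        story_dir = Path(out_dir) / story["id"]
        story_dir.mkdir(parents=True, exist_ok=True)
        for i, shot in enumerate(story["shots"]):
            image = model.generate(
                script=shot,                         # structured shot script
                references=story["character_refs"],  # character reference images
                size=(1920, 1080),                   # 16:9, per the protocol
            )
            image.save(story_dir / f"shot_{i:02d}.png")
```

The evaluation scripts are then pointed at `out_dir` to compute the metric suite.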

The codebase (https://github.com/vistorybench/vistorybench), dataset (https://huggingface.co/datasets/ViStoryBench/ViStoryBench), and an interactive Story Explorer are openly available (Zhuang et al., 30 May 2025).

5. Baseline Results and Comparative Insights

Baseline results illustrate trade-offs between open-source, proprietary, and trivial-copying approaches:

  • Copy-Paste Baseline: Attains near-perfect CIDS (≈96), CSD self-similarity (≈98), OCCM (≈100), but fails on alignment (≈26) and has poor IS (≈6.7), Aesthetic (≈4.48). High visual/style identity does not entail narrative correctness.
  • UNO (open-source, many2many): Balances performance with CIDS_self (≈67), CIDS_cross (≈39), CSD_self (≈61), CSD_cross (≈41), Alignment (≈70), OCCM (≈90), IS (≈12.4), Aesthetic (≈5.23). Lacks copy-paste artefacts and demonstrates multi-shot coherence.
  • StoryDiffusion (text-only): Achieves highest generic image quality (IS ≈15.7, Aesthetic ≈5.76) but weaker on character/identity metrics (CIDS_cross ≈30), confirming text-only methods' deficiency in specific character fidelity.
  • GPT-4o (commercial LLM, image-ref): Leads in Alignment (≈89), OCCM (≈93), with strong CIDS_cross (≈53), CIDS_self (≈73), Aesthetic (≈5.52). Proprietary MLLMs maintain prompt adherence and character accuracy.
  • Doubao (commercial): High Alignment (≈84), CIDS metrics moderately strong. Similar trends as GPT-4o.

Findings confirm that a multi-dimensional metric suite is required: models excelling only in style/identity or only in prompt adherence fail on comprehensive benchmarks. Proprietary MLLMs currently set the standard for adherence and accuracy, though open-source alternatives such as UNO provide robust, reproducible baselines.

6. Open Problems and Future Directions

ViStoryBench identifies several challenges and avenues for further research:

  • Long-range narrative and visual coherence: Improving self-similarity (CIDS/CSD) across entire stories without losing shot-level prompt accuracy.
  • Extension to multi-modal outputs and video: While currently image-based, the underlying evaluation philosophy is adaptable to video and temporally extended story generation.
  • Advanced evaluation metrics: Exploration of learned or hybrid (vision-language) metrics with higher correlation to human preferences, especially concerning narrative and character continuity.
  • Ethical and fairness considerations: Addressing potential biases when scaling benchmarks with web-scraped corpora or synthetic augmentation.
  • Efficient annotation strategies: Semi-automatic and active learning protocols may accelerate dataset extension and curation, especially for high-complexity or long-form narratives.

ViStoryBench provides a rigorously defined suite for benchmarking story visualization models, supporting systematic progress in visually grounded narrative generation research (Zhuang et al., 30 May 2025).
