ConStoryBoard Dataset for Cinematic Storyboarding
- ConStoryBoard is a large-scale, shot-level annotated dataset that supports storyboard-anchored video synthesis with detailed cinematic attributes.
- It employs automated shot boundary detection via TransNetV2 and annotation via InternVL-3.5, with standardized preprocessing to ensure high-quality, high-resolution movie clips.
- The dataset structure enables both supervised and preference-based modeling of shot transitions, with benchmarks showing around 85% human-preference alignment.
ConStoryBoard is a large-scale, structurally annotated dataset of movie shots developed for the training and evaluation of start–end frame–pair prediction models within the context of cinematic multi-shot video generation. Created in support of the STAGE (SToryboard-Anchored Generation) workflow, ConStoryBoard provides high-resolution movie clip samples paired with fine-grained annotations for story progression, cinematic attributes, and human preferences, thereby enabling supervised and preference-based modeling of storyboard-anchored narrative video synthesis (Zhang et al., 13 Dec 2025).
1. Dataset Scope, Scale, and Source
ConStoryBoard comprises 101,000 shot-level samples, partitioned into 100,000 for training and 1,000 for held-out testing. Each sample consists of a single, contiguous shot excerpted from a longer movie sequence. The dataset's raw material is sourced from the open-source Condensed Movies collection (Bain et al., 2020), filtered to include only high-resolution (at least 1080p) shots with LAION aesthetic scores exceeding 5.5.
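The two admission criteria translate into a simple predicate. The sketch below is illustrative only: it assumes the shot height and a precomputed LAION aesthetic score are already available, and the `ShotCandidate`/`keep_shot` names are hypothetical, since the paper does not describe the surrounding tooling.

```python
# Minimal filtering sketch (hypothetical helpers; the release does not
# specify the exact tooling around the LAION aesthetic predictor).
from dataclasses import dataclass

MIN_HEIGHT = 1080          # keep only shots at 1080p or higher
MIN_AESTHETIC_SCORE = 5.5  # LAION aesthetic score threshold

@dataclass
class ShotCandidate:
    path: str
    height: int
    aesthetic_score: float  # assumed precomputed with the LAION predictor

def keep_shot(shot: ShotCandidate) -> bool:
    """Apply the resolution and aesthetic-quality filters described above."""
    return shot.height >= MIN_HEIGHT and shot.aesthetic_score > MIN_AESTHETIC_SCORE

candidates = [
    ShotCandidate("shot_000001.mp4", 1080, 6.1),
    ShotCandidate("shot_000002.mp4", 720, 6.8),   # dropped: below 1080p
]
kept = [s for s in candidates if keep_shot(s)]
```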
Each shot is associated with its structural storyboard specification, consisting of start and end frame pairs and a structured annotation tuple ⟨Dᵢ, Cᵢ⟩, where Dᵢ is a natural-language description and Cᵢ encodes metadata about cinematic properties. The dataset thereby supports predictive modeling of shot-level transitions anchored to explicit storyboard representations.
2. Annotation Schema
Annotations in ConStoryBoard operate exclusively at the shot level. The core schema consists of:
- Story progression description Dᵢ: Natural-language summary of the depicted action, generated automatically via InternVL-3.5.
- Cinematic attributes Cᵢ: Structured metadata including shot scale (close-up, medium, long), camera angle (high, eye, low), camera movement (static, pan, tilt, zoom), and shot length in frames.
- Human-preference judgments (ConStoryBoard-HP): For a curated subset of approximately 5,000 high-quality shots, each ground-truth start–end pair is paired with a negative sample (two frames sampled from the shot's intermediate frames), forming a preference tuple of one positive and one negative frame pair (a construction sketch follows the JSON examples below).
All annotations are serialized as structured JSON or CSV/JSONL. An example shot-level JSON annotation:
```json
{
  "shot_id": 1234,
  "start_frame": "F1234_S.png",
  "end_frame": "F1234_E.png",
  "description": "A woman closes the door behind her and scans the empty room.",
  "attributes": { "scale": "medium", "angle": "eye", "movement": "pan", "length_frames": 48 }
}
```
Preference tuples are provided as:
```json
{"shot_id": 1234, "positive_pair": ["F1234_S.png", "F1234_E.png"], "negative_pair": ["F1234_M1.png", "F1234_M2.png"]}
```
3. Data Collection and Preprocessing Pipeline
The construction of ConStoryBoard is characterized by a standardized preprocessing workflow:
- Shot boundary detection: TransNetV2 is employed to identify precise shot boundaries, extracting start and end frame indices.
- Annotation generation: InternVL-3.5 generates the story progression description (Dᵢ) and the cinematic attribute metadata (Cᵢ).
- Keyframe extraction: For each shot, the first and last frames are extracted as the start and end keyframes (the `_S` and `_E` files in the naming convention).
- Post-processing: Cropping routines remove black bars, and watermark removal is performed using FFmpeg and SoraWatermarkCleaner.
A plausible implication is that this rigorous preprocessing yields high-quality, uniform input for downstream modeling with minimal confounding artifacts.
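A minimal end-to-end sketch of this pipeline is given below, with the TransNetV2 and InternVL-3.5 stages abstracted behind hypothetical `detect_shot_boundaries` and `annotate_shot` helpers; only the FFmpeg crop invocation is a real CLI call, and the actual integration code is not part of the release.

```python
import json
import subprocess
from pathlib import Path

def detect_shot_boundaries(video: Path) -> list[tuple[int, int]]:
    """Hypothetical wrapper around TransNetV2; returns (start, end) frame indices."""
    raise NotImplementedError

def annotate_shot(video: Path, start: int, end: int) -> dict:
    """Hypothetical wrapper around InternVL-3.5; returns description + attributes."""
    raise NotImplementedError

def remove_black_bars(src: Path, dst: Path, crop: str) -> None:
    # Real FFmpeg crop filter; `crop` is "w:h:x:y", e.g. from a cropdetect pass.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", f"crop={crop}", str(dst)],
        check=True,
    )

def process_movie(video: Path, out_dir: Path) -> None:
    for shot_id, (start, end) in enumerate(detect_shot_boundaries(video), 1):
        meta = annotate_shot(video, start, end)
        record = {
            "shot_id": shot_id,
            "start_frame": f"F{shot_id:06d}_S.png",  # first frame = start keyframe
            "end_frame": f"F{shot_id:06d}_E.png",    # last frame = end keyframe
            "description": meta["description"],
            "attributes": meta["attributes"],
        }
        (out_dir / f"shot_{shot_id:06d}.json").write_text(json.dumps(record))
```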
4. Data Organization, File Structure, and Specialized Subsets
The dataset is organized as a nested directory structure, separated into training, testing, and human-preference subsets:
```text
/ConStoryBoard/
  train_v1.0/
    frames/shot_000001/F000001_S.png
    frames/shot_000001/F000001_E.png
    metadata/shot_000001.json
    ...
  test_v1.0/    (mirrors the training structure for the 1,000 test shots)
  hp_pairs/     (ConStoryBoard-HP preference tuples)
```
No encapsulated memory packs are distributed; such structures are unique to the internal model design (Eq. 1 in (Zhang et al., 13 Dec 2025)) and not part of the data release. All samples are annotated at the shot level; no additional granularities are provided.
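Given this layout, loading a sample reduces to path construction. The sketch below assumes the root path and six-digit shot IDs shown above; the release does not ship loader code.

```python
import json
from pathlib import Path

ROOT = Path("/ConStoryBoard/train_v1.0")  # assumed mount point

def load_shot(shot_id: int) -> dict:
    """Load one training sample: metadata plus paths to the two keyframes."""
    record = json.loads((ROOT / "metadata" / f"shot_{shot_id:06d}.json").read_text())
    shot_dir = ROOT / "frames" / f"shot_{shot_id:06d}"
    record["start_path"] = shot_dir / record["start_frame"]
    record["end_path"] = shot_dir / record["end_frame"]
    return record

sample = load_shot(1)  # -> description, attributes, and keyframe paths
```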
5. Statistics, Coverage, and Benchmarks
ConStoryBoard samples exhibit broad diversity in both cinematic technique and genre. The reported coverage statistics include:
| Attribute | Distribution (%) | Notes |
|---|---|---|
| Shot scale | Close-up 30 / Medium 40 / Long 30 | Proportional coverage across categories |
| Camera movement | Static 55 / Pan 20 / Tilt 15 / Zoom 10 | Multimodal movement representation |
| Genre | Drama, action, thriller, comedy, romance | Drawn from Condensed Movies |
Baseline evaluation tasks include start–end frame–pair prediction (using a supervised flow-matching loss, Eq. 5) and human-preference alignment (DPO loss, Eq. 6). On the 1,000-shot test split, the reported flow-matching loss is ≈ 0.XX and the DPO-aligned preference accuracy is ≈ 85%.
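Eqs. 5–6 themselves are not reproduced in this summary, but both objectives are standard. A minimal PyTorch sketch of conditional flow matching and DPO over ConStoryBoard-HP tuples follows; β = 0.1 is an assumed default, not a value from the paper.

```python
import torch.nn.functional as F

def flow_matching_loss(v_pred, x0, x1):
    """Conditional flow matching (cf. Eq. 5): regress the predicted velocity
    onto the straight-line target x1 - x0 at the sampled timestep."""
    return F.mse_loss(v_pred, x1 - x0)

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective (cf. Eq. 6) over ConStoryBoard-HP tuples:
    prefer the ground-truth start-end pair over the sampled interior pair.
    beta=0.1 is an assumed default, not a value from the paper."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()
```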
Dataset figures (see Fig. 4 in (Zhang et al., 13 Dec 2025)) provide representative examples and full histograms of shot-scale and camera movement distributions.
6. Licensing and Access
ConStoryBoard is to be released under the CC BY-NC-SA 4.0 license, pending final acceptance of the accompanying publication. The dataset, comprising all frames, metadata, and human-preference pairs, will be available upon acceptance at https://github.com/YourLab/ConStoryBoard. Further inquiries can be directed to the corresponding author at [email protected] (Zhang et al., 13 Dec 2025).
This ensures that academic researchers have access to a well-documented, large-scale resource for supervised, structured, and preference-based video generation tasks within the cinematic narrative domain.