ConStoryBoard Dataset for Cinematic Storyboarding
- ConStoryBoard is a large-scale, shot-level annotated dataset that supports storyboard-anchored video synthesis with detailed cinematic attributes.
- It employs automated shot boundary detection via TransNetV2 and annotation via InternVL-3.5, with standardized preprocessing to ensure high-quality, high-resolution movie clips.
- The dataset structure enables both supervised and preference-based modeling of shot transitions, with benchmarks showing around 85% human-preference alignment.
ConStoryBoard is a large-scale, structurally annotated dataset of movie shots developed for the training and evaluation of start–end frame–pair prediction models within the context of cinematic multi-shot video generation. Created in support of the STAGE (SToryboard-Anchored Generation) workflow, ConStoryBoard provides high-resolution movie clip samples paired with fine-grained annotations for story progression, cinematic attributes, and human preferences, thereby enabling supervised and preference-based modeling of storyboard-anchored narrative video synthesis (Zhang et al., 13 Dec 2025).
1. Dataset Scope, Scale, and Source
ConStoryBoard comprises 101,000 shot-level samples, partitioned into 100,000 for training and 1,000 for held-out testing. Each sample consists of a single, contiguous shot excerpted from a longer movie sequence. The dataset's raw material is sourced from the open-source Condensed Movies collection (Bain et al., 2020), filtered to include only high-resolution (at least 1080p) shots with LAION aesthetic scores exceeding 5.5.
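The two admission criteria translate into a simple predicate. The sketch below is illustrative only: it assumes the shot height and a precomputed LAION aesthetic score are already available, and the `ShotCandidate`/`keep_shot` names are hypothetical, since the paper does not describe the surrounding tooling.

```python
# Minimal filtering sketch (hypothetical helpers; the release does not
# specify the exact tooling around the LAION aesthetic predictor).
from dataclasses import dataclass

MIN_HEIGHT = 1080          # keep only shots at 1080p or higher
MIN_AESTHETIC_SCORE = 5.5  # LAION aesthetic score threshold

@dataclass
class ShotCandidate:
    path: str
    height: int
    aesthetic_score: float  # assumed precomputed with the LAION predictor

def keep_shot(shot: ShotCandidate) -> bool:
    """Apply the resolution and aesthetic-quality filters described above."""
    return shot.height >= MIN_HEIGHT and shot.aesthetic_score > MIN_AESTHETIC_SCORE

candidates = [
    ShotCandidate("shot_000001.mp4", 1080, 6.1),
    ShotCandidate("shot_000002.mp4", 720, 6.8),   # dropped: below 1080p
]
kept = [s for s in candidates if keep_shot(s)]
```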
Each shot is associated with its structural storyboard specification, consisting of start and end frame pairs and a structured annotation tuple ⟨Dᵢ, Cᵢ⟩, where Dᵢ is a natural-language description and Cᵢ encodes metadata about cinematic properties. The dataset thereby supports predictive modeling of shot-level transitions anchored to explicit storyboard representations.
2. Annotation Schema
Annotations in ConStoryBoard operate exclusively at the shot level. The core schema consists of:
- Story progression description Dᵢ: Natural-language summary of the depicted action, generated automatically via InternVL-3.5.
- Cinematic attributes Cᵢ: Structured metadata including shot scale (close-up, medium, long), camera angle (high, eye, low), camera movement (static, pan, tilt, zoom), and shot length in frames.
- Human-preference judgments (ConStoryBoard-HP): For a curated subset of approximately 5,000 high-quality shots, each ground-truth start–end pair is paired with a negative sample (two frames sampled from the shot's intermediate frames), forming a preference tuple of one positive and one negative frame pair (a construction sketch follows the JSON examples below).
All annotations are serialized as structured JSON or CSV/JSONL. An example shot-level JSON annotation:
```json
{
  "shot_id": 1234,
  "start_frame": "F1234_S.png",
  "end_frame": "F1234_E.png",
  "description": "A woman closes the door behind her and scans the empty room.",
  "attributes": { "scale": "medium", "angle": "eye", "movement": "pan", "length_frames": 48 }
}
```
Preference tuples are provided as:
```json
{"shot_id": 1234, "positive_pair": ["F1234_S.png", "F1234_E.png"], "negative_pair": ["F1234_M1.png", "F1234_M2.png"]}
```
3. Data Collection and Preprocessing Pipeline
The construction of ConStoryBoard is characterized by a standardized preprocessing workflow:
- Shot boundary detection: TransNetV2 is employed to identify precise shot boundaries, extracting start and end frame indices.
- Annotation generation: InternVL-3.5 generates the story progression description (Dᵢ) and the cinematic attribute metadata (Cᵢ).
- Keyframe extraction: For each shot, the first and last frames are extracted as the start and end keyframes (the `_S` and `_E` files in the naming convention).
- Post-processing: Cropping routines remove black bars, and watermark removal is performed using FFmpeg and SoraWatermarkCleaner.
A plausible implication is that this rigorous preprocessing yields high-quality, uniform input for downstream modeling with minimal confounding artifacts.
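A minimal end-to-end sketch of this pipeline is given below, with the TransNetV2 and InternVL-3.5 stages abstracted behind hypothetical `detect_shot_boundaries` and `annotate_shot` helpers; only the FFmpeg crop invocation is a real CLI call, and the actual integration code is not part of the release.

```python
import json
import subprocess
from pathlib import Path

def detect_shot_boundaries(video: Path) -> list[tuple[int, int]]:
    """Hypothetical wrapper around TransNetV2; returns (start, end) frame indices."""
    raise NotImplementedError

def annotate_shot(video: Path, start: int, end: int) -> dict:
    """Hypothetical wrapper around InternVL-3.5; returns description + attributes."""
    raise NotImplementedError

def remove_black_bars(src: Path, dst: Path, crop: str) -> None:
    # Real FFmpeg crop filter; `crop` is "w:h:x:y", e.g. from a cropdetect pass.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", f"crop={crop}", str(dst)],
        check=True,
    )

def process_movie(video: Path, out_dir: Path) -> None:
    for shot_id, (start, end) in enumerate(detect_shot_boundaries(video), 1):
        meta = annotate_shot(video, start, end)
        record = {
            "shot_id": shot_id,
            "start_frame": f"F{shot_id:06d}_S.png",  # first frame = start keyframe
            "end_frame": f"F{shot_id:06d}_E.png",    # last frame = end keyframe
            "description": meta["description"],
            "attributes": meta["attributes"],
        }
        (out_dir / f"shot_{shot_id:06d}.json").write_text(json.dumps(record))
```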
4. Data Organization, File Structure, and Specialized Subsets
The dataset is organized as a nested directory structure, separated into training, testing, and human-preference subsets:
```text
/ConStoryBoard/
  train_v1.0/
    frames/shot_000001/F000001_S.png
    frames/shot_000001/F000001_E.png
    metadata/shot_000001.json
    ...
  test_v1.0/    (mirrors the training structure for the 1,000 test shots)
  hp_pairs/     (ConStoryBoard-HP preference tuples)
```
No encapsulated memory packs are distributed; such structures are unique to the internal model design (Eq. 1 in (Zhang et al., 13 Dec 2025)) and not part of the data release. All samples are annotated at the shot level; no additional granularities are provided.
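Given this layout, loading a sample reduces to path construction. The sketch below assumes the root path and six-digit shot IDs shown above; the release does not ship loader code.

```python
import json
from pathlib import Path

ROOT = Path("/ConStoryBoard/train_v1.0")  # assumed mount point

def load_shot(shot_id: int) -> dict:
    """Load one training sample: metadata plus paths to the two keyframes."""
    record = json.loads((ROOT / "metadata" / f"shot_{shot_id:06d}.json").read_text())
    shot_dir = ROOT / "frames" / f"shot_{shot_id:06d}"
    record["start_path"] = shot_dir / record["start_frame"]
    record["end_path"] = shot_dir / record["end_frame"]
    return record

sample = load_shot(1)  # -> description, attributes, and keyframe paths
```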
5. Statistics, Coverage, and Benchmarks
ConStoryBoard samples exhibit broad diversity in both cinematic technique and genre. The reported coverage statistics include:
| Attribute | Distribution (%) | Notes |
|---|---|---|
| Shot scale | Close-up 30 / Medium 40 / Long 30 | Proportional coverage across categories |
| Camera movement | Static 55 / Pan 20 / Tilt 15 / Zoom 10 | Multimodal movement representation |
| Genre | Drama, action, thriller, comedy, romance | Drawn from Condensed Movies |
Baseline evaluation tasks include start–end frame–pair prediction (using a supervised flow-matching loss, Eq. 5) and human-preference alignment (DPO loss, Eq. 6). On the 1,000-shot test split, the reported flow-matching loss is ≈ 0.XX and the DPO-aligned preference accuracy is ≈ 85%.
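Eqs. 5–6 themselves are not reproduced in this summary, but both objectives are standard. A minimal PyTorch sketch of conditional flow matching and DPO over ConStoryBoard-HP tuples follows; β = 0.1 is an assumed default, not a value from the paper.

```python
import torch.nn.functional as F

def flow_matching_loss(v_pred, x0, x1):
    """Conditional flow matching (cf. Eq. 5): regress the predicted velocity
    onto the straight-line target x1 - x0 at the sampled timestep."""
    return F.mse_loss(v_pred, x1 - x0)

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective (cf. Eq. 6) over ConStoryBoard-HP tuples:
    prefer the ground-truth start-end pair over the sampled interior pair.
    beta=0.1 is an assumed default, not a value from the paper."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()
```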
Dataset figures (see Fig. 4 in (Zhang et al., 13 Dec 2025)) provide representative examples and full histograms of shot-scale and camera movement distributions.
6. Licensing and Access
ConStoryBoard is to be released under the CC BY-NC-SA 4.0 license, pending final acceptance of the accompanying publication. The dataset, comprising all frames, metadata, and human-preference pairs, will be available upon acceptance at https://github.com/YourLab/ConStoryBoard. Further inquiries can be directed to the corresponding author at [email protected] (Zhang et al., 13 Dec 2025).
This ensures that academic researchers have access to a well-documented, large-scale resource for supervised, structured, and preference-based video generation tasks within the cinematic narrative domain.