ShotWeaver40K Dataset
- ShotWeaver40K is a large-scale benchmark dataset of 40,000 two-shot video clips curated to capture professional film editing patterns and enable controllable multi-shot video generation.
- Its curation pipeline combines automated shot segmentation, image stitching, CLIP-based filtering, and hierarchical GPT-5-mini annotation to produce detailed scene- and shot-level captions.
- The dataset supports research with standardized metrics like Transition Confidence Score, TrAcc, and Fréchet Video Distance, advancing cinematic language modeling and directorial control.
ShotWeaver40K is a large-scale benchmark dataset comprising 40,000 two-shot video clips curated for the task of controllable multi-shot video generation, with explicit focus on film-like editing patterns and directorial transitions. Developed to support the evaluation and training of models seeking fine-grained control over shot composition, camera parameters, and professional editing schemes, ShotWeaver40K encodes the priors of real film-editing within a rigorously annotated structure and serves as the foundation for frameworks such as ShotDirector (Wu et al., 11 Dec 2025).
1. Dataset Composition and Distribution
ShotWeaver40K consists of 40,000 two-shot video clips, resulting in a total of 80,000 individual shots and 40,000 annotated transitions—one per clip. These transitions are evenly distributed among four professional editing patterns: Cut-in, Cut-out, Shot/Reverse-Shot, and Multi-Angle, with each occupying approximately 25% of the dataset. The clips originate from 16,000 full-length films, with careful selection to capture diversity in cinematic style and shot-relation semantics.
Dataset splits are as follows:
- Training: 32,000 videos (80%)
- Validation: 4,000 videos (10%)
- Test: 4,000 videos (10%)
| Split | Videos | Cut-in | Cut-out | Shot/Rev | Multi-Angle |
|---|---|---|---|---|---|
| Train | 32,000 | 8,000 | 8,000 | 8,000 | 8,000 |
| Validation | 4,000 | 1,000 | 1,000 | 1,000 | 1,000 |
| Test | 4,000 | 1,000 | 1,000 | 1,000 | 1,000 |
The split definition is formalized as $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}} \cup \mathcal{D}_{\text{test}}$, where $|\mathcal{D}_{\text{train}}| : |\mathcal{D}_{\text{val}}| : |\mathcal{D}_{\text{test}}| = 8 : 1 : 1$.
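As a sanity check, the balanced per-pattern counts in the table above can be reproduced in a few lines. This is a sketch; the dict layout is illustrative, not an official loader:

```python
# Sketch: reconstruct the ShotWeaver40K split/pattern counts from Section 1.
TOTAL = 40_000
PATTERNS = ["Cut-in", "Cut-out", "Shot/Reverse-Shot", "Multi-Angle"]
SPLIT_PCT = {"train": 80, "val": 10, "test": 10}  # percent of the full set

# counts[split][pattern]: clips of that editing pattern in that split.
counts = {
    split: {p: TOTAL * pct // 100 // len(PATTERNS) for p in PATTERNS}
    for split, pct in SPLIT_PCT.items()
}

# The splits partition the full dataset exactly.
assert sum(sum(c.values()) for c in counts.values()) == TOTAL
print(counts["train"]["Cut-in"])  # 8000 clips per pattern in the training split
```

Integer arithmetic is used deliberately so the per-pattern counts come out exact.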
2. Data Acquisition and Annotation Pipeline
The data acquisition process integrates multiple automated modules and manual annotation stages:
- Shot Segmentation: TransNetV2 is used to segment raw material from 16,000 films into discrete shots.
- Stitching: ImageBind similarity stitches visually related segments (above a similarity threshold) into two-shot sequences.
- Coarse Filtering: imposes constraints on minimum frame rate (24 fps), resolution (≥720p), duration (5–12 s), and a LAION-based aesthetic score evaluated on frames near the transition.
- Fine Filtering: applies CLIP-based similarity filtering to eliminate "flash" cuts and other non-transitions, and a Qwen2-VL vision-language model checks semantic continuity for logical sequence flow.
- Annotation: Hierarchical GPT-5-mini annotates each video at three levels:
- Scene-level caption
- Shot-level captions (subject, framing, style cues)
- Transition type and rationale
- Camera Parameters: For each shot, intrinsics and extrinsics are estimated by VGGT relative to shot 1.
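The coarse-filtering constraints above (frame rate, resolution, duration) reduce to a simple predicate per clip. The sketch below assumes hypothetical metadata field names; the LAION aesthetic-score check is omitted since it requires a trained model:

```python
from dataclasses import dataclass

@dataclass
class ClipMeta:
    """Hypothetical per-clip metadata; field names are illustrative."""
    fps: float
    height: int          # vertical resolution in pixels
    duration_s: float    # clip length in seconds

def passes_coarse_filter(clip: ClipMeta) -> bool:
    """Coarse constraints from Section 2: >=24 fps, >=720p, 5-12 s duration."""
    return clip.fps >= 24 and clip.height >= 720 and 5.0 <= clip.duration_s <= 12.0

print(passes_coarse_filter(ClipMeta(fps=24, height=1080, duration_s=8.0)))  # True
print(passes_coarse_filter(ClipMeta(fps=30, height=480, duration_s=8.0)))   # False (below 720p)
```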
3. Data Structure and Metadata Specification
Each video is stored in an MP4 container (H.264, 24 fps, 256×256 px). Optional masks are provided as 8-bit PNG sequences. Metadata per video is encapsulated in JSON format, capturing detailed shot and transition information:
```json
{
  "video_id": "SW_00001",
  "shots": [
    {
      "shot_id": 1,
      "start_frame": 0,
      "end_frame": 75,
      "intrinsics": {"f": 50.0, "cx": 128, "cy": 128},
      "extrinsics": {"R": [[…]], "t": […]},
      "caption": "Medium shot of protagonist walking forward."
    },
    {
      "shot_id": 2,
      "start_frame": 76,
      "end_frame": 175,
      "intrinsics": {…},
      "extrinsics": {…},
      "caption": "Cut-in close-up of footwear tapping."
    }
  ],
  "transition": {"type": "Cut-in", "description": "Zoom to feet for emphasis."}
}
```
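A minimal reader for a record in this layout might look as follows. This is a sketch against the example above, not an official loader; the elided intrinsics/extrinsics are stubbed out with empty objects:

```python
import json

# Example record following the metadata schema, with elided fields stubbed out.
record = json.loads("""
{
  "video_id": "SW_00001",
  "shots": [
    {"shot_id": 1, "start_frame": 0, "end_frame": 75,
     "intrinsics": {"f": 50.0, "cx": 128, "cy": 128},
     "extrinsics": {}, "caption": "Medium shot of protagonist walking forward."},
    {"shot_id": 2, "start_frame": 76, "end_frame": 175,
     "intrinsics": {}, "extrinsics": {},
     "caption": "Cut-in close-up of footwear tapping."}
  ],
  "transition": {"type": "Cut-in", "description": "Zoom to feet for emphasis."}
}
""")

def shot_durations(rec: dict, fps: float = 24.0) -> list[float]:
    """Return per-shot durations in seconds (frame ranges are inclusive)."""
    return [(s["end_frame"] - s["start_frame"] + 1) / fps for s in rec["shots"]]

durations = shot_durations(record)
print(record["transition"]["type"], [round(d, 2) for d in durations])
# Cut-in [3.17, 4.17]
```

Note that shot durations derived from frame ranges at 24 fps fall inside the 5–12 s clip-level window once both shots are combined.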
4. Editing Pattern Taxonomy
ShotWeaver40K encodes four canonical editing transitions:
- Cut-in: Immediate shift to a tighter framing of the same subject.
- Cut-out: Shift to a wider, more contextual shot.
- Shot/Reverse-Shot: Alternates between dialogue participants.
- Multi-Angle: Switches among distinct viewpoints of identical action.
Each pattern appears with frequency $1/4$. For extensions to longer shot sequences, ShotWeaver40K supports analysis of co-occurrence statistics: let $N_{ij}$ denote the number of times transition type $i$ is immediately followed by type $j$; the conditional probability of the next transition is then $P(j \mid i) = N_{ij} / \sum_k N_{ik}$. This structure provides a formal basis for learning cinematic language priors and modeling narrative flow.
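The co-occurrence statistic described above reduces to a normalized bigram count over transition sequences. A minimal sketch with toy data (the sequences are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy transition sequences (invented for illustration).
sequences = [
    ["Cut-in", "Cut-out", "Multi-Angle"],
    ["Shot/Reverse-Shot", "Shot/Reverse-Shot", "Cut-in"],
    ["Cut-in", "Cut-out"],
]

# N[i][j]: number of times type i is immediately followed by type j.
N = defaultdict(Counter)
for seq in sequences:
    for i, j in zip(seq, seq[1:]):
        N[i][j] += 1

def p_next(i: str, j: str) -> float:
    """Conditional probability P(j | i) = N_ij / sum_k N_ik."""
    total = sum(N[i].values())
    return N[i][j] / total if total else 0.0

print(p_next("Cut-in", "Cut-out"))  # 1.0: both observed Cut-in successors are Cut-out
```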
5. Evaluation Metrics
ShotDirector introduces standardized metrics for quantifying performance in multi-shot video generation using ShotWeaver40K:
- Transition Confidence Score (TrConf): confidence that a hard cut occurs at the intended transition point, computed from the TransNetV2 transition logit $s_t$ at frame $t$.
- Transition Type Accuracy (TrAcc): fraction of generated clips whose realized transition matches the requested editing pattern.
- Aesthetic Score (Aes): Predicts overall visual appeal using a LAION-trained model.
- Imaging Quality (Img): MUSIQ image quality metric.
- Semantic Alignment (Sem): text-video alignment between per-shot captions and the generated frames.
- Fréchet Video Distance (FVD): Distribution similarity for generative models.
- Semantic Consistency (SemC): Mean ViCLIP similarity for both shots.
- Visual Consistency (VisC): feature-level similarity between the two shots, quantifying appearance coherence across the cut.
- Camera Parameter Fidelity: agreement between the camera intrinsics/extrinsics estimated from the generated shots and the conditioning parameters.
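Of these metrics, TrAcc has the simplest form. Under the assumption that it is the per-clip match rate between requested and classifier-detected transition types, it can be sketched as:

```python
def transition_type_accuracy(requested: list[str], detected: list[str]) -> float:
    """TrAcc sketch: fraction of clips whose realized transition type matches
    the requested editing pattern (detected labels from a classifier are assumed)."""
    assert len(requested) == len(detected)
    matches = sum(r == d for r, d in zip(requested, detected))
    return matches / len(requested)

requested = ["Cut-in", "Cut-out", "Multi-Angle", "Shot/Reverse-Shot"]
detected  = ["Cut-in", "Cut-out", "Cut-in",      "Shot/Reverse-Shot"]
print(transition_type_accuracy(requested, detected))  # 0.75
```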
6. Usage Protocols and Baseline Methods
Recommended usage involves a two-stage training strategy:
- Stage I: Train exclusively on ShotWeaver40K real data for 10,000 steps.
- Stage II: Mix 70% ShotWeaver40K real data with 30% SynCamVideo synthetic data for 3,000 steps.
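The Stage II real/synthetic mixing can be implemented as weighted sampling per batch element. A minimal sketch, with placeholder pool contents:

```python
import random

def sample_batch(real_pool, synthetic_pool, batch_size=8, real_frac=0.7, seed=0):
    """Draw a Stage-II training batch with ~70% real / 30% synthetic clips."""
    rng = random.Random(seed)
    return [
        rng.choice(real_pool) if rng.random() < real_frac else rng.choice(synthetic_pool)
        for _ in range(batch_size)
    ]

real_pool = [f"SW_{i:05d}" for i in range(32_000)]       # ShotWeaver40K train split
synthetic_pool = [f"SYN_{i:05d}" for i in range(1_000)]  # SynCamVideo placeholder IDs
batch = sample_batch(real_pool, synthetic_pool)
print(len(batch))  # 8
```

Sampling per element (rather than fixing exact counts per batch) keeps the mix at 70/30 in expectation while avoiding a rigid batch layout.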
Evaluation prompts should use global scene captions, per-shot captions, and explicit camera pose parameters. Baseline performance (test split, Table 1) is tabulated below:
| Method | TrConf | TrAcc | Aes | Img | Sem | FVD | SemC | VisC |
|---|---|---|---|---|---|---|---|---|
| Mask²DiT | 0.223 | 0.203 | 0.596 | 0.684 | 0.780 | 69.49 | 0.780 | 0.778 |
| CineTrans | 0.798 | 0.394 | 0.631 | 0.691 | 0.791 | 71.89 | 0.792 | 0.785 |
| StoryDiffusion | – | 0.522 | 0.581 | 0.674 | 0.452 | 92.21 | 0.452 | 0.587 |
| Phantom | – | 0.621 | 0.618 | 0.679 | 0.538 | 86.61 | 0.538 | 0.571 |
| HunyuanVideo | 0.470 | 0.322 | 0.610 | 0.616 | 0.570 | 69.88 | 0.570 | 0.660 |
| Wan2.2 | 0.217 | 0.102 | 0.589 | 0.620 | 0.689 | 69.48 | 0.689 | 0.755 |
| SynCamMaster | – | 0.303 | 0.545 | 0.618 | 0.795 | 72.47 | 0.795 | 0.842 |
| ReCamMaster | 0.027 | 0.033 | 0.549 | 0.611 | – | 71.51 | – | – |
| ShotDirector | 0.896 | 0.674 | 0.637 | 0.698 | 0.792 | 68.45 | 0.792 | 0.825 |
The results demonstrate varying degrees of competency across controllability, aesthetic fidelity, and editing-pattern consistency, with ShotDirector providing the strongest metrics on this benchmark in terms of transition and semantic accuracy.
7. Research Significance and Benchmark Role
ShotWeaver40K provides a standard benchmark for developing and evaluating frameworks that address directorial control in multi-shot video generation. By explicitly encoding cinematic transition patterns and detailed camera parameters, the dataset allows systematic analysis of shot-to-shot relationships, transition modeling, and directorial intent. Its balanced taxonomy and structured metadata enable models to go beyond naive sequential consistency, advancing the generation of film-like visual narratives. A plausible implication is that ShotWeaver40K, due to its annotation richness and focus on professional editing language, will facilitate exploration of hierarchical prompt engineering, semantic alignment, and camera control in generative vision research (Wu et al., 11 Dec 2025).