
ShotWeaver40K Dataset

Updated 17 December 2025
  • ShotWeaver40K is a large-scale benchmark dataset of 40,000 two-shot video clips curated to capture professional film editing patterns and enable controllable multi-shot video generation.
  • It is built with a comprehensive pipeline combining automated shot segmentation, image stitching, CLIP-based filtering, and hierarchical GPT-5-mini annotation for detailed scene- and shot-level descriptions.
  • The dataset supports research with standardized metrics such as the Transition Confidence Score, transition type accuracy (TrAcc), and Fréchet Video Distance, advancing cinematic language modeling and directorial control.

ShotWeaver40K is a large-scale benchmark dataset comprising 40,000 two-shot video clips curated for the task of controllable multi-shot video generation, with explicit focus on film-like editing patterns and directorial transitions. Developed to support the evaluation and training of models seeking fine-grained control over shot composition, camera parameters, and professional editing schemes, ShotWeaver40K encodes the priors of real film-editing within a rigorously annotated structure and serves as the foundation for frameworks such as ShotDirector (Wu et al., 11 Dec 2025).

1. Dataset Composition and Distribution

ShotWeaver40K consists of 40,000 two-shot video clips, resulting in a total of 80,000 individual shots and 40,000 annotated transitions—one per clip. These transitions are evenly distributed among four professional editing patterns: Cut-in, Cut-out, Shot/Reverse-Shot, and Multi-Angle, with each occupying approximately 25% of the dataset. The clips originate from 16,000 full-length films, with careful selection to capture diversity in cinematic style and shot-relation semantics.

Dataset splits are as follows:

  • Training: 32,000 videos (80%)
  • Validation: 4,000 videos (10%)
  • Test: 4,000 videos (10%)
Split        Videos    Cut-in   Cut-out   Shot/Rev   Multi-Angle
Train        32,000    8,000    8,000     8,000      8,000
Validation    4,000    1,000    1,000     1,000      1,000
Test          4,000    1,000    1,000     1,000      1,000

The split definition is formalized as $N_{\mathrm{split}} = \alpha_{\mathrm{split}} \times 40{,}000$, where $(\alpha_{\mathrm{train}}, \alpha_{\mathrm{val}}, \alpha_{\mathrm{test}}) = (0.8, 0.1, 0.1)$.
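
As a quick check of this arithmetic, the split and per-pattern counts in the table above can be reproduced directly (a minimal sketch):

import math

TOTAL_CLIPS = 40_000
SPLIT_FRACTIONS = {"train": 0.8, "val": 0.1, "test": 0.1}
PATTERNS = ["Cut-in", "Cut-out", "Shot/Reverse-Shot", "Multi-Angle"]

for split, alpha in SPLIT_FRACTIONS.items():
    n_split = int(alpha * TOTAL_CLIPS)        # N_split = alpha_split * 40,000
    per_pattern = n_split // len(PATTERNS)    # balanced ~25% per editing pattern
    print(f"{split}: {n_split} clips, {per_pattern} per pattern")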

2. Data Acquisition and Annotation Pipeline

The data acquisition process integrates multiple automated modules and manual annotation stages:

  • Shot Segmentation: TransNetV2 is used to segment raw material from 16,000 films into discrete shots.
  • Stitching: ImageBind (threshold $\theta_{\mathrm{stitch}} = 0.65$) stitches visually similar segments into candidate two-shot sequences.
  • Coarse Filtering imposes constraints: minimum frame rate (24 fps), resolution (≥720p), duration (5–12 s), and a LAION-based aesthetic score computed on frames near the transition.
  • Fine Filtering applies a CLIP-based similarity threshold ($< 0.95$) to eliminate "flash" or non-transitions, and Qwen2-VL semantic-continuity checks to ensure logical sequence flow (see the sketch after this list).
  • Annotation: Hierarchical GPT-5-mini annotates each video at three levels:
    • Scene-level caption
    • Shot-level captions (subject, framing, style cues)
    • Transition type and rationale
  • Camera Parameters: For each shot, intrinsics $K \in \mathbb{R}^{3 \times 3}$ and extrinsics $E = [R \mid t]$, $R \in SO(3)$, $t \in \mathbb{R}^3$ are estimated by VGGT relative to shot 1.
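
The coarse and fine filtering logic can be pictured with a short sketch. This is illustrative only: the clip object, clip_embed, and aesthetic_score are hypothetical placeholders (the actual pipeline uses TransNetV2, ImageBind, CLIP, a LAION aesthetic predictor, and Qwen2-VL), and the aesthetic cutoff aes_min is an assumed value; only the frame-rate, resolution, duration, and 0.95 similarity constraints come from the description above.

import numpy as np

CLIP_SIM_MAX = 0.95        # reject near-duplicate "flash" or false transitions
MIN_FPS, MIN_HEIGHT = 24, 720
MIN_DUR, MAX_DUR = 5.0, 12.0

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_clip(clip, clip_embed, aesthetic_score, aes_min=5.0):
    """Hypothetical coarse + fine filter for one candidate two-shot clip."""
    # Coarse filtering: frame rate, resolution, duration, aesthetics near the cut.
    if clip.fps < MIN_FPS or clip.height < MIN_HEIGHT:
        return False
    if not (MIN_DUR <= clip.duration <= MAX_DUR):
        return False
    if aesthetic_score(clip.frame_before_cut) < aes_min:   # aes_min is an assumed value
        return False
    # Fine filtering: CLIP similarity across the cut must stay below 0.95,
    # otherwise the "transition" is likely a flash/non-transition.
    sim = cosine(clip_embed(clip.frame_before_cut), clip_embed(clip.frame_after_cut))
    return sim < CLIP_SIM_MAX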

3. Data Structure and Metadata Specification

Each video is stored in an MP4 container (H.264, 24 fps, 256×256 px). Optional masks are provided as 8-bit PNG sequences. Metadata per video is encapsulated in JSON format, capturing detailed shot and transition information:

{
  "video_id": "SW_00001",
  "shots": [
    {
      "shot_id": 1,
      "start_frame": 0,
      "end_frame": 75,
      "intrinsics": {"f": 50.0, "cx": 128, "cy": 128},
      "extrinsics": {"R": [[]], "t": []},
      "caption": "Medium shot of protagonist walking forward."
    },
    {
      "shot_id": 2,
      "start_frame": 76,
      "end_frame": 175,
      "intrinsics": {},
      "extrinsics": {},
      "caption": "Cut-in close-up of footwear tapping."
    }
  ],
  "transition": { "type": "Cut-in", "description": "Zoom to feet for emphasis." }
}

The schema formalizes the key fields: $\texttt{intrinsics} = \{f, (c_x, c_y)\}$, $\texttt{extrinsics} = \{R \in \mathbb{R}^{3 \times 3},\ t \in \mathbb{R}^3\}$, and $\texttt{transition.type} \in \{\mathrm{Cut\text{-}in}, \mathrm{Cut\text{-}out}, \mathrm{Shot/Rev}, \mathrm{Multi}\}$.
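
A minimal loading sketch for this metadata, assuming one JSON file per video with the field names shown in the example above (the file name is illustrative):

import json

def load_metadata(path="SW_00001.json"):
    """Load one ShotWeaver40K metadata record (schema as in the example above)."""
    with open(path) as f:
        meta = json.load(f)
    for shot in meta["shots"]:
        focal = shot["intrinsics"].get("f")            # focal length, if provided
        n_frames = shot["end_frame"] - shot["start_frame"] + 1
        print(f"shot {shot['shot_id']}: {n_frames} frames, f={focal}, "
              f"caption={shot['caption']!r}")
    print("transition:", meta["transition"]["type"])
    return meta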

4. Editing Pattern Taxonomy

ShotWeaver40K encodes four canonical editing transitions:

  • Cut-in: Immediate shift to a tighter framing of the same subject.
  • Cut-out: Shift to a wider, more contextual shot.
  • Shot/Reverse-Shot: Alternates between dialogue participants.
  • Multi-Angle: Switches among distinct viewpoints of identical action.

Each pattern appears with frequency $f_i \approx 0.25$. For extensions to longer shot sequences, ShotWeaver40K supports analysis of co-occurrence statistics: let $N_{ij}$ denote the frequency with which transition type $i$ is followed by type $j$; the conditional probability is then $P_{ij} = \frac{N_{ij}}{\sum_k N_{ik}}$. This structure provides a formal basis for learning cinematic language priors and modeling narrative flow.
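
These co-occurrence statistics are straightforward to estimate; a minimal sketch, assuming transition labels for longer multi-shot sequences are available as lists of strings:

import numpy as np

TYPES = ["Cut-in", "Cut-out", "Shot/Rev", "Multi-Angle"]

def transition_matrix(annotated_sequences):
    """Estimate P_ij = N_ij / sum_k N_ik from consecutive transition labels."""
    idx = {t: i for i, t in enumerate(TYPES)}
    counts = np.zeros((len(TYPES), len(TYPES)))
    for seq in annotated_sequences:             # e.g. ["Cut-in", "Shot/Rev", ...]
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1         # N_ij: type i followed by type j
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)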

5. Evaluation Metrics

ShotDirector introduces standardized metrics for quantifying performance in multi-shot video generation using ShotWeaver40K:

  • Transition Confidence Score (TrConf):

$$\mathrm{TrConf} = \max_f \sigma(d_f)$$

where $d_f$ is the TransNetV2 logit at frame $f$.

  • Transition Type Accuracy (TrAcc):

$$\mathrm{TrAcc} = \frac{\#\{\text{correctly classified types}\}}{N_{\mathrm{eval}}}$$

  • Aesthetic Score (Aes): Predicts overall visual appeal using a LAION-trained model.
  • Imaging Quality (Img): MUSIQ image quality metric.
  • Semantic Alignment (Sem):

$$\mathrm{Sem} = \cos\big(E_{\mathrm{ViCLIP}}(\text{video}),\ E_{\mathrm{ViCLIP}}(\text{text})\big)$$

  • Visual Consistency (VisC), measuring cross-shot consistency of subject and background features:

$$\mathrm{VisC} = \frac{1}{2}\left(\cos(S_{\mathrm{subj},1}, S_{\mathrm{subj},2}) + \cos(B_{\mathrm{bkg},1}, B_{\mathrm{bkg},2})\right)$$

  • Camera Parameter Fidelity:

$$\mathrm{RotErr} = \frac{1}{N} \sum_i \left\| R^{\mathrm{pred}}_i - R^{\mathrm{gt}}_i \right\|_F, \qquad \mathrm{TransErr} = \frac{1}{N} \sum_i \left\| t^{\mathrm{pred}}_i - t^{\mathrm{gt}}_i \right\|_2$$
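
Both camera-fidelity terms reduce to simple matrix and vector norms. A minimal sketch following the definitions above (the array shapes are assumptions):

import numpy as np

def camera_errors(R_pred, R_gt, t_pred, t_gt):
    """RotErr: mean Frobenius norm of rotation differences;
    TransErr: mean L2 norm of translation differences.

    R_pred, R_gt: arrays of shape (N, 3, 3); t_pred, t_gt: arrays of shape (N, 3).
    """
    rot_err = np.mean(np.linalg.norm(R_pred - R_gt, ord="fro", axis=(1, 2)))
    trans_err = np.mean(np.linalg.norm(t_pred - t_gt, axis=1))
    return rot_err, trans_err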

6. Usage Protocols and Baseline Methods

Recommended usage involves a two-stage training strategy:

  • Stage I: Train exclusively on ShotWeaver40K real data, learning rate $1 \times 10^{-4}$, 10,000 steps.
  • Stage II: Mix 70% ShotWeaver40K and 30% SynCamVideo synthetic data, learning rate $5 \times 10^{-5}$, 3,000 steps (a minimal configuration sketch follows this list).
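
A minimal configuration sketch of this two-stage recipe; the dictionary layout and field names are illustrative, while the data mix, learning rates, and step counts follow the protocol above:

TRAINING_STAGES = [
    {   # Stage I: ShotWeaver40K real data only
        "data_mix": {"ShotWeaver40K": 1.0},
        "learning_rate": 1e-4,
        "steps": 10_000,
    },
    {   # Stage II: mix in synthetic multi-camera data
        "data_mix": {"ShotWeaver40K": 0.7, "SynCamVideo": 0.3},
        "learning_rate": 5e-5,
        "steps": 3_000,
    },
]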

Evaluation prompts should use global scene captions, per-shot captions, and explicit camera pose parameters. Baseline performance (test split, Table 1) is tabulated below:

Method TrConf TrAcc Aes Img Sem FVD SemC VisC
Mask²DiT 0.223 0.203 0.596 0.684 0.780 69.49 0.780 0.778
CineTrans 0.798 0.394 0.631 0.691 0.791 71.89 0.792 0.785
StoryDiffusion 0.522 0.581 0.674 0.452 92.21 0.452 0.587
Phantom 0.621 0.618 0.679 0.538 86.61 0.538 0.571
HunyuanVideo 0.470 0.322 0.610 0.616 0.570 69.88 0.570 0.660
Wan2.2 0.217 0.102 0.589 0.620 0.689 69.48 0.689 0.755
SynCamMaster 0.303 0.545 0.618 0.795 72.47 0.795 0.842
ReCamMaster 0.027 0.033 0.549 0.611 71.51
ShotDirector 0.896 0.674 0.637 0.698 0.792 68.45 0.792 0.825

The results show varying levels of controllability, aesthetic fidelity, and editing-pattern consistency, with ShotDirector achieving the strongest transition and semantic scores on this benchmark.

7. Research Significance and Benchmark Role

ShotWeaver40K provides a standard benchmark for developing and evaluating frameworks that address directorial control in multi-shot video generation. By explicitly encoding cinematic transition patterns and detailed camera parameters, the dataset allows systematic analysis of shot-to-shot relationships, transition modeling, and directorial intent. Its balanced taxonomy and structured metadata enable models to go beyond naive sequential consistency, advancing the generation of film-like visual narratives. A plausible implication is that ShotWeaver40K, due to its annotation richness and focus on professional editing language, will facilitate exploration of hierarchical prompt engineering, semantic alignment, and camera control in generative vision research (Wu et al., 11 Dec 2025).
