
ShotWeaver40K Dataset

Updated 17 December 2025
  • ShotWeaver40K is a large-scale benchmark dataset of 40,000 two-shot video clips curated to capture professional film editing patterns and enable controllable multi-shot video generation.
  • It is built with a comprehensive pipeline combining automated shot segmentation, image stitching, CLIP-based filtering, and hierarchical GPT-5-mini annotation for detailed scene- and shot-level descriptions.
  • The dataset supports research with standardized metrics such as the Transition Confidence Score, transition type accuracy (TrAcc), and Fréchet Video Distance, advancing cinematic language modeling and directorial control.

ShotWeaver40K is a large-scale benchmark dataset comprising 40,000 two-shot video clips curated for the task of controllable multi-shot video generation, with explicit focus on film-like editing patterns and directorial transitions. Developed to support the evaluation and training of models seeking fine-grained control over shot composition, camera parameters, and professional editing schemes, ShotWeaver40K encodes the priors of real film-editing within a rigorously annotated structure and serves as the foundation for frameworks such as ShotDirector (Wu et al., 11 Dec 2025).

1. Dataset Composition and Distribution

ShotWeaver40K consists of 40,000 two-shot video clips, resulting in a total of 80,000 individual shots and 40,000 annotated transitions—one per clip. These transitions are evenly distributed among four professional editing patterns: Cut-in, Cut-out, Shot/Reverse-Shot, and Multi-Angle, with each occupying approximately 25% of the dataset. The clips originate from 16,000 full-length films, with careful selection to capture diversity in cinematic style and shot-relation semantics.

Dataset splits are as follows:

  • Training: 32,000 videos (80%)
  • Validation: 4,000 videos (10%)
  • Test: 4,000 videos (10%)
Split        Videos    Cut-in   Cut-out   Shot/Rev   Multi-Angle
Train        32,000    8,000    8,000     8,000      8,000
Validation    4,000    1,000    1,000     1,000      1,000
Test          4,000    1,000    1,000     1,000      1,000

The split definition is formalized as $N_{\mathrm{split}} = \alpha_{\mathrm{split}} \times 40{,}000$, where $(\alpha_{\mathrm{train}}, \alpha_{\mathrm{val}}, \alpha_{\mathrm{test}}) = (0.8, 0.1, 0.1)$.
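
As a quick check of this arithmetic, the split and per-pattern counts in the table above can be reproduced directly (a minimal sketch):

import math

TOTAL_CLIPS = 40_000
SPLIT_FRACTIONS = {"train": 0.8, "val": 0.1, "test": 0.1}
PATTERNS = ["Cut-in", "Cut-out", "Shot/Reverse-Shot", "Multi-Angle"]

for split, alpha in SPLIT_FRACTIONS.items():
    n_split = int(alpha * TOTAL_CLIPS)        # N_split = alpha_split * 40,000
    per_pattern = n_split // len(PATTERNS)    # balanced ~25% per editing pattern
    print(f"{split}: {n_split} clips, {per_pattern} per pattern")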

2. Data Acquisition and Annotation Pipeline

The data acquisition process integrates multiple automated modules and manual annotation stages:

  • Shot Segmentation: TransNetV2 is used to segment raw material from 16,000 films into discrete shots.
  • Stitching: ImageBind (threshold $\theta_{\mathrm{stitch}} = 0.65$) stitches visually similar segments into candidate two-shot sequences.
  • Coarse Filtering imposes constraints: minimum frame rate (24 fps), resolution (≥720p), duration (5–12 s), and a LAION-based aesthetic score computed on frames near the transition.
  • Fine Filtering applies a CLIP-based similarity threshold ($< 0.95$) to eliminate "flash" or non-transitions, and Qwen2-VL semantic-continuity checks to ensure logical sequence flow (see the sketch after this list).
  • Annotation: Hierarchical GPT-5-mini annotates each video at three levels:
    • Scene-level caption
    • Shot-level captions (subject, framing, style cues)
    • Transition type and rationale
  • Camera Parameters: For each shot, intrinsics $K \in \mathbb{R}^{3 \times 3}$ and extrinsics $E = [R \mid t]$, $R \in SO(3)$, $t \in \mathbb{R}^3$ are estimated by VGGT relative to shot 1.
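
The coarse and fine filtering logic can be pictured with a short sketch. This is illustrative only: the clip object, clip_embed, and aesthetic_score are hypothetical placeholders (the actual pipeline uses TransNetV2, ImageBind, CLIP, a LAION aesthetic predictor, and Qwen2-VL), and the aesthetic cutoff aes_min is an assumed value; only the frame-rate, resolution, duration, and 0.95 similarity constraints come from the description above.

import numpy as np

CLIP_SIM_MAX = 0.95        # reject near-duplicate "flash" or false transitions
MIN_FPS, MIN_HEIGHT = 24, 720
MIN_DUR, MAX_DUR = 5.0, 12.0

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_clip(clip, clip_embed, aesthetic_score, aes_min=5.0):
    """Hypothetical coarse + fine filter for one candidate two-shot clip."""
    # Coarse filtering: frame rate, resolution, duration, aesthetics near the cut.
    if clip.fps < MIN_FPS or clip.height < MIN_HEIGHT:
        return False
    if not (MIN_DUR <= clip.duration <= MAX_DUR):
        return False
    if aesthetic_score(clip.frame_before_cut) < aes_min:   # aes_min is an assumed value
        return False
    # Fine filtering: CLIP similarity across the cut must stay below 0.95,
    # otherwise the "transition" is likely a flash/non-transition.
    sim = cosine(clip_embed(clip.frame_before_cut), clip_embed(clip.frame_after_cut))
    return sim < CLIP_SIM_MAX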

3. Data Structure and Metadata Specification

Each video is stored in an MP4 container (H.264, 24 fps, 256×256 px). Optional masks are provided as 8-bit PNG sequences. Metadata per video is encapsulated in JSON format, capturing detailed shot and transition information:

{
  "video_id": "SW_00001",
  "shots": [
    {
      "shot_id": 1,
      "start_frame": 0,
      "end_frame": 75,
      "intrinsics": {"f": 50.0, "cx": 128, "cy": 128},
      "extrinsics": {"R": [[]], "t": []},
      "caption": "Medium shot of protagonist walking forward."
    },
    {
      "shot_id": 2,
      "start_frame": 76,
      "end_frame": 175,
      "intrinsics": {},
      "extrinsics": {},
      "caption": "Cut-in close-up of footwear tapping."
    }
  ],
  "transition": { "type": "Cut-in", "description": "Zoom to feet for emphasis." }
}

The schema formalizes the key fields: $\texttt{intrinsics} = \{f, (c_x, c_y)\}$, $\texttt{extrinsics} = \{R \in \mathbb{R}^{3 \times 3},\ t \in \mathbb{R}^3\}$, and $\texttt{transition.type} \in \{\mathrm{Cut\text{-}in}, \mathrm{Cut\text{-}out}, \mathrm{Shot/Rev}, \mathrm{Multi}\}$.
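
A minimal loading sketch for this metadata, assuming one JSON file per video with the field names shown in the example above (the file name is illustrative):

import json

def load_metadata(path="SW_00001.json"):
    """Load one ShotWeaver40K metadata record (schema as in the example above)."""
    with open(path) as f:
        meta = json.load(f)
    for shot in meta["shots"]:
        focal = shot["intrinsics"].get("f")            # focal length, if provided
        n_frames = shot["end_frame"] - shot["start_frame"] + 1
        print(f"shot {shot['shot_id']}: {n_frames} frames, f={focal}, "
              f"caption={shot['caption']!r}")
    print("transition:", meta["transition"]["type"])
    return meta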

4. Editing Pattern Taxonomy

ShotWeaver40K encodes four canonical editing transitions:

  • Cut-in: Immediate shift to a tighter framing of the same subject.
  • Cut-out: Shift to a wider, more contextual shot.
  • Shot/Reverse-Shot: Alternates between dialogue participants.
  • Multi-Angle: Switches among distinct viewpoints of identical action.

Each pattern appears with frequency $f_i \approx 0.25$. For extensions to longer shot sequences, ShotWeaver40K supports analysis of co-occurrence statistics: let $N_{ij}$ denote the frequency with which transition type $i$ is followed by type $j$; the conditional probability is then $P_{ij} = \frac{N_{ij}}{\sum_k N_{ik}}$. This structure provides a formal basis for learning cinematic language priors and modeling narrative flow.
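
These co-occurrence statistics are straightforward to estimate; a minimal sketch, assuming transition labels for longer multi-shot sequences are available as lists of strings:

import numpy as np

TYPES = ["Cut-in", "Cut-out", "Shot/Rev", "Multi-Angle"]

def transition_matrix(annotated_sequences):
    """Estimate P_ij = N_ij / sum_k N_ik from consecutive transition labels."""
    idx = {t: i for i, t in enumerate(TYPES)}
    counts = np.zeros((len(TYPES), len(TYPES)))
    for seq in annotated_sequences:             # e.g. ["Cut-in", "Shot/Rev", ...]
        for a, b in zip(seq, seq[1:]):
            counts[idx[a], idx[b]] += 1         # N_ij: type i followed by type j
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)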

5. Evaluation Metrics

ShotDirector introduces standardized metrics for quantifying performance in multi-shot video generation using ShotWeaver40K:

  • Transition Confidence Score (TrConf):

$$\mathrm{TrConf} = \max_f \sigma(d_f)$$

where $d_f$ is the TransNetV2 logit at frame $f$.

  • Transition Type Accuracy (TrAcc):

$$\mathrm{TrAcc} = \frac{\#\{\text{correctly classified types}\}}{N_{\mathrm{eval}}}$$

  • Aesthetic Score (Aes): Predicts overall visual appeal using a LAION-trained model.
  • Imaging Quality (Img): MUSIQ image quality metric.
  • Semantic Alignment (Sem):

$$\mathrm{Sem} = \cos\big(E_{\mathrm{ViCLIP}}(\text{video}),\ E_{\mathrm{ViCLIP}}(\text{text})\big)$$

  • Visual Consistency (VisC), measuring cross-shot consistency of subject and background features:

$$\mathrm{VisC} = \frac{1}{2}\left(\cos(S_{\mathrm{subj},1}, S_{\mathrm{subj},2}) + \cos(B_{\mathrm{bkg},1}, B_{\mathrm{bkg},2})\right)$$

  • Camera Parameter Fidelity:

$$\mathrm{RotErr} = \frac{1}{N} \sum_i \left\| R^{\mathrm{pred}}_i - R^{\mathrm{gt}}_i \right\|_F, \qquad \mathrm{TransErr} = \frac{1}{N} \sum_i \left\| t^{\mathrm{pred}}_i - t^{\mathrm{gt}}_i \right\|_2$$
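
Both camera-fidelity terms reduce to simple matrix and vector norms. A minimal sketch following the definitions above (the array shapes are assumptions):

import numpy as np

def camera_errors(R_pred, R_gt, t_pred, t_gt):
    """RotErr: mean Frobenius norm of rotation differences;
    TransErr: mean L2 norm of translation differences.

    R_pred, R_gt: arrays of shape (N, 3, 3); t_pred, t_gt: arrays of shape (N, 3).
    """
    rot_err = np.mean(np.linalg.norm(R_pred - R_gt, ord="fro", axis=(1, 2)))
    trans_err = np.mean(np.linalg.norm(t_pred - t_gt, axis=1))
    return rot_err, trans_err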

6. Usage Protocols and Baseline Methods

Recommended usage involves a two-stage training strategy:

  • Stage I: Train exclusively on ShotWeaver40K real data, learning rate $1 \times 10^{-4}$, 10,000 steps.
  • Stage II: Mix 70% ShotWeaver40K and 30% SynCamVideo synthetic data, learning rate $5 \times 10^{-5}$, 3,000 steps (a minimal configuration sketch follows this list).
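
A minimal configuration sketch of this two-stage recipe; the dictionary layout and field names are illustrative, while the data mix, learning rates, and step counts follow the protocol above:

TRAINING_STAGES = [
    {   # Stage I: ShotWeaver40K real data only
        "data_mix": {"ShotWeaver40K": 1.0},
        "learning_rate": 1e-4,
        "steps": 10_000,
    },
    {   # Stage II: mix in synthetic multi-camera data
        "data_mix": {"ShotWeaver40K": 0.7, "SynCamVideo": 0.3},
        "learning_rate": 5e-5,
        "steps": 3_000,
    },
]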

Evaluation prompts should use global scene captions, per-shot captions, and explicit camera pose parameters. Baseline performance (test split, Table 1) is tabulated below:

Method TrConf TrAcc Aes Img Sem FVD SemC VisC
Mask²DiT 0.223 0.203 0.596 0.684 0.780 69.49 0.780 0.778
CineTrans 0.798 0.394 0.631 0.691 0.791 71.89 0.792 0.785
StoryDiffusion 0.522 0.581 0.674 0.452 92.21 0.452 0.587
Phantom 0.621 0.618 0.679 0.538 86.61 0.538 0.571
HunyuanVideo 0.470 0.322 0.610 0.616 0.570 69.88 0.570 0.660
Wan2.2 0.217 0.102 0.589 0.620 0.689 69.48 0.689 0.755
SynCamMaster 0.303 0.545 0.618 0.795 72.47 0.795 0.842
ReCamMaster 0.027 0.033 0.549 0.611 71.51
ShotDirector 0.896 0.674 0.637 0.698 0.792 68.45 0.792 0.825

The results show varying levels of controllability, aesthetic fidelity, and editing-pattern consistency, with ShotDirector achieving the strongest transition and semantic scores on this benchmark.

7. Research Significance and Benchmark Role

ShotWeaver40K provides a standard benchmark for developing and evaluating frameworks that address directorial control in multi-shot video generation. By explicitly encoding cinematic transition patterns and detailed camera parameters, the dataset allows systematic analysis of shot-to-shot relationships, transition modeling, and directorial intent. Its balanced taxonomy and structured metadata enable models to go beyond naive sequential consistency, advancing the generation of film-like visual narratives. A plausible implication is that ShotWeaver40K, due to its annotation richness and focus on professional editing language, will facilitate exploration of hierarchical prompt engineering, semantic alignment, and camera control in generative vision research (Wu et al., 11 Dec 2025).
