- The paper presents a novel framework that extends single-shot T2V models to support narrative multi-shot video generation with explicit positional control.
- It introduces two RoPE variants that enable precise shot transitions and spatiotemporal grounding, improving control over subject motion and scene consistency.
- Experimental results show improved text-video alignment, inter-shot coherence, and measurable gains in narrative quality and visual performance.
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Framework Overview and Motivations
MultiShotMaster is a framework for controllable multi-shot video generation that extends pretrained single-shot text-to-video (T2V) architectures with novel positional encoding mechanisms and spatiotemporal grounding strategies. The key motivation is to bridge the gap between existing single-shot video generation approaches, which rely heavily on text prompts and lack narrative or structural coherence across multiple shots, and practical content-creation workflows, which require precise control over shot boundaries, subject appearances, and scene consistency.
A central architectural contribution is the manipulation and extension of Rotary Position Embeddings (RoPE) in the transformer attention modules. The framework introduces two RoPE variants:
- Multi-Shot Narrative RoPE: Implements explicit phase shifts at shot boundaries, allowing flexible arrangement and precise control of shot transitions without introducing additional trainable parameters or interfering with pretrained attention dynamics (see the sketch after this list).
- Spatiotemporal Position-Aware RoPE: Enables fine-grained control by injecting reference tokens (subject images, background images) at specified spatiotemporal positions via region-specific RoPE application, facilitating customized subject motion and scene consistency.
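To make the narrative variant concrete, here is a minimal sketch of how temporal RoPE indices could be offset by a fixed phase shift at every shot boundary; the function names and the `phase_shift` value are illustrative assumptions, not taken from the paper.

```python
import torch

def narrative_temporal_positions(frames_per_shot, phase_shift=16.0):
    """Sketch: temporal RoPE indices with an explicit phase shift at every shot
    boundary, offsetting later shots in position space without any trainable
    parameters. `phase_shift` is a hypothetical hyperparameter, not a value
    from the paper."""
    positions, offset = [], 0.0
    for n_frames in frames_per_shot:
        positions.append(torch.arange(n_frames, dtype=torch.float32) + offset)
        offset += n_frames + phase_shift  # jump past this shot plus the boundary gap
    return torch.cat(positions)

def rope_angles(positions, dim=64, base=10000.0):
    """Standard RoPE angle table for the temporal axis of a 3D RoPE."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions, inv_freq)  # (num_frames, dim // 2)

# Example: three shots of 8, 12, and 8 latent frames
angles = rope_angles(narrative_temporal_positions([8, 12, 8]))
```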
Figure 1: Overview of MultiShotMaster architecture, highlighting the integration of narrative and position-aware RoPE variants and the multi-reference attention mask for advanced controllability.
To further optimize inter-shot information flow and reference injection, the framework utilizes a hierarchical prompt structure (global caption and per-shot captions), a multi-reference attention mask, and a hierarchical multi-shot data annotation pipeline.
Data Curation and Annotation Process
The quality and consistency of training data are critical for narrative and controllable multi-shot video generation. MultiShotMaster incorporates an automated data pipeline involving:
- Shot boundary detection: Using TransNet V2 to segment lengthy source videos into discrete shots.
- Scene clustering and sampling: Aggregating shots into scenes based on storyline coherence and sampling multi-shot sequences with prioritized diversity.
- Hierarchical caption generation: Employing Gemini-2.5 for detailed global and per-shot captioning, enabling precise cross-shot subject, background, and action annotations.
- Reference extraction: Integrating YOLOv11, ByteTrack, and SAM for subject tracking/segmentation, augmented with Gemini-2.5 for cross-shot identity grouping and OmniEraser for clean background extraction.
Figure 2: Multi-stage data curation pipeline, enabling robust annotation of shots, subjects, backgrounds, and hierarchical captions for training.
This pipeline ensures that the model is exposed to richly annotated, context-aware training samples, supporting the controllability objectives of the framework.
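As a rough illustration of the pipeline's output, the sketch below defines a hypothetical schema for one annotated training sample; the class and field names are assumptions for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ShotAnnotation:
    """Hypothetical per-shot record; field names are illustrative only."""
    frame_range: Tuple[int, int]                          # boundaries from shot detection (TransNet V2)
    caption: str                                          # per-shot caption (Gemini-2.5)
    subject_ids: List[str] = field(default_factory=list)  # cross-shot identity labels
    background_path: str = ""                             # clean background plate (OmniEraser)

@dataclass
class MultiShotSample:
    """One training sample: a scene-level caption plus its annotated shots."""
    global_caption: str
    shots: List[ShotAnnotation]
    subject_crops: dict = field(default_factory=dict)     # identity id -> reference image paths

# Toy example of a two-shot sample
sample = MultiShotSample(
    global_caption="A chef prepares dinner, then serves it at the table.",
    shots=[
        ShotAnnotation((0, 96), "Close-up of the chef chopping vegetables.", ["chef"]),
        ShotAnnotation((97, 192), "Wide shot of the chef serving plates.", ["chef"]),
    ],
)
```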
Spatiotemporal Grounded Reference Injection
MultiShotMaster enables in-context injection of reference images (for both subjects and backgrounds) and explicit motion control by associating reference tokens with specific spatiotemporal regions in the latent space. Through specialized adaptations of 3D RoPE, reference tokens are assigned position indices corresponding to target bounding boxes at designated time steps, biasing attention toward the semantically aligned spatial regions.
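A minimal sketch of this grounding idea is given below: the reference image's token grid is assigned the (t, y, x) position indices of a target bounding box at a chosen latent time step. The function name, box convention, and grid size are assumptions, not the paper's implementation.

```python
import torch

def reference_token_positions(bbox, t_index, token_grid=(4, 4)):
    """Sketch: give a reference image's token grid the (t, y, x) RoPE indices
    of a target bounding box at a designated latent time step.
    `bbox` = (y0, x0, y1, x1) in latent-grid coordinates; names are illustrative."""
    y0, x0, y1, x1 = bbox
    th, tw = token_grid
    ys = torch.linspace(y0, y1, th)            # spread reference rows over the box height
    xs = torch.linspace(x0, x1, tw)            # spread reference cols over the box width
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    tt = torch.full_like(yy, float(t_index))   # anchor at the designated time step
    return torch.stack([tt, yy, xx], dim=-1).reshape(-1, 3)  # (th*tw, 3) position ids

# Example: ground a 4x4 reference token grid to a box at latent time step 5
pos = reference_token_positions(bbox=(10.0, 20.0, 18.0, 32.0), t_index=5)
```

For trajectory control (described next), the same mapping would simply be repeated for each keyframe box, yielding one position set per subject-token copy.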
To control subject trajectories, multiple copies of subject tokens are generated, each with unique spatiotemporal RoPE encoding. The value components are averaged post-attention across copies, allowing precise modeling of dynamic character movements within and across shots. Background tokens are handled analogously, supporting shot-level scene customization and multi-shot scene consistency.
The concatenation of multiple reference images and shot-specific video latents can dramatically expand context length, leading to unnecessary interactions and computational overhead. To address this, MultiShotMaster implements a multi-reference attention mask that selectively allocates attention. Each shot accesses only its associated reference tokens while maintaining global attention among video tokens for overall inter-shot coherence. This architectural choice ensures that context leakage is minimized and reference injection is locally precise while still supporting holistic narrative consistency.
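A minimal sketch of such a mask is shown below, assuming a token layout in which all video tokens come first (shot by shot) followed by all reference tokens; the function name and layout are assumptions rather than the paper's implementation.

```python
import torch

def multi_reference_mask(shot_video_lens, shot_ref_lens):
    """Sketch of a multi-reference attention mask: video tokens attend to all
    video tokens (global inter-shot attention), while each shot's video tokens
    see only that shot's reference tokens."""
    n_vid = sum(shot_video_lens)
    n = n_vid + sum(shot_ref_lens)
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Global attention among video tokens across all shots.
    mask[:n_vid, :n_vid] = True

    vid_start, ref_start = 0, n_vid
    for v_len, r_len in zip(shot_video_lens, shot_ref_lens):
        v_slice = slice(vid_start, vid_start + v_len)
        r_slice = slice(ref_start, ref_start + r_len)
        mask[v_slice, r_slice] = True   # shot video -> its own references
        mask[r_slice, v_slice] = True   # references -> their shot's video
        mask[r_slice, r_slice] = True   # references attend among themselves
        vid_start += v_len
        ref_start += r_len
    return mask  # True = attention allowed

# Example: two shots with 1560 and 1170 video tokens, 64 reference tokens each
mask = multi_reference_mask([1560, 1170], [64, 64])
```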
Training Regime and Inference Architecture
The training paradigm is multi-stage:
- Stage 1: Fine-tuning for spatiotemporal reference injection on single-shot videos (300k samples).
- Stage 2: Joint training on curated multi-shot data, updating the temporal-attention, cross-attention, and FFN modules.
- Stage 3: Subject-focused post-training, applying an increased loss weight to subject regions to enforce robust identity consistency across shots (a sketch of this reweighting follows the list).
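A minimal sketch of the Stage-3 idea under simple assumptions: a binary subject mask reweights a per-element reconstruction loss, with an illustrative weight value (the paper's exact weighting scheme is not reproduced here).

```python
import torch

def subject_weighted_loss(pred, target, subject_mask, subject_weight=2.0):
    """Sketch of subject-region reweighting: up-weight the denoising loss inside
    subject regions. `subject_mask` is 1 inside subject boxes/segmentation masks
    and 0 elsewhere; `subject_weight` is a hypothetical value."""
    weights = 1.0 + (subject_weight - 1.0) * subject_mask
    return (weights * (pred - target) ** 2).mean()

# Example with dummy tensors shaped (batch, channels, frames, height, width)
pred = torch.randn(1, 4, 8, 16, 16)
target = torch.randn_like(pred)
mask = torch.zeros(1, 1, 8, 16, 16)
mask[..., 4:12, 4:12] = 1.0  # pretend the subject occupies this region
loss = subject_weighted_loss(pred, target, mask)
```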
At inference, the model supports flexible shot counts and durations, hierarchical prompt parsing, and simultaneous controllable injection of text prompts, subject references, background references, and grounding signals.
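To make this interface concrete, a hypothetical inference request might bundle these controls as follows; the keys, values, and normalized box format are purely illustrative and not the released API.

```python
# Hypothetical inference request; structure is an assumption for illustration.
request = {
    "global_caption": "A detective follows a suspect through a rainy city at night.",
    "shots": [
        {"caption": "Wide shot: the detective watches from across the street.",
         "num_frames": 49,
         "subjects": [{"ref_image": "detective.png",
                       "boxes": {0: (0.55, 0.10, 0.95, 0.35)}}],  # frame -> (y0, x0, y1, x1), normalized
         "background": "street_night.png"},
        {"caption": "Close-up: the suspect glances back over his shoulder.",
         "num_frames": 33,
         "subjects": [{"ref_image": "suspect.png",
                       "boxes": {0: (0.20, 0.30, 0.90, 0.70)}}]},
    ],
}
```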
Experimental Results and Quantitative Analysis
MultiShotMaster reports strong quantitative and qualitative results across its evaluation suite, with improvements in text-video alignment, inter-shot coherence, and overall narrative and visual quality.
Notably, the method maintains cross-shot subject appearance and scene attributes, even in cases where backgrounds occupy small regions or subjects have complex trajectory specifications.
Limitations and Directions for Future Research
Despite these advantages, several limitations remain; most notably, camera motion is not yet controllable independently of subject and scene content, and the framework has not been demonstrated on higher-resolution, larger-scale base models.
Future directions therefore include integrating independent camera control mechanisms, scaling the approach to higher-resolution large-scale models, and exploring finer-grained motion and scene compositionality.
Conclusion
MultiShotMaster establishes an effective paradigm for controllable, narrative multi-shot video generation by extending transformer-based T2V models with novel RoPE variants, robust data annotation, and context-aware attention mechanisms. The strong results on text alignment, cross-shot consistency, and user-driven controllability suggest significant potential for practical film and media generation workflows. The framework's approach to spatiotemporally grounded reference injection is extensible to broader multimodal generative tasks, and further research into disentangling camera and object motion could catalyze the next generation of AI-driven cinematic content creation.