OneStory: Coherent Multi-Shot Video Framework
- OneStory is a framework for coherent multi-shot video generation that integrates compact memory encoding and adaptive visual conditioning to enhance narrative consistency.
- It employs a Frame Selection module to extract salient global context and an Adaptive Conditioner to efficiently inject contextual tokens into a diffusion transformer.
- Experimental evaluations reveal improved inter-shot coherence and semantic alignment, outperforming previous multi-shot video generation methods.
OneStory is a framework for coherent multi-shot video generation that enables stronger narrative consistency across discontinuous but semantically linked video shots. It addresses the limitations of prior multi-shot video generation (MSV) methods, which suffer from weak long-range context modeling due to small temporal windows or single keyframe conditioning. OneStory achieves compact, global cross-shot context modeling, supports scalable narrative synthesis, and leverages pretrained image-to-video (I2V) models for robust visual conditioning. The design introduces two core modules—Frame Selection and Adaptive Conditioner—to efficiently encode and inject salient global memory into a diffusion transformer architecture, supporting both text- and image-conditioned controllable storytelling (An et al., 8 Dec 2025).
1. Problem Definition and Autoregressive Formulation
Let a multi-shot video be defined as $V = \{S_1, \dots, S_N\}$, where each shot $S_i$ comprises $T_i$ RGB frames, $S_i = \{f_{i,1}, \dots, f_{i,T_i}\}$, $f_{i,t} \in \mathbb{R}^{H \times W \times 3}$. Each shot is paired with a referential caption $c_i$; an (optional) global prompt or conditioning input is denoted $g$.
OneStory casts MSV as a sequence of next-shot generation problems. At step $i$, the model conditions on a compact memory representation $M_{<i}$ encoding all previous shots, the global conditioning $g$, and the current caption $c_i$, modeling the conditional distribution $p_\theta(S_i \mid M_{<i}, g, c_i)$. Generation proceeds in an autoregressive manner:

$$
p_\theta(S_{1:N} \mid g, c_{1:N}) = \prod_{i=1}^{N} p_\theta\big(S_i \mid M_{<i},\, g,\, c_i\big).
$$

This design enables accumulation and exploitation of global narrative state for shot-level synthesis.
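The factorization above can be read as a simple generation loop. The sketch below illustrates it; `build_memory` and `generate_shot` are hypothetical placeholders standing in for the Frame Selection module and the conditioned diffusion sampler, not part of any released API.

```python
# Hedged sketch of OneStory-style autoregressive multi-shot generation.
from typing import Callable, List, Optional


def generate_story(
    captions: List[str],            # one referential caption per shot (c_1..c_N)
    global_cond: Optional[object],  # optional global prompt or reference image (g)
    build_memory: Callable,         # M_{<i} = build_memory(previous_shots, caption_i)
    generate_shot: Callable,        # S_i ~ p_theta(S_i | M_{<i}, g, c_i)
) -> List[object]:
    shots: List[object] = []
    for caption in captions:
        # Compact global memory over all previously generated shots.
        memory = build_memory(shots, caption)
        # Next-shot generation conditioned on memory, global input, and caption.
        shot = generate_shot(memory=memory, global_cond=global_cond, caption=caption)
        shots.append(shot)
    return shots
```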
2. Frame Selection Module: Compact Memory Construction
The Frame Selection module builds $M_{<i}$ by encoding all prior frames. Each frame $f$ (from shots $1$ through $i-1$, subsampled at interval $\tau$) is mapped to a latent code via a 3D-VAE encoder $\mathcal{E}$: $z = \mathcal{E}(f)$. All such codes are concatenated into a memory tensor $Z \in \mathbb{R}^{L \times h \times w \times c}$, where $L$ is the number of encoded latent frames.
To enable context selection, $N_q$ learnable query tokens attend first to the current caption tokens, then to the projected visual memory. A lightweight projector transforms the frame latents, and attention between the query tokens and the projected memory yields an $N_q \times L$ relevance score matrix. Averaging over queries gives a vector of frame scores $s \in \mathbb{R}^{L}$, and the top-$K$ scored frames are selected as $\mathcal{F} = \operatorname{TopK}(s, K)$. This compact subset forms the global memory used for subsequent conditioning.
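A minimal PyTorch sketch of this scoring step follows. The query count, the use of scaled dot-product scoring, and the assumption that frame latents are pooled to one feature vector per latent frame are illustrative choices, not the exact released implementation.

```python
import torch
import torch.nn as nn


class FrameSelector(nn.Module):
    """Illustrative frame selector: learnable queries attend to caption tokens,
    then score projected frame latents; the top-K frames are kept."""

    def __init__(self, dim: int = 512, num_queries: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.caption_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.frame_proj = nn.Linear(dim, dim)  # lightweight projector over frame latents

    def forward(self, caption_tokens, frame_feats, top_k: int):
        # caption_tokens: (B, T_text, dim); frame_feats: (B, L, dim), pooled per latent frame (assumption)
        B = caption_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)              # (B, Q, dim)
        q, _ = self.caption_attn(q, caption_tokens, caption_tokens)  # caption-aware queries
        mem = self.frame_proj(frame_feats)                           # (B, L, dim)
        scores = torch.einsum("bqd,bld->bql", q, mem) / mem.size(-1) ** 0.5  # (B, Q, L) relevance matrix
        frame_scores = scores.mean(dim=1)                            # average over queries -> s in R^L
        top_idx = frame_scores.topk(top_k, dim=-1).indices           # indices of the selected frames
        return frame_scores, top_idx


# Usage on dummy data: 16 caption tokens, 40 latent frames, keep top-8.
selector = FrameSelector()
s, idx = selector(torch.randn(1, 16, 512), torch.randn(1, 40, 512), top_k=8)
```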
3. Adaptive Conditioner: Importance-Guided Patchification
Even after selection, $\mathcal{F}$ may be too large for direct transformer input. The Adaptive Conditioner partitions the selected frames, according to their ranking in $s$, among $G$ patchifiers with varying kernel sizes and strides, assigning high-score frames to fine-granularity patchifiers.
Each group $\mathcal{F}_g$ is patchified, projected, and flattened into context tokens $C_g$; the context tokens from all groups are concatenated into $C = [C_1; \dots; C_G]$. In the diffusion transformer (DiT) pipeline, $C$ is concatenated with the noisy target-shot latents $z_t$, forming the joint input sequence $[C; z_t]$.
Transformer blocks then jointly process shot noise and historical context, fostering global–local temporal consistency.
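The following sketch illustrates importance-guided patchification under simplifying assumptions: two granularity groups, per-frame 2D convolutional patchifiers, and a fixed split ratio. Kernel sizes, channel dimensions, and the grouping rule are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveConditioner(nn.Module):
    """Illustrative importance-guided patchification: high-score frames go to a
    fine patchifier (more tokens per frame), lower-score frames to a coarse one."""

    def __init__(self, latent_dim: int = 16, model_dim: int = 512):
        super().__init__()
        self.fine = nn.Conv2d(latent_dim, model_dim, kernel_size=2, stride=2)    # fine granularity
        self.coarse = nn.Conv2d(latent_dim, model_dim, kernel_size=4, stride=4)  # coarse granularity

    def forward(self, frames: torch.Tensor, scores: torch.Tensor, fine_ratio: float = 0.5):
        # frames: (K, C, H, W) selected latent frames; scores: (K,) selection scores from s
        def flat(x):
            # (n, D, h, w) -> (n*h*w, D) flattened token list
            return x.flatten(2).transpose(1, 2).reshape(-1, x.size(1))

        order = scores.argsort(descending=True)
        n_fine = max(1, int(len(order) * fine_ratio))
        fine_tokens = self.fine(frames[order[:n_fine]])      # highest-ranked frames, fine patches
        coarse_tokens = self.coarse(frames[order[n_fine:]])  # remaining frames, coarse patches
        context = torch.cat([flat(fine_tokens), flat(coarse_tokens)], dim=0)
        return context  # concatenated context tokens C for the DiT input sequence


# Usage: 8 selected latent frames of shape 16x32x32 with their scores.
conditioner = AdaptiveConditioner()
ctx = conditioner(torch.randn(8, 16, 32, 32), torch.randn(8))
```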
4. Model Architecture and Training Paradigm
OneStory is architected atop a pretrained I2V backbone (e.g., Wan2.1), with two principal extensions:
- The Frame Selection module operates on the 3D-VAE encodings of all past shots
- The Adaptive Conditioner inserts its output before each transformer block's spatial–temporal attention layers (a block-level sketch follows this list)
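The block-level injection can be pictured as below. The pre-norm attention/MLP structure, the use of plain self-attention in place of factorized spatial-temporal attention, and the stripping of context tokens after attention are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ContextualDiTBlock(nn.Module):
    """Illustrative DiT block: context tokens are prepended to the noisy shot
    tokens before self-attention, then dropped so only shot tokens continue."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, shot_tokens: torch.Tensor, context_tokens: torch.Tensor):
        # shot_tokens: (B, N, D) noisy target-shot latents; context_tokens: (B, M, D)
        x = torch.cat([context_tokens, shot_tokens], dim=1)  # joint sequence [C; z_t]
        attn_out, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + attn_out
        shot = x[:, context_tokens.size(1):]                 # keep only the shot tokens
        return shot + self.mlp(self.norm2(shot))


# Usage: 64 shot tokens attend jointly with 16 injected context tokens.
block = ContextualDiTBlock()
out = block(torch.randn(1, 64, 512), torch.randn(1, 16, 512))
```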
Fine-tuning is conducted end-to-end, initializing all other weights from the I2V base. Three main training objectives are adopted:
- Diffusion reconstruction loss: $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}\big[\lVert \epsilon_\theta(z_t, t, C, g, c_i) - \epsilon \rVert_2^2\big]$
- Memory-selector regularization: $\mathcal{L}_{\mathrm{sel}} = \mathrm{CE}(s, \hat{s})$, with $\hat{s}$ denoting pseudo-labels
- (Optional) Semantic contrastive loss, e.g., with CLIP-based alignment.
The joint objective is $\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{sel}}\,\mathcal{L}_{\mathrm{sel}} + \lambda_{\mathrm{sem}}\,\mathcal{L}_{\mathrm{sem}}$, with $\mathcal{L}_{\mathrm{sem}}$ the optional semantic term. Training incorporates a decoupled conditioning curriculum (uniform sampling, then selector-driven sampling) and "shot inflation," wherein all samples are converted to synthetic three-shot sequences to regularize next-shot generation (An et al., 8 Dec 2025).
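A minimal sketch of assembling this objective is shown below, assuming an epsilon-prediction reconstruction loss, a binary cross-entropy selector loss against pseudo-labels, a cosine-similarity semantic term, and illustrative loss weights; all of these specifics are assumptions.

```python
import torch
import torch.nn.functional as F


def joint_loss(eps_pred, eps_true, frame_scores, pseudo_labels,
               clip_video=None, clip_text=None,
               w_sel: float = 0.1, w_sem: float = 0.05):
    """Illustrative OneStory-style training objective (weights and exact loss
    forms are assumptions for the sketch)."""
    # Diffusion reconstruction loss on the noisy target-shot latents.
    l_diff = F.mse_loss(eps_pred, eps_true)
    # Memory-selector regularization: predicted frame scores vs. float pseudo-labels.
    l_sel = F.binary_cross_entropy_with_logits(frame_scores, pseudo_labels)
    loss = l_diff + w_sel * l_sel
    # Optional semantic contrastive term (e.g., CLIP-style video/text alignment).
    if clip_video is not None and clip_text is not None:
        sim = F.cosine_similarity(clip_video, clip_text, dim=-1)
        loss = loss + w_sem * (1.0 - sim.mean())
    return loss
```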
5. Dataset Design and Curation Pipeline
Training leverages a curated dataset of approximately 60,000 human-centric, multi-shot videos. Curation involves:
- Shot boundary detection using TransNetV2
- Two-stage referential captioning: initial independent captioning followed by rewriting to induce referential language (e.g., "the same man," "then she moves")
- Multi-stage filtering: keyword-based safety, CLIP/SigLIP semantic alignment, and DINOv2-based duplicate removal
The final corpus contains 50K two-shot and 10K three-shot sequences, each shot paired with a progressive caption, but no global script.
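The curation pipeline can be summarized with a hedged sketch; `detect_shots`, `caption_clip`, `rewrite_referential`, and the filter callables are hypothetical stand-ins for TransNetV2, the captioner, and the CLIP/SigLIP/DINOv2-based filters rather than real APIs.

```python
from typing import List


def curate_video(video_path: str,
                 detect_shots, caption_clip, rewrite_referential,
                 passes_safety, semantically_aligned, is_duplicate) -> List[dict]:
    """Illustrative curation pass over one source video; all helper functions
    are hypothetical placeholders for the tools named in the text."""
    shots = detect_shots(video_path)                 # shot boundary detection (TransNetV2)
    captions = [caption_clip(s) for s in shots]      # stage 1: independent per-shot captions
    captions = rewrite_referential(captions)         # stage 2: rewrite with referential language
    curated = []
    for shot, caption in zip(shots, captions):
        if not passes_safety(caption):               # keyword-based safety filter
            continue
        if not semantically_aligned(shot, caption):  # CLIP/SigLIP semantic alignment filter
            continue
        if is_duplicate(shot, curated):              # DINOv2-based duplicate removal
            continue
        curated.append({"shot": shot, "caption": caption})
    return curated
```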
6. Experimental Results and Ablation Analyses
Quantitative evaluation was performed on 64 six-shot test cases from T2MSV and I2MSV benchmarks, using metrics including inter-shot coherence (character/environment consistency via DINOv2 and YOLO segmentation), semantic alignment (ViCLIP frame-caption scores), and intra-shot quality (subject/background consistency, aesthetic quality, dynamic degree).
Key comparative results (text-conditioned setting):
- Inter-shot coherence: OneStory 0.5813, outperforms Mask²DiT, StoryDiff+Wan2.1, and Flux+Wan2.1 (next-best 0.5657)
- Semantic alignment: 0.2389 (best), next-best 0.2253
- Top performance on all intra-shot metrics
Ablations demonstrate that:
- Removing Frame Selection reduces character coherence from 0.5813 to 0.5526
- Removing Adaptive Conditioner further drops it to 0.5465
- Absence of shot inflation or decoupled conditioning decreases coherence by ≈0.02
- Even a minimal context (one latent-frame’s worth of tokens) suffices for sizable improvements; increasing context brings diminishing returns
7. Limitations and Prospects
OneStory is validated on sequences up to ~10 shots. Scalability to hundreds of shots may require hierarchical memory or more aggressive compression beyond top-K frame selection. The trade-off between computational cost and context length prompts exploration of dynamic token budgets or retrieval-augmented selectors.
Cross-modal extensions—including audio-visual alignment or LLM-guided multi-agent shot planning—are identified as natural continuations. Advances in memory supervision, such as richer contrastive or graph-based regularizers, are considered promising for enhancing narrative coherence across long temporal horizons (An et al., 8 Dec 2025).