OneStory: Coherent Multi-Shot Video Framework
- OneStory is a framework for coherent multi-shot video generation that integrates compact memory encoding and adaptive visual conditioning to enhance narrative consistency.
- It employs a Frame Selection module to extract salient global context and an Adaptive Conditioner to efficiently inject contextual tokens into a diffusion transformer.
- Experimental evaluations reveal improved inter-shot coherence and semantic alignment, outperforming previous multi-shot video generation methods.
OneStory is a framework for coherent multi-shot video generation that enables stronger narrative consistency across discontinuous but semantically linked video shots. It addresses the limitations of prior multi-shot video generation (MSV) methods, which suffer from weak long-range context modeling due to small temporal windows or single keyframe conditioning. OneStory achieves compact, global cross-shot context modeling, supports scalable narrative synthesis, and leverages pretrained image-to-video (I2V) models for robust visual conditioning. The design introduces two core modules—Frame Selection and Adaptive Conditioner—to efficiently encode and inject salient global memory into a diffusion transformer architecture, supporting both text- and image-conditioned controllable storytelling (An et al., 8 Dec 2025).
1. Problem Definition and Autoregressive Formulation
Let a multi-shot video be defined as $V = \{S_1, \dots, S_N\}$, where each shot $S_i$ comprises $T_i$ RGB frames, $S_i = \{f_{i,1}, \dots, f_{i,T_i}\}$, $f_{i,t} \in \mathbb{R}^{H \times W \times 3}$. Each shot is paired with a referential caption $c_i$; an (optional) global prompt or conditioning input is denoted $g$.
OneStory casts MSV as a sequence of next-shot generation problems. At step $i$, the model conditions on a compact memory representation $M_{<i}$ encoding all previous shots, the global conditioning $g$, and the current caption $c_i$, modeling the conditional distribution $p_\theta(S_i \mid M_{<i}, g, c_i)$. Generation proceeds in an autoregressive manner:

$$
p_\theta(S_{1:N} \mid g, c_{1:N}) = \prod_{i=1}^{N} p_\theta\big(S_i \mid M_{<i},\, g,\, c_i\big).
$$

This design enables accumulation and exploitation of global narrative state for shot-level synthesis.
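The factorization above can be read as a simple generation loop. The sketch below illustrates it; `build_memory` and `generate_shot` are hypothetical placeholders standing in for the Frame Selection module and the conditioned diffusion sampler, not part of any released API.

```python
# Hedged sketch of OneStory-style autoregressive multi-shot generation.
from typing import Callable, List, Optional


def generate_story(
    captions: List[str],            # one referential caption per shot (c_1..c_N)
    global_cond: Optional[object],  # optional global prompt or reference image (g)
    build_memory: Callable,         # M_{<i} = build_memory(previous_shots, caption_i)
    generate_shot: Callable,        # S_i ~ p_theta(S_i | M_{<i}, g, c_i)
) -> List[object]:
    shots: List[object] = []
    for caption in captions:
        # Compact global memory over all previously generated shots.
        memory = build_memory(shots, caption)
        # Next-shot generation conditioned on memory, global input, and caption.
        shot = generate_shot(memory=memory, global_cond=global_cond, caption=caption)
        shots.append(shot)
    return shots
```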
2. Frame Selection Module: Compact Memory Construction
The Frame Selection module builds $M_{<i}$ by encoding all prior frames. Each frame $f$ (from shots $1$ through $i-1$, subsampled at interval $\tau$) is mapped to a latent code via a 3D-VAE encoder $\mathcal{E}$: $z = \mathcal{E}(f)$. All such codes are concatenated into a memory tensor $Z \in \mathbb{R}^{L \times h \times w \times c}$, where $L$ is the number of encoded latent frames.
To enable context selection, $N_q$ learnable query tokens attend first to the current caption tokens, then to the projected visual memory. A lightweight projector transforms the frame latents, and attention between the query tokens and the projected memory yields an $N_q \times L$ relevance score matrix. Averaging over queries gives a vector of frame scores $s \in \mathbb{R}^{L}$, and the top-$K$ scored frames are selected as $\mathcal{F} = \operatorname{TopK}(s, K)$. This compact subset forms the global memory used for subsequent conditioning.
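A minimal PyTorch sketch of this scoring step follows. The query count, the use of scaled dot-product scoring, and the assumption that frame latents are pooled to one feature vector per latent frame are illustrative choices, not the exact released implementation.

```python
import torch
import torch.nn as nn


class FrameSelector(nn.Module):
    """Illustrative frame selector: learnable queries attend to caption tokens,
    then score projected frame latents; the top-K frames are kept."""

    def __init__(self, dim: int = 512, num_queries: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.caption_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.frame_proj = nn.Linear(dim, dim)  # lightweight projector over frame latents

    def forward(self, caption_tokens, frame_feats, top_k: int):
        # caption_tokens: (B, T_text, dim); frame_feats: (B, L, dim), pooled per latent frame (assumption)
        B = caption_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)              # (B, Q, dim)
        q, _ = self.caption_attn(q, caption_tokens, caption_tokens)  # caption-aware queries
        mem = self.frame_proj(frame_feats)                           # (B, L, dim)
        scores = torch.einsum("bqd,bld->bql", q, mem) / mem.size(-1) ** 0.5  # (B, Q, L) relevance matrix
        frame_scores = scores.mean(dim=1)                            # average over queries -> s in R^L
        top_idx = frame_scores.topk(top_k, dim=-1).indices           # indices of the selected frames
        return frame_scores, top_idx


# Usage on dummy data: 16 caption tokens, 40 latent frames, keep top-8.
selector = FrameSelector()
s, idx = selector(torch.randn(1, 16, 512), torch.randn(1, 40, 512), top_k=8)
```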
3. Adaptive Conditioner: Importance-Guided Patchification
Even after selection, $\mathcal{F}$ may be too large for direct transformer input. The Adaptive Conditioner partitions the selected frames, according to their ranking in $s$, among $G$ patchifiers with varying kernel sizes and strides, assigning high-score frames to fine-granularity patchifiers.
Each group $\mathcal{F}_g$ is patchified, projected, and flattened into context tokens $C_g$; the context tokens from all groups are concatenated into $C = [C_1; \dots; C_G]$. In the diffusion transformer (DiT) pipeline, $C$ is concatenated with the noisy target-shot latents $z_t$, forming the joint input sequence $[C; z_t]$.
Transformer blocks then jointly process shot noise and historical context, fostering global–local temporal consistency.
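The following sketch illustrates importance-guided patchification under simplifying assumptions: two granularity groups, per-frame 2D convolutional patchifiers, and a fixed split ratio. Kernel sizes, channel dimensions, and the grouping rule are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveConditioner(nn.Module):
    """Illustrative importance-guided patchification: high-score frames go to a
    fine patchifier (more tokens per frame), lower-score frames to a coarse one."""

    def __init__(self, latent_dim: int = 16, model_dim: int = 512):
        super().__init__()
        self.fine = nn.Conv2d(latent_dim, model_dim, kernel_size=2, stride=2)    # fine granularity
        self.coarse = nn.Conv2d(latent_dim, model_dim, kernel_size=4, stride=4)  # coarse granularity

    def forward(self, frames: torch.Tensor, scores: torch.Tensor, fine_ratio: float = 0.5):
        # frames: (K, C, H, W) selected latent frames; scores: (K,) selection scores from s
        def flat(x):
            # (n, D, h, w) -> (n*h*w, D) flattened token list
            return x.flatten(2).transpose(1, 2).reshape(-1, x.size(1))

        order = scores.argsort(descending=True)
        n_fine = max(1, int(len(order) * fine_ratio))
        fine_tokens = self.fine(frames[order[:n_fine]])      # highest-ranked frames, fine patches
        coarse_tokens = self.coarse(frames[order[n_fine:]])  # remaining frames, coarse patches
        context = torch.cat([flat(fine_tokens), flat(coarse_tokens)], dim=0)
        return context  # concatenated context tokens C for the DiT input sequence


# Usage: 8 selected latent frames of shape 16x32x32 with their scores.
conditioner = AdaptiveConditioner()
ctx = conditioner(torch.randn(8, 16, 32, 32), torch.randn(8))
```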
4. Model Architecture and Training Paradigm
OneStory is architected atop a pretrained I2V backbone (e.g., Wan2.1), with two principal extensions:
- The Frame Selection module operates on the 3D-VAE encodings of all past shots
- The Adaptive Conditioner inserts its output before each transformer block's spatial–temporal attention layers (a block-level sketch follows this list)
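The block-level injection can be pictured as below. The pre-norm attention/MLP structure, the use of plain self-attention in place of factorized spatial-temporal attention, and the stripping of context tokens after attention are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ContextualDiTBlock(nn.Module):
    """Illustrative DiT block: context tokens are prepended to the noisy shot
    tokens before self-attention, then dropped so only shot tokens continue."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, shot_tokens: torch.Tensor, context_tokens: torch.Tensor):
        # shot_tokens: (B, N, D) noisy target-shot latents; context_tokens: (B, M, D)
        x = torch.cat([context_tokens, shot_tokens], dim=1)  # joint sequence [C; z_t]
        attn_out, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + attn_out
        shot = x[:, context_tokens.size(1):]                 # keep only the shot tokens
        return shot + self.mlp(self.norm2(shot))


# Usage: 64 shot tokens attend jointly with 16 injected context tokens.
block = ContextualDiTBlock()
out = block(torch.randn(1, 64, 512), torch.randn(1, 16, 512))
```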
Fine-tuning is conducted end-to-end, initializing all other weights from the I2V base. Three main training objectives are adopted:
- Diffusion reconstruction loss: $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}\big[\lVert \epsilon_\theta(z_t, t, C, g, c_i) - \epsilon \rVert_2^2\big]$
- Memory-selector regularization: $\mathcal{L}_{\mathrm{sel}} = \mathrm{CE}(s, \hat{s})$, with $\hat{s}$ denoting pseudo-labels
- (Optional) Semantic contrastive loss, e.g., with CLIP-based alignment.
The joint objective is $\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{sel}}\,\mathcal{L}_{\mathrm{sel}} + \lambda_{\mathrm{sem}}\,\mathcal{L}_{\mathrm{sem}}$, with $\mathcal{L}_{\mathrm{sem}}$ the optional semantic term. Training incorporates a decoupled conditioning curriculum (uniform sampling, then selector-driven sampling) and "shot inflation," wherein all samples are converted to synthetic three-shot sequences to regularize next-shot generation (An et al., 8 Dec 2025).
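A minimal sketch of assembling this objective is shown below, assuming an epsilon-prediction reconstruction loss, a binary cross-entropy selector loss against pseudo-labels, a cosine-similarity semantic term, and illustrative loss weights; all of these specifics are assumptions.

```python
import torch
import torch.nn.functional as F


def joint_loss(eps_pred, eps_true, frame_scores, pseudo_labels,
               clip_video=None, clip_text=None,
               w_sel: float = 0.1, w_sem: float = 0.05):
    """Illustrative OneStory-style training objective (weights and exact loss
    forms are assumptions for the sketch)."""
    # Diffusion reconstruction loss on the noisy target-shot latents.
    l_diff = F.mse_loss(eps_pred, eps_true)
    # Memory-selector regularization: predicted frame scores vs. float pseudo-labels.
    l_sel = F.binary_cross_entropy_with_logits(frame_scores, pseudo_labels)
    loss = l_diff + w_sel * l_sel
    # Optional semantic contrastive term (e.g., CLIP-style video/text alignment).
    if clip_video is not None and clip_text is not None:
        sim = F.cosine_similarity(clip_video, clip_text, dim=-1)
        loss = loss + w_sem * (1.0 - sim.mean())
    return loss
```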
5. Dataset Design and Curation Pipeline
Training leverages a curated dataset of approximately 60,000 human-centric, multi-shot videos. Curation involves:
- Shot boundary detection using TransNetV2
- Two-stage referential captioning: initial independent captioning followed by rewriting to induce referential language (e.g., "the same man," "then she moves")
- Multi-stage filtering: keyword-based safety, CLIP/SigLIP semantic alignment, and DINOv2-based duplicate removal
The final corpus contains 50K two-shot and 10K three-shot sequences, each shot paired with a progressive caption, but no global script.
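The curation pipeline can be summarized with a hedged sketch; `detect_shots`, `caption_clip`, `rewrite_referential`, and the filter callables are hypothetical stand-ins for TransNetV2, the captioner, and the CLIP/SigLIP/DINOv2-based filters rather than real APIs.

```python
from typing import List


def curate_video(video_path: str,
                 detect_shots, caption_clip, rewrite_referential,
                 passes_safety, semantically_aligned, is_duplicate) -> List[dict]:
    """Illustrative curation pass over one source video; all helper functions
    are hypothetical placeholders for the tools named in the text."""
    shots = detect_shots(video_path)                 # shot boundary detection (TransNetV2)
    captions = [caption_clip(s) for s in shots]      # stage 1: independent per-shot captions
    captions = rewrite_referential(captions)         # stage 2: rewrite with referential language
    curated = []
    for shot, caption in zip(shots, captions):
        if not passes_safety(caption):               # keyword-based safety filter
            continue
        if not semantically_aligned(shot, caption):  # CLIP/SigLIP semantic alignment filter
            continue
        if is_duplicate(shot, curated):              # DINOv2-based duplicate removal
            continue
        curated.append({"shot": shot, "caption": caption})
    return curated
```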
6. Experimental Results and Ablation Analyses
Quantitative evaluation was performed on 64 six-shot test cases from T2MSV and I2MSV benchmarks, using metrics including inter-shot coherence (character/environment consistency via DINOv2 and YOLO segmentation), semantic alignment (ViCLIP frame-caption scores), and intra-shot quality (subject/background consistency, aesthetic quality, dynamic degree).
Key comparative results (text-conditioned setting):
- Inter-shot coherence: OneStory 0.5813, outperforms Mask²DiT, StoryDiff+Wan2.1, and Flux+Wan2.1 (next-best 0.5657)
- Semantic alignment: 0.2389 (best), next-best 0.2253
- Top performance on all intra-shot metrics
Ablations demonstrate that:
- Removing Frame Selection reduces character coherence from 0.5813 to 0.5526
- Removing Adaptive Conditioner further drops it to 0.5465
- Absence of shot inflation or decoupled conditioning decreases coherence by ≈0.02
- Even a minimal context (one latent-frame’s worth of tokens) suffices for sizable improvements; increasing context brings diminishing returns
7. Limitations and Prospects
OneStory is validated on sequences up to ~10 shots. Scalability to hundreds of shots may require hierarchical memory or more aggressive compression beyond top-K frame selection. The trade-off between computational cost and context length prompts exploration of dynamic token budgets or retrieval-augmented selectors.
Cross-modal extensions—including audio-visual alignment or LLM-guided multi-agent shot planning—are identified as natural continuations. Advances in memory supervision, such as richer contrastive or graph-based regularizers, are considered promising for enhancing narrative coherence across long temporal horizons (An et al., 8 Dec 2025).