OneStory: Coherent Multi-Shot Video Framework

Updated 10 December 2025
  • OneStory is a framework for coherent multi-shot video generation that integrates compact memory encoding and adaptive visual conditioning to enhance narrative consistency.
  • It employs a Frame Selection module to extract salient global context and an Adaptive Conditioner to efficiently inject contextual tokens into a diffusion transformer.
  • Experimental evaluations reveal improved inter-shot coherence and semantic alignment, outperforming previous multi-shot video generation methods.

OneStory is a framework for coherent multi-shot video generation that enables stronger narrative consistency across discontinuous but semantically linked video shots. It addresses the limitations of prior multi-shot video generation (MSV) methods, which suffer from weak long-range context modeling due to small temporal windows or single keyframe conditioning. OneStory achieves compact, global cross-shot context modeling, supports scalable narrative synthesis, and leverages pretrained image-to-video (I2V) models for robust visual conditioning. The design introduces two core modules—Frame Selection and Adaptive Conditioner—to efficiently encode and inject salient global memory into a diffusion transformer architecture, supporting both text- and image-conditioned controllable storytelling (An et al., 8 Dec 2025).

1. Problem Definition and Autoregressive Formulation

Let a multi-shot video be defined as V = \{S_1, S_2, \dots, S_N\}, where each shot S_i comprises K RGB frames, S_i = \{f_{i,1}, \dots, f_{i,K}\} with f_{i,j} \in \mathbb{R}^{H \times W \times 3}. Each shot is paired with a referential caption C_i; an (optional) global prompt or conditioning input is denoted \mathcal{T}.

OneStory casts MSV as a sequence of next-shot generation problems. At step t, the model conditions on a compact memory representation \mathbf{M}_{t-1} encoding all previous shots, the global conditioning \mathcal{T}, and the current caption C_t, modeling the conditional distribution

P_\theta(S_t \mid \mathbf{M}_{t-1},\; \mathcal{T},\; C_t)

Generation proceeds in an autoregressive manner:

\hat{S}_t \sim P_\theta(S_t \mid \mathbf{M}_{t-1},\; \mathcal{T},\; C_t)

This design enables accumulation and exploitation of global narrative state for shot-level synthesis.
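As a minimal illustration of this factorization, the sketch below iterates over shot captions and conditions each generation step on the running memory; the generate_shot and update_memory methods are hypothetical placeholders for the conditional diffusion sampler and the memory construction described in the following sections.

```python
# Minimal sketch of the autoregressive next-shot loop. `generate_shot` and
# `update_memory` are hypothetical placeholders, not the released API.

def generate_multi_shot_video(model, captions, global_cond=None):
    """Generate shots S_1..S_N one at a time, each conditioned on a compact
    memory M_{t-1} of all previously generated shots."""
    shots = []
    memory = None  # M_0: no history before the first shot
    for caption in captions:
        # Sample S_t ~ P_theta(S_t | M_{t-1}, T, C_t)
        shot = model.generate_shot(memory=memory,
                                   global_cond=global_cond,
                                   caption=caption)
        shots.append(shot)
        # Fold the new shot into the compact memory M_t for the next step
        memory = model.update_memory(memory, shot)
    return shots
```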

2. Frame Selection Module: Compact Memory Construction

The Frame Selection module builds \mathbf{M}_{t-1} by encoding all prior frames. Each frame f_{i,\tau} (from shots S_1 through S_{t-1}, subsampled at interval f_t) is mapped to a latent code via a 3D-VAE encoder \mathcal{E}:

\mathbf{z}_{i}^{(\tau)} = \mathcal{E}(f_{i,\tau}) \in \mathbb{R}^{N_s \times D_v}

All such codes are concatenated into a memory tensor \mathbf{M} of shape F \times N_s \times D_v, where F = (t-1)\,\frac{K}{f_t}.

To enable context selection, m learnable query tokens \mathbf{Q} attend first to the current caption tokens and then to the projected visual memory. A lightweight projector transforms the frame latents, and attention with these query tokens yields an F \times m relevance score matrix. Averaging over the m queries gives a vector of frame scores \mathbf{S}, and the top-K_{\mathrm{sel}} scored frames are selected:

\widehat{\mathbf{M}} = \mathrm{TopK}(\mathbf{M}, \mathbf{S}, K_{\mathrm{sel}})

This compact subset forms the global memory \mathbf{M}_{t-1} used for subsequent conditioning.
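The PyTorch sketch below illustrates this selection step under simplifying assumptions (mean-pooled frame latents, a single linear projector, dot-product relevance scoring); it is not the exact implementation reported in the paper.

```python
import torch
import torch.nn as nn

class FrameSelection(nn.Module):
    """Sketch of the selection step: m learnable queries attend to the caption
    tokens and then score the F latent frames; the top-K_sel frames form the
    compact memory. Pooling, projector, and scoring choices are assumptions."""

    def __init__(self, d_v, d_model, m_queries=8, k_sel=4, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(m_queries, d_model))
        self.caption_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.frame_proj = nn.Linear(d_v, d_model)  # lightweight projector for frame latents
        self.k_sel = k_sel

    def forward(self, memory, caption_tokens):
        # memory: (F, Ns, Dv) latent codes; caption_tokens: (Lc, d_model)
        frame_feats = self.frame_proj(memory.mean(dim=1))       # (F, d_model)
        q = self.queries.unsqueeze(0)                           # (1, m, d_model)
        cap = caption_tokens.unsqueeze(0)                       # (1, Lc, d_model)
        q, _ = self.caption_attn(q, cap, cap)                   # queries attend to the caption
        # F x m relevance matrix, averaged over queries -> frame scores S
        scores = (frame_feats @ q.squeeze(0).T).mean(dim=-1)    # (F,)
        top_idx = scores.topk(min(self.k_sel, scores.numel())).indices
        return memory[top_idx], scores                          # compact memory, scores S
```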

3. Adaptive Conditioner: Importance-Guided Patchification

Even after selection, \widehat{\mathbf{M}} may be too large for direct transformer input. The Adaptive Conditioner partitions the selected frames, using their ranking in \mathbf{S}, among L_p patchifiers \{\mathcal{P}_\ell\} with varying kernel sizes and strides, assigning high-score frames to fine-granularity patchifiers.

Each group is patchified, projected, and flattened:

\mathbf{C}_\ell = \mathcal{P}_\ell(\widehat{\mathbf{M}}_{\mathcal{I}_\ell}) \in \mathbb{R}^{N_\ell \times D}

Context tokens from all groups are concatenated:

\mathbf{C} = \operatorname{Concat}_{\ell=1}^{L_p}[\mathbf{C}_\ell] \in \mathbb{R}^{N_c \times D}

In the diffusion transformer (DiT) pipeline, these are concatenated with the noisy target-shot latents \mathbf{N}, forming

\mathbf{X} = \operatorname{Concat}[\mathbf{N}, \mathbf{C}]

Transformer blocks then jointly process shot noise and historical context, fostering global–local temporal consistency.
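The following PyTorch sketch illustrates importance-guided patchification with two granularities, assuming the selected latent frames arrive as spatial feature maps and are split evenly by rank; the kernel sizes and the split are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class AdaptiveConditioner(nn.Module):
    """Sketch of importance-guided patchification with L_p = 2 granularities:
    higher-ranked frames pass through a fine patchifier (more tokens per frame),
    lower-ranked frames through a coarse one. Kernel sizes, the even split, and
    the projection dimensions are assumptions."""

    def __init__(self, d_v, d_model):
        super().__init__()
        self.fine = nn.Conv2d(d_v, d_model, kernel_size=2, stride=2)    # fine granularity
        self.coarse = nn.Conv2d(d_v, d_model, kernel_size=4, stride=4)  # coarse granularity

    @staticmethod
    def _flatten(tokens):
        # (n, D, h, w) -> (n * h * w, D) token sequence
        return tokens.flatten(2).transpose(1, 2).reshape(-1, tokens.shape[1])

    def forward(self, selected_frames, scores):
        # selected_frames: (K_sel, Dv, H, W) spatial latents; scores: (K_sel,)
        order = scores.argsort(descending=True)
        half = len(order) // 2
        fine_tokens = self._flatten(self.fine(selected_frames[order[:half]]))
        coarse_tokens = self._flatten(self.coarse(selected_frames[order[half:]]))
        # Context tokens C = Concat_l [C_l]; in the DiT these are concatenated
        # with the noisy target-shot latents: X = concat([N, C]).
        return torch.cat([fine_tokens, coarse_tokens], dim=0)
```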

4. Model Architecture and Training Paradigm

OneStory is built on a pretrained I2V backbone (e.g., Wan2.1), with two principal extensions:

  • A Frame Selection module that operates on the encoded latents of all past shots
  • An Adaptive Conditioner whose output is injected before each transformer block's spatial–temporal attention layers

Fine-tuning is conducted end-to-end, with the newly added modules trained from scratch and all other weights initialized from the I2V base. Three main training objectives are adopted:

  • Diffusion reconstruction loss:

L_{\mathrm{video}} = \mathbb{E}_{S_t,\epsilon}\left[\|\epsilon - \epsilon_\theta(\mathbf{X}, t)\|^2\right]

  • Memory-selector regularization:

L_{\mathrm{mem}} = \frac{1}{F}\sum_{r=1}^{F}(S_r - y_r)^2

with y_r denoting pseudo-labels.

  • (Optional) Semantic contrastive loss, e.g., with CLIP-based alignment.

The joint objective is

\mathcal{L} = L_{\mathrm{video}} + \lambda_{\mathrm{mem}} L_{\mathrm{mem}} + \lambda_c L_{\mathrm{contrast}}

Training incorporates a decoupled conditioning curriculum (uniform sampling, then selector-driven) and "shot inflation," wherein all samples are converted to synthetic three-shot sequences to regularize next-shot generation (An et al., 8 Dec 2025).
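A compact sketch of this joint objective is given below; the loss weights and the treatment of the optional contrastive term as a precomputed scalar are assumptions.

```python
import torch.nn.functional as F

def onestory_loss(eps_pred, eps, frame_scores, pseudo_labels,
                  contrast_loss=None, lambda_mem=0.1, lambda_c=0.1):
    """Sketch of L = L_video + lambda_mem * L_mem + lambda_c * L_contrast.
    Weight values and the handling of the contrastive term are assumptions."""
    l_video = F.mse_loss(eps_pred, eps)              # diffusion reconstruction loss
    l_mem = F.mse_loss(frame_scores, pseudo_labels)  # memory-selector regularization
    loss = l_video + lambda_mem * l_mem
    if contrast_loss is not None:                    # optional semantic contrastive term
        loss = loss + lambda_c * contrast_loss
    return loss
```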

5. Dataset Design and Curation Pipeline

Training leverages a curated dataset of approximately 60,000 human-centric, multi-shot videos. Curation involves:

  • Shot boundary detection using TransNetV2
  • Two-stage referential captioning: initial independent captioning followed by rewriting to induce referential language (e.g., "the same man," "then she moves")
  • Multi-stage filtering: keyword-based safety, CLIP/SigLIP semantic alignment, and DINOv2-based duplicate removal

The final corpus contains 50K two-shot and 10K three-shot sequences, each shot paired with a progressive caption, but no global script.
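The structural sketch below mirrors the ordering of these curation stages; all stage callables are hypothetical placeholders rather than the released tooling.

```python
from typing import Callable, Iterable, List, Tuple

def curate_multishot(
    videos: Iterable,
    detect_shots: Callable,         # e.g. a TransNetV2-based shot-boundary detector
    caption_shot: Callable,         # stage 1: independent per-shot captioning
    rewrite_referential: Callable,  # stage 2: rewrite captions with referential language
    passes_filters: Callable,       # safety keywords, CLIP/SigLIP alignment, DINOv2 dedup
) -> List[Tuple[list, list]]:
    """Structural sketch of the curation pipeline; every stage callable is a
    hypothetical placeholder, and only the ordering of stages follows the text."""
    corpus = []
    for video in videos:
        shots = detect_shots(video)                  # shot boundary detection
        captions = [caption_shot(s) for s in shots]  # independent captions
        captions = rewrite_referential(captions)     # referential rewriting
        if passes_filters(shots, captions):          # multi-stage filtering
            corpus.append((shots, captions))
    return corpus
```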

6. Experimental Results and Ablation Analyses

Quantitative evaluation was performed on 64 six-shot test cases from T2MSV and I2MSV benchmarks, using metrics including inter-shot coherence (character/environment consistency via DINOv2 and YOLO segmentation), semantic alignment (ViCLIP frame-caption scores), and intra-shot quality (subject/background consistency, aesthetic quality, dynamic degree).

Key comparative results (text-conditioned setting):

  • Inter-shot coherence: OneStory 0.5813, outperforms Mask²DiT, StoryDiff+Wan2.1, and Flux+Wan2.1 (next-best 0.5657)
  • Semantic alignment: 0.2389 (best), next-best 0.2253
  • Top performance on all intra-shot metrics

Ablations demonstrate that:

  • Removing Frame Selection reduces character coherence from 0.5813 to 0.5526
  • Removing Adaptive Conditioner further drops it to 0.5465
  • Absence of shot inflation or decoupled conditioning decreases coherence by ≈0.02
  • Even a minimal context (one latent-frame’s worth of tokens) suffices for sizable improvements; increasing context brings diminishing returns

7. Limitations and Prospects

OneStory is validated on sequences up to ~10 shots. Scalability to hundreds of shots may require hierarchical memory or more aggressive compression beyond top-K frame selection. The trade-off between computational cost and context length prompts exploration of dynamic token budgets or retrieval-augmented selectors.

Cross-modal extensions—including audio-visual alignment or LLM-guided multi-agent shot planning—are identified as natural continuations. Advances in memory supervision, such as richer contrastive or graph-based regularizers, are considered promising for enhancing narrative coherence across long temporal horizons (An et al., 8 Dec 2025).
