
OneStory: Multi-Shot Video Generation

Updated 11 December 2025
  • OneStory is a neural framework for multi-shot video generation that models narrative continuity across semantically linked clips via autoregressive next-shot synthesis.
  • It employs a learned frame selection module and adaptive context conditioning to efficiently extract and integrate critical visual memory for coherent story progression.
  • Empirical results on a curated 60K video dataset demonstrate enhanced inter-shot coherence and semantic alignment compared to fixed-window and keyframe-based methods.

OneStory is a neural framework for coherent multi-shot video generation (MSV) that models narrative continuity across multiple semantically linked video clips ("shots"). It formulates MSV as an autoregressive next-shot synthesis problem, leveraging pretrained image-to-video (I2V) transformers, a learned frame selection mechanism for constructing compact global memories, and adaptive context composition for efficient visual conditioning. OneStory achieves state-of-the-art results on a curated 60K multi-shot dataset with referential captions, enabling controllable and immersive long-form video storytelling by overcoming the context-tracking limitations of fixed-window and keyframe-based alternatives (An et al., 8 Dec 2025).

1. Challenges in Multi-Shot Video Generation

Multi-shot video generation centers on synthesizing a sequence $\{S_1, \ldots, S_N\}$ of discontinuous yet semantically coherent shots. Each shot may differ in composition, viewpoint, or time, but the narrative requires persistent entities (characters, objects, environments) and consistent storyline evolution. The technical challenges include:

  • Long-range Cross-Shot Context Modeling: Entities must persist or reappear even when callbacks cross many intervening shots; maintaining consistent identity, spatial configuration, or world state is essential.
  • Narrative Consistency vs. Shot Diversity: The model must discern which elements remain fixed (character, layout) and which are allowed—or expected—to evolve (action, camera movement, visual style).
  • Deficiencies of Prior Art: Fixed-window attention mechanisms lose important early context as the window slides, causing memory loss and narrative breakage. Keyframe-conditioned pipelines compress each shot's context into a single frame and therefore cannot encode evolving relationships or long-range dependencies, particularly for entity recurrence or foreshadowing (An et al., 8 Dec 2025).

2. Next-Shot Autoregressive Paradigm

OneStory recasts MSV as an autoregressive next-shot generation task. Given the previous shots $\{S_j\}_{j<i}$ and the current caption $C_i$, it synthesizes $S_i$ according to the generative process:

$$p(S_1, \ldots, S_N \mid C_1, \ldots, C_N) = \prod_{i=1}^{N} p(S_i \mid S_{<i}, C_i)$$

Key system components:

  • 3D-VAE Encoder ($E$): Each shot $S_j$ is encoded into latent tensors $z_j \in \mathbb{R}^{T' \times H' \times W' \times D_v}$.
  • Text Encoder ($T$): Captions $C_i$ are mapped via a CLIP/LLaMA-based encoder to token embeddings $t_i \in \mathbb{R}^{L \times D_t}$.
  • Diffusion Transformer ($\mathcal{G}$): Fine-tuned for next-shot conditioning, this transformer predicts $S_i$ conditioned on latents, text, and selected contextual memories.

This paradigm allows fully global cross-shot context, supporting complex, multi-stage narratives and robust context-dependent entity reappearance (An et al., 8 Dec 2025).
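
The factorization above reduces to a simple generation loop. The following is a minimal sketch, in which `text_encoder`, `generator`, and `decoder` are hypothetical callables standing in for the caption encoder, the next-shot diffusion transformer (with frame selection and adaptive conditioning inside), and the 3D-VAE decoder; they are not the released OneStory API.

```python
def generate_story(captions, text_encoder, generator, decoder):
    """Sketch of autoregressive next-shot synthesis, p(S_i | S_<i, C_i).

    All four callables are hypothetical stand-ins for OneStory components,
    used only to illustrate the next-shot factorization.
    """
    shot_latents = []                                # growing global memory {z_j}_{j<i}
    shots = []
    for caption in captions:
        text_tokens = text_encoder(caption)          # caption embedding t_i
        z_i = generator(memory=shot_latents, text=text_tokens)  # sample latent of shot i
        shot_latents.append(z_i)                     # extend memory for the next shot
        shots.append(decoder(z_i))                   # decode latent to video frames
    return shots
```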

3. Frame Selection: Global Memory Construction

To address scalability and overcompression, OneStory implements a sparsity-driven Frame Selection module:

  • Global Latent Memory ($M$): All previously encoded shot latent frames are stacked into $M \in \mathbb{R}^{F \times N_s \times D_v}$, where $F = (i-1) \cdot T'$ and $N_s = H' \cdot W'$.
  • Learnable Text-Guided Queries: $m$ learnable queries $Q \in \mathbb{R}^{m \times D}$ are updated with caption-guided attention and memory cross-attention, integrating narrative intent.
  • Relevance Scoring & Top-K Selection: Similarity scores $S[f]$ are computed between projected memory frames and query keys, with supervision derived from DINOv2/CLIP similarities to the ground truth. The top $K_{\text{sel}}$ most relevant frames are selected to form a compact context set $\hat{M}$.
  • Supervision & Pseudo-labeling: During training, frame scores are matched against DINOv2/CLIP-based pseudo-labels, supporting robust selection of contextually important frames.

This process enables the system to reference only a small, semantically critical portion of global memory, overcoming full-history inefficiency and keyframe myopia (An et al., 8 Dec 2025).
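
A minimal sketch of this selection step is given below. It assumes pooled per-frame memory tokens and a generic cross-attention stack; the module name, shapes, and hyperparameters are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn


class FrameSelector(nn.Module):
    """Illustrative text-guided frame selection (not the released OneStory code).

    Scores F memory frames against m learnable queries that have attended to
    the caption, then keeps the top-K frames as the compact context set.
    """

    def __init__(self, d_model: int, n_queries: int = 16, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.key_proj = nn.Linear(d_model, d_model)

    def forward(self, memory: torch.Tensor, text: torch.Tensor, k_sel: int):
        # memory: (F, N_s, D) latent frames; text: (L, D) caption tokens
        frame_tokens = memory.mean(dim=1).unsqueeze(0)          # pool to (1, F, D)
        q = self.queries.unsqueeze(0)                           # (1, m, D)
        q, _ = self.text_attn(q, text.unsqueeze(0), text.unsqueeze(0))   # caption-guided
        q, _ = self.mem_attn(q, frame_tokens, frame_tokens)              # memory cross-attn
        keys = self.key_proj(frame_tokens)                      # (1, F, D)
        scores = (q @ keys.transpose(1, 2)).mean(dim=1).squeeze(0)       # per-frame score S[f]
        top = torch.topk(scores, k=min(k_sel, scores.numel())).indices   # top-K_sel frames
        return memory[top], scores                               # compact context + scores
```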

4. Adaptive Conditioning Mechanism

Selected context frames are compressed and structured using the Adaptive Conditioner module:

  • Patchifier Bank: A set of $L_p$ spatial patchifiers, each with a distinct compression ratio/kernel, compresses frames at different granularities.
  • Importance-Guided Assignment: Selected frames are sorted by relevance score $S[f]$; the highest-scoring frames are mapped to the patchifiers with the finest spatial resolution, while less central frames are compressed more aggressively, optimizing the context-token budget.
  • Token Construction and Injection: Each patchifier transforms its assigned frames into tokens $C_\ell$, producing the final context $C = \mathrm{concat}(C_1, \ldots, C_{L_p})$. These tokens are concatenated with the noisy latent tokens of the current shot and routed to the diffusion transformer's attention mechanism.
  • Direct Conditioning: The DiT blocks permit full-attention mixing between noise and context tokens, ensuring comprehensive semantic fusion.

The joint operation of Frame Selection and the Adaptive Conditioner yields $O(K_{\text{sel}})$ context complexity, making global narrative tracking computationally tractable (An et al., 8 Dec 2025).
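
The importance-guided assignment can be sketched as follows. The convolutional patchifiers and their strides stand in for the paper's compression ratios, and the interface is an illustrative assumption rather than the released code.

```python
import torch
import torch.nn as nn


class AdaptiveConditioner(nn.Module):
    """Illustrative importance-guided patchification (assumed interface).

    High-scoring frames are routed to the finest patchifier; low-scoring
    frames are compressed more aggressively, keeping the token budget small.
    """

    def __init__(self, d_latent: int, d_model: int, strides=(1, 2, 4)):
        super().__init__()
        # One patchifier per granularity, finest (stride 1) first.
        self.patchifiers = nn.ModuleList(
            nn.Conv2d(d_latent, d_model, kernel_size=s, stride=s) for s in strides
        )

    def forward(self, frames: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # frames: (K, H, W, D_latent) selected context frames; scores: (K,)
        order = torch.argsort(scores, descending=True)       # most relevant first
        buckets = torch.chunk(order, len(self.patchifiers))   # coarse importance buckets
        tokens = []
        for bucket, patchify in zip(buckets, self.patchifiers):
            if bucket.numel() == 0:
                continue
            x = frames[bucket].permute(0, 3, 1, 2)            # (B, D_latent, H, W)
            t = patchify(x).flatten(2).transpose(1, 2)        # (B, tokens_per_frame, d_model)
            tokens.append(t.reshape(-1, t.shape[-1]))         # flatten over frames
        return torch.cat(tokens, dim=0)                       # context tokens C
```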

5. Architecture and Training Strategies

The OneStory backbone is Wan2.1—a diffusion transformer with a 3D-VAE latent interface. Architectural specifics:

  • Encoder: Pretrained 3D-VAE ($f_t = 10$, $f_s = 8$, $D_v \approx 256$).
  • Diffusion Transformer ($\mathcal{G}$): 48 DiT layers, multi-head attention ($D = 1024$), text and context cross-attention, and U-Net-style spatial blocks.
  • Pretraining / Finetuning: Initialized from Wan2.1 trained on single-shot textual or visual data, then finetuned for one epoch over the 60K curated multi-shot videos using AdamW ($\mathrm{lr} = 5 \times 10^{-4}$, $\mathrm{wd} = 10^{-2}$) on 128 A100 GPUs.

Training strategies under the next-shot regime include:

  • Unified Three-Shot Training: All dataset samples are expanded into shot triplets via cross-video insertion/first-shot augmentation.
  • Two-Stage Curriculum: Warm-up with decoupled (random) conditioning; full coupling thereafter with frame selector active.
  • Objective:

$$L_{\text{total}} = L_{\text{diffusion}}(\mathcal{G}(\ldots); S_{\text{last}}) + \lambda \cdot L_{\text{selection}}(S_{\text{pred}}, y_{\text{pseudo}})$$

where $L_{\text{diffusion}}$ is a rectified-flow diffusion loss and $L_{\text{selection}}$ is a mean-squared error on frame scores (An et al., 8 Dec 2025).
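
A minimal sketch of this combined objective follows; the weight `lam` is an illustrative placeholder for $\lambda$ (the paper's value is not restated here), and the velocity-regression form is one common way to write a rectified-flow loss.

```python
import torch.nn.functional as F


def training_loss(pred_velocity, target_velocity, frame_scores, pseudo_labels, lam=0.1):
    """Sketch of L_total = L_diffusion + lambda * L_selection.

    The generation term regresses the model's predicted velocity toward the
    rectified-flow target on the last shot's noisy latents; the selection
    term is an MSE between predicted frame scores and DINOv2/CLIP pseudo-labels.
    """
    l_diffusion = F.mse_loss(pred_velocity, target_velocity)
    l_selection = F.mse_loss(frame_scores, pseudo_labels)
    return l_diffusion + lam * l_selection
```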

6. Dataset Curation and Description

A new dataset of ~60K multi-shot, human-centric videos supports the OneStory framework:

  • Source: Research-copyright videos.
  • Shot Detection: TransNetV2; clips with ≥2 shots retained.
  • Two-Stage Captioning: Automated captions for each shot by a multimodal LLM, then rewritten for reference continuity (e.g., "the same man," description of transformations, etc.).
  • Quality Control: Unsafe or irrelevant transitions filtered (keyword blocklists, CLIP+SigLIP2 negative checks, DINOv2 for duplicate detection).
  • Statistics: ≈50K two-shot and 10K three-shot sequences; shots are 8–12 frames at 480×832 pixels (An et al., 8 Dec 2025).
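
The curation flow can be summarized as a small pipeline sketch. All four helper callables below are hypothetical stand-ins for TransNetV2 shot detection, the multimodal-LLM captioner, the reference-rewriting pass, and the CLIP/SigLIP2/DINOv2 filters; they are passed in rather than implemented here.

```python
def curate(video_paths, detect_shots, caption_shot, rewrite_with_references, passes_checks):
    """Two-stage curation sketch built from hypothetical helper callables."""
    samples = []
    for path in video_paths:
        shots = detect_shots(path)                       # shot boundary detection
        if len(shots) < 2:                               # keep multi-shot videos only
            continue
        captions = [caption_shot(s) for s in shots]      # stage 1: per-shot captions
        captions = rewrite_with_references(captions)     # stage 2: "the same man", etc.
        if passes_checks(shots, captions):               # blocklists + relevance/duplicate checks
            samples.append({"shots": shots, "captions": captions})
    return samples
```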

7. Empirical Evaluation and Analysis

Metrics

OneStory performance is evaluated using both inter-shot and intra-shot coherence metrics, semantic alignment, and aesthetic/dynamic criteria:

| Metric Type | Metric Name | Measurement Principle |
| --- | --- | --- |
| Inter-shot coherence | Character/environment consistency | DINOv2 cosine similarity (segment/scene persistence) |
| Intra-shot coherence | Subject/background consistency | DINOv2 similarity within a shot |
| Semantic alignment | Caption–shot alignment | ViCLIP cosine similarity |
| Shot quality | Aesthetic quality | Learned aesthetic predictor |
| Shot quality | Dynamic degree | Optical flow magnitude |
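
As an illustration, an inter-shot coherence score can be computed as the mean pairwise cosine similarity between DINOv2 features of a matched subject across two shots; this sketch assumes that formulation rather than the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F


def cross_shot_consistency(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between two shots' subject features.

    feats_a, feats_b: per-frame DINOv2 embeddings of the same character or
    environment in two different shots, each of shape (num_frames, dim).
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    return (a @ b.T).mean()
```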

Baselines and Results

Baselines include Mask²DiT (fixed window), StoryDiff + Wan2.1/LTX-Video (keyframe pipelines), and FLUX + Wan2.1 (edit-extend). OneStory demonstrates:

  • Improvements of ≥3–5% in inter-shot coherence and semantic alignment (Text→MSV, Image→MSV) relative to all baselines.
  • Ablations indicate essential roles for both Frame Selection and Adaptive Conditioner, with unified three-shot training and curriculum approaches conferring +2–3% further gains.
  • Qualitative behaviors include correctly reintroducing entities after distractor shots, compositional scene merging, and robust geometric/cinematic transitions (An et al., 8 Dec 2025).

8. Discussion, Limitations, and Future Directions

Adaptive global memory enables $O(K_{\text{sel}})$ context referencing, supporting scalability to longer narratives than previous methods, which scale as $O(F \cdot N_s)$. The selector–conditioner architecture reflects selective recall, akin to human compositional storytelling. Documented limitations include:

  • Autoregressive Length Constraint: Current training stabilizes at three-shot chains; extending to ≥10 shots likely requires explicit long-horizon planners.
  • Absence of Explicit “Global Script”: No explicit mechanism for foreshadowing or multi-episode planning; potential extensions include LLM-driven narrative planners.
  • Lack of Multi-Modal Support: Audio and speech-augmented MSV remain unexplored within OneStory.

A plausible implication is that integrating hierarchical narrative models and multimodal encoders could address current shortcomings and further advance controllable long-form video generation (An et al., 8 Dec 2025).

References (1)
