Long Context Tuning for Video Generation (2503.10589v1)
Abstract: Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
Summary
- The paper introduces Long Context Tuning (LCT), a fine-tuning paradigm that adapts single-shot video diffusion models to generate coherent multi-shot scenes by expanding their temporal context.
- LCT utilizes expanded full attention across all shots, interleaved 3D positional embeddings, and an asynchronous noise strategy for flexible joint or auto-regressive generation.
- This method significantly improves visual and dynamic consistency across shots compared to baselines and enables emergent capabilities like compositional video generation and interactive scene extension.
Addressing Coherence in Multi-Shot Videos
Generating long, visually consistent videos that tell a coherent story or depict a complex event remains a significant challenge. While current video generation models, particularly those using scalable diffusion transformers (DiT), excel at creating high-fidelity single-shot videos up to around a minute long, they often struggle with narrative content. Real-world videos, like films or documentaries, are typically constructed from multiple shots assembled into scenes. Maintaining visual and dynamic consistency across these shots is crucial for viewer comprehension and engagement. Single-shot models often fail to capture this scene-level coherence, resulting in jarring transitions and inconsistent elements between shots. Long Context Tuning (LCT) is a training paradigm developed to address this limitation by adapting pre-trained single-shot video diffusion models to generate coherent multi-shot scenes (2503.10589). LCT achieves this by expanding the model's temporal context window, enabling it to learn scene-level consistency directly from data without introducing new parameters.
1. The Challenge of Multi-Shot Video Generation
The core difficulty lies in maintaining consistency across shot boundaries. A scene might involve the same characters, objects, and environment depicted from different angles or performing sequential actions. A generative model must ensure that:
- Visual Consistency: Characters and objects maintain their appearance across shots. The environment remains stable unless intentionally changed by the narrative.
- Dynamic Consistency: Actions flow logically from one shot to the next. Motion styles and physics should remain plausible across the scene.
- Narrative Coherence: The sequence of shots effectively tells a story or depicts an event in a comprehensible manner.
Traditional video generation techniques, including powerful diffusion models trained on single shots, lack the mechanism to explicitly model these long-range, cross-shot dependencies inherent in scene construction. Generating shots independently often leads to inconsistencies that break the illusion of a unified scene.
2. Implementing Long Context Tuning (LCT)
LCT adapts existing single-shot video diffusion models (like DiTs) by fine-tuning them on multi-shot scene data. The key innovation is enabling the model's attention mechanism to operate over a much longer context window encompassing all shots in a scene simultaneously. This is achieved through several key components:
- Expanded Context via Full Attention: LCT modifies the transformer's attention mechanism to operate across the concatenated sequence of all text prompts and video tokens for all shots within a scene. For a scene with N shots, the token sequence becomes [text1]-[video1]-[text2]-[video2]-...-[textN]-[videoN]. Self-attention is applied globally across this entire sequence, allowing every token to attend to every other token regardless of which shot it belongs to, so the model can directly learn and enforce relationships and dependencies between shots.
- Interleaved 3D Rotary Position Embedding (RoPE): To allow the model to differentiate tokens belonging to different shots while still capturing the spatial (within-frame) and temporal (within-shot) structure, LCT employs an interleaved 3D RoPE. This embedding scheme preserves the relative positional encoding learned by the pre-trained single-shot model within each shot, while assigning distinct positional offsets that differentiate between shots. Conceptually, the token groups for each shot are placed sequentially, so the model knows "this token is in shot 1" versus "this token is in shot 3" while still resolving spatial relationships within, say, frame 5 of shot 3. A minimal sketch of this positional scheme, together with the asynchronous noise strategy described next, follows this list.
- Asynchronous Noise Strategy: During training, instead of applying the same diffusion timestep (noise level) t to all shots in a scene, LCT samples independent timesteps t1,t2,...,tN for each shot. This allows shots with lower noise levels (closer to the clean video) to implicitly act as visual conditions for denoising other shots with higher noise levels within the same attention pass. This strategy elegantly unifies the handling of diffusion samples and visual conditioning inputs without needing separate conditioning networks or mechanisms. During inference, this provides flexibility: setting low noise for some shots turns them into visual conditions for generating the remaining shots. Synchronizing timesteps enables joint generation of all shots.
- Training Procedure: LCT involves fine-tuning a pre-trained single-shot video model on a dataset composed of multi-shot scenes. Each scene is accompanied by a global prompt describing the overall scene and potentially individual prompts for each shot. The model learns to denoise the shots within the scene context, guided by the text prompts and the relationships learned through the expanded attention mechanism.
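To make the positional and noise components concrete, here is a minimal sketch assuming each shot is tokenized into a frames × height × width latent grid. The helper names (`assign_shot_positions`, `sample_async_timesteps`) and the exact offsetting scheme are illustrative assumptions, not the paper's implementation.

```python
import torch

def assign_shot_positions(num_shots, frames_per_shot, height, width):
    # Illustrative interleaved 3D position assignment (an assumption, not the
    # paper's exact scheme): each shot keeps the intra-shot (t, h, w) grid of the
    # single-shot model, but its temporal coordinates are offset so that shots
    # occupy distinct, consecutive ranges along the time axis.
    per_shot = []
    for shot_idx in range(num_shots):
        t = torch.arange(frames_per_shot) + shot_idx * frames_per_shot
        h = torch.arange(height)
        w = torch.arange(width)
        grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)  # (F, H, W, 3)
        per_shot.append(grid.reshape(-1, 3))                                # (F*H*W, 3)
    return torch.cat(per_shot, dim=0)  # one (t, h, w) row per video token in the scene

def sample_async_timesteps(num_shots, num_train_steps=1000):
    # Asynchronous noise strategy: one independent diffusion timestep per shot,
    # so less-noisy shots can implicitly condition noisier ones during training.
    return torch.randint(0, num_train_steps, (num_shots,))

positions = assign_shot_positions(num_shots=4, frames_per_shot=16, height=16, width=16)
timesteps = sample_async_timesteps(num_shots=4)
print(positions.shape, timesteps)  # torch.Size([16384, 3]) and 4 independent timesteps
```

Because the (t, h, w) coordinates feed the same 3D RoPE the single-shot model already uses, this expansion of context does not require any additional parameters.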
3. Attention Variants: Bidirectional vs. Context-Causal
LCT primarily uses a full bidirectional attention mechanism where every token attends to every other token in the entire scene sequence. This allows information to flow freely in both directions, enabling the model to capture complex inter-shot dependencies and facilitating joint generation where all shots are synthesized simultaneously.
However, the LCT model can be further fine-tuned using context-causal attention. This variant modifies the attention mask:
- Attention within a shot remains bidirectional (tokens attend to all others in the same shot).
- Attention across shots becomes causal (tokens in shot i can only attend to tokens in shots j≤i).
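A minimal sketch of how such a mask could be constructed, assuming the token sequence is laid out shot by shot; the function name `context_causal_mask` is hypothetical rather than taken from the paper.

```python
import torch

def context_causal_mask(num_shots, tokens_per_shot):
    # Boolean mask (True = may attend): bidirectional within a shot, causal
    # across shots, so queries in shot i see keys only in shots j <= i.
    # Assumes tokens are ordered shot by shot; illustrative only.
    total = num_shots * tokens_per_shot
    shot_id = torch.arange(total) // tokens_per_shot     # shot index of every token
    return shot_id.unsqueeze(1) >= shot_id.unsqueeze(0)  # (total, total)

mask = context_causal_mask(num_shots=3, tokens_per_shot=4)
# Diagonal shot blocks are fully True (bidirectional); blocks above them are
# False, so shot 1 never attends to shots 2 or 3.
```

A boolean mask like this can be supplied as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`; and because earlier shots never attend to later ones, their key/value tensors can be cached and reused when later shots are generated.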
This structure is particularly advantageous for auto-regressive generation (generating shots sequentially). Key benefits include:
- KV-Caching: The Key and Value projections computed for previous shots can be cached and reused when generating subsequent shots. This significantly reduces redundant computations and speeds up inference, especially for long scenes.
- Improved Historical Fidelity: Enforcing a stricter sequential dependency can sometimes lead to better adherence to the conditions set by previously generated shots.
4. Generating Videos with LCT
LCT supports two primary modes of video generation:
- Joint Generation: All shots in the scene are generated simultaneously in a single pass. This typically uses the bidirectional attention model. It leverages the full context of the entire scene description and relationships between all planned shots.
- Auto-regressive Generation: Shots are generated sequentially, one after another. Shot i is generated conditioned on the previously generated shots 1, ..., i-1. This mode often benefits from the context-causal attention mechanism for efficiency via KV-caching. It allows for interactive workflows where a scene can be extended shot by shot.
The asynchronous noise strategy facilitates conditioning in both modes. For instance, in auto-regressive generation, the previously generated shots can be fed into the model with very low (or zero) noise, effectively acting as strong visual conditions for the next shot being generated.
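As a rough sketch of this auto-regressive conditioning, assume a generic scene-level denoising callable (here named `denoise_scene`, a placeholder rather than the paper's interface) that takes per-shot latents, per-shot timesteps, and the new shot's prompt:

```python
import torch
from typing import Callable, List

def extend_scene(
    denoise_scene: Callable[[List[torch.Tensor], List[int], str], torch.Tensor],
    prev_shots: List[torch.Tensor],
    new_prompt: str,
    num_sampling_steps: int = 50,
) -> torch.Tensor:
    # Generate one additional shot conditioned on previously generated shots.
    # `denoise_scene` stands in for the LCT model's scene-level denoising step
    # (an assumption for illustration): it returns an updated latent for the new shot.
    new_shot = torch.randn_like(prev_shots[0])  # the new shot starts as pure noise
    for step in reversed(range(num_sampling_steps)):
        # Earlier shots enter at timestep 0 (clean), so through the shared
        # attention they act purely as visual conditions for the noisy new shot.
        timesteps = [0] * len(prev_shots) + [step]
        new_shot = denoise_scene(prev_shots + [new_shot], timesteps, new_prompt)
    return new_shot
```

Joint generation corresponds to the case where all shots share a synchronized sampling schedule instead of the split timesteps above.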
5. Demonstrated Capabilities and Performance
Experiments detailed in the LCT paper (2503.10589) show significant improvements over baseline methods:
- Improved Coherence and Quality: LCT models achieved better scores on metrics like Fréchet Video Distance (FVD) for realism and CLIP score for text alignment compared to single-shot baselines applied independently to each shot. Crucially, specialized multi-shot consistency metrics showed that LCT generates scenes with significantly higher visual and dynamic consistency across shots (e.g., stable character appearance, consistent environments). Qualitative examples visually confirmed this improvement.
- Emergent Capabilities: The expanded context learning enables new functionalities:
- Compositional Generation: The model can synthesize scenes by combining different elements provided as conditions. For example, generating a video featuring a specific character (from an image) within a described environment, maintaining consistency across multiple generated shots.
- Interactive Shot Extension: Users can generate an initial shot and then auto-regressively extend the scene by generating subsequent shots that logically follow, using the previous shots as context. This allows for iterative refinement and control over the narrative flow.
6. Practical Applications and Implementation Considerations
LCT's ability to generate coherent multi-shot scenes opens up various applications:
- Filmmaking: Pre-visualization, storyboarding, generating rough cuts from scripts.
- Advertising: Creating multi-scene ads with consistent branding and products.
- Interactive Storytelling: Enabling dynamic video generation in games or narrative experiences.
- Content Creation Tools: Augmenting video editing software with scene generation capabilities.
- Education: Generating custom video explanations or historical visualizations.
Key implementation points include:
- Base Model: LCT requires a capable pre-trained single-shot video diffusion model, typically a Diffusion Transformer (DiT), as its foundation.
- Computational Cost: The primary bottleneck is the full attention mechanism, whose computation and memory requirements scale quadratically, i.e., O(L²), where L is the total number of tokens across all shots. This can become prohibitive for very long scenes (many shots or long shots).
Context-causal attention with KV-caching significantly reduces inference cost for auto-regressive generation, making it more practical.

```python
# Pseudocode illustrating sequence length
num_shots = 10
tokens_per_shot = 1024  # example; depends on resolution, duration, and patch size
total_tokens = num_shots * tokens_per_shot
# Attention complexity scales with total_tokens**2
```
- Data: Fine-tuning requires a dataset of multi-shot video scenes with associated text descriptions (global and potentially per-shot).
- Deployment: For interactive applications or generating very long sequences, the auto-regressive approach with context-causal attention and KV-caching is generally preferred due to better scalability during inference compared to joint generation with full bidirectional attention.
7. Limitations and Future Directions
While LCT represents a significant step forward, limitations remain:
- Scalability: The quadratic complexity of full attention limits the practical length (number of shots and duration) of scenes that can be processed efficiently, especially during training.
- Computational Resources: Training and running LCT models, particularly with full attention, demand substantial GPU memory and compute power.
Future research could explore:
- Efficient Attention Mechanisms: Investigating sparse attention patterns or hierarchical attention structures to reduce the computational cost of handling very long contexts.
- Architectural Improvements: Exploring modifications to the underlying diffusion transformer architecture specifically for multi-shot generation.
- Enhanced Conditioning: Incorporating richer conditioning signals (e.g., explicit structure graphs, detailed motion plans) beyond text prompts.
- Domain Adaptation: Applying and evaluating LCT in diverse video domains like animation, simulation, or special effects generation.
In conclusion, Long Context Tuning provides a practical framework for extending powerful single-shot video models to generate coherent multi-shot scenes, enabling more complex narrative video synthesis and unlocking new creative possibilities. Its core techniques—expanded context attention, interleaved positional embeddings, and asynchronous noise—effectively address the challenge of scene-level consistency, paving the way for more capable and versatile video generation systems.