- The paper introduces a three-stage autoregressive framework that integrates multimodal control injection, degradation-aware training, and history context guidance.
- It achieves state-of-the-art performance with improved aesthetic quality, SSIM, LPIPS, and temporal consistency across minute-scale video generation.
- The work introduces LongVGenBench, a new benchmark for evaluating controllability and coherence in ultra-long video world models.
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Introduction and Motivation
The "LongVie 2: Multimodal Controllable Ultra-Long Video World Model" (2512.13604) addresses the challenge of constructing video world models that unify fine-grained controllability, long-term visual fidelity, and temporal consistency within a scalable autoregressive diffusion framework. While recent architectures can synthesize short high-quality video clips, their ability to maintain controllability and spatiotemporal coherence degrades as generation duration increases, often resulting in visual artifacts and semantic drift over time.
Figure 1: Unstable long-term generation—visual and control quality deteriorate as video duration increases in prior video world models.
The paper posits that a robust video world model should integrate global and local controls across dense and sparse modalities, remain resilient to the cumulative degradation encountered during long-term autoregressive inference, and leverage temporal history to preserve cross-clip coherence. LongVie 2 systematically integrates these principles into both its architectural design and progressive training scheme.
Methodology
Three-Stage Training Paradigm
LongVie 2 adopts a three-stage autoregressive framework that progresses from short-clip control adaptation to robust minute-scale generation with history-aware temporal regularization. The three stages are as follows (illustrative sketches of each stage appear after the list):
- Multi-Modal Control Injection: This stage employs dense (depth maps) and sparse (tracked keypoints) conditioning inputs. Specialized, lightweight DiT control branches, initialized from the large diffusion backbone, modulate the generative process by combining structural scene information with semantic motion cues. A degradation-based regularization scheme encourages balanced fusion and prevents the dense modality from dominating the control signal.
- Degradation-Aware Frame Training: Input frames are intentionally degraded during training to bridge the distribution shift between clean training frames and the progressively deteriorated frames encountered during extended autoregressive rollouts. Two complementary mechanisms are used: repeated VAE encode-decode cycles and controlled diffusion-noise injection, both sampled stochastically to cover the range of degradation typically seen at inference.
Figure 2: Framework of LongVie 2: combines multimodal control, degradation-aware training, and history context for controllable, coherent long-horizon generation.
Figure 3: Training pipeline: sequential application of ControlNet-based control learning, frame degradation, and history integration.
Figure 4: Frame Degradation—simulated via VAE encode-decode and latent denoising to anticipate quality decay during sampling.
- History Context Guidance: For each new video segment, tail frames from the preceding clips (the historical context) are injected as auxiliary inputs. The same degradation is simulated on these historical frames to align their domain with inference-time inputs. Dedicated temporal loss terms enforce both low-frequency (structural) and high-frequency (detail) alignment at clip transitions, with exponentially weighted regularization that stabilizes the critical clip-boundary frames.
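The dense-sparse fusion in the control-injection stage can be pictured with the minimal sketch below. The module name, feature shapes, and the random dense-branch dropout used as a stand-in for the paper's degradation-based regularization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ControlFusion(nn.Module):
    """Fuse dense (depth) and sparse (keypoint) control features.

    A minimal sketch: during training the dense branch is randomly zeroed
    with probability `p_drop_dense`, standing in for the paper's
    degradation-based regularization that keeps the dense modality from
    dominating the fused control signal.
    """

    def __init__(self, dim: int, p_drop_dense: float = 0.3):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.p_drop_dense = p_drop_dense

    def forward(self, dense_feat: torch.Tensor, sparse_feat: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.p_drop_dense:
            dense_feat = torch.zeros_like(dense_feat)   # attenuate dense control
        return self.proj(torch.cat([dense_feat, sparse_feat], dim=-1))
```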
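Frame degradation in the second stage can be sketched as below, assuming a generic `vae` object exposing `encode`/`decode`; the cycle count, noise range, and the simple linear noise blend are illustrative choices, not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def degrade_frames(frames: torch.Tensor, vae, max_cycles: int = 3,
                   max_noise: float = 0.4) -> torch.Tensor:
    """Simulate rollout-style quality decay on clean training frames (sketch)."""
    x = frames
    # Repeated VAE encode-decode cycles accumulate reconstruction error,
    # mimicking the compounding loss of detail over long autoregressive rollouts.
    for _ in range(int(torch.randint(0, max_cycles + 1, (1,)))):
        x = vae.decode(vae.encode(x))
    # Controlled noise injection in latent space stands in for the paper's
    # diffusion-noise mechanism; a simple linear blend is used here.
    z = vae.encode(x)
    sigma = float(max_noise * torch.rand(()))
    z = (1.0 - sigma) * z + sigma * torch.randn_like(z)
    return vae.decode(z)
```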
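The exponentially weighted clip-boundary regularization of the history-guidance stage might look roughly like the following; the decay factor, the MSE objective, and the (B, T, C, H, W) tensor layout are assumptions, and the low-/high-frequency split described in the paper is only noted in a comment.

```python
import torch
import torch.nn.functional as F

def boundary_weighted_loss(pred_clip: torch.Tensor, target_clip: torch.Tensor,
                           gamma: float = 0.8) -> torch.Tensor:
    """Weight frames near the clip boundary most heavily (illustrative sketch).

    Tensors are assumed to be (B, T, C, H, W) with frame 0 adjacent to the
    previous clip. A low-frequency (structural) term could additionally be
    computed on avg-pooled frames, with the residual acting as the
    high-frequency term; it is omitted here for brevity.
    """
    t = pred_clip.shape[1]
    weights = gamma ** torch.arange(t, dtype=pred_clip.dtype,
                                    device=pred_clip.device)
    per_frame = F.mse_loss(pred_clip, target_clip, reduction="none")
    per_frame = per_frame.mean(dim=(0, 2, 3, 4))        # one value per frame
    return (weights * per_frame).sum() / weights.sum()
```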
Training-Free Strategies
Two non-parametric techniques further enhance long-range coherence (see the combined sketch after this list):
- Unified Noise Initialization: The same noise latent is shared across all clips in an autoregressive rollout, reducing stochastic inconsistency at clip transitions.
- Global Normalization: Control signals such as depth maps are normalized globally across an entire video, mitigating local intensity drifts and maintaining consistent physical meaning.
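Both training-free strategies can be combined in a small setup routine like the sketch below; the function name, tensor shapes, and the min-max normalization are assumptions made for illustration.

```python
import torch

def make_rollout_inputs(depth_video: torch.Tensor, num_clips: int,
                        latent_shape: tuple, seed: int = 0):
    """Training-free conditioning setup (illustrative sketch).

    - Unified noise initialization: one latent noise tensor is sampled once
      and reused for every clip in the autoregressive rollout.
    - Global normalization: depth maps are rescaled with statistics computed
      over the whole video, so per-clip depth-range changes do not shift the
      control signal's meaning.
    """
    g = torch.Generator().manual_seed(seed)
    shared_noise = torch.randn(latent_shape, generator=g)    # reused per clip

    d_min, d_max = depth_video.min(), depth_video.max()      # video-level stats
    depth_norm = (depth_video - d_min) / (d_max - d_min + 1e-6)

    clips = depth_norm.chunk(num_clips, dim=0)               # split along time
    return [(shared_noise, clip) for clip in clips]
```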
Evaluation and Experimental Results
LongVGenBench
A new benchmark, LongVGenBench, is released: 100 diverse one-minute, high-resolution videos spanning real-world and synthetic domains. Each video is segmented into clips for autoregressive generation, enabling rigorous evaluation of both controllability and coherence.
Figure 5: LongVGenBench—diverse real and synthetic scenarios for controllable long-video generation benchmarking.
Quantitative Results
LongVie 2 consistently establishes new SOTA performance across visual quality, controllability, and temporal metrics, notably:
- Aesthetic Quality: 58.47% (vs. 56.18% for prior best)
- Imaging Quality: 69.77%
- SSIM: 0.529
- LPIPS: 0.295 (lower is better)
- Subject Consistency: 91.05%
- Background Consistency: 92.45%
- Dynamic Degree and Overall Consistency: consistent improvements across all reported metrics
Ablations on each training stage confirm complementary gains, with history context guidance critical for eliminating boundary discontinuities.
Qualitative Analysis
Figure 6: Ablation—each training stage successively improves visual quality and removes intra-clip artifacts.
Figure 7: Controllability—LongVie 2 maintains precise structural and appearance alignment compared to GWF and DAS.
Figure 8: Long-term generation—minute-plus sequences retain stability, realism, and world-level guidance.
Minute- to five-minute generation results further support the robustness of LongVie 2, demonstrating sustained detail, persistent style and semantic alignment, and coherent world-level behavior under repetition and manipulation in both subject-driven and subject-free scenarios.
Figure 9: Subject-driven 1-minute scenario—style manipulation across seasons while retaining subject and motion.
Figure 10: Subject-free—long drone flight maintaining global and seasonal consistency.
Figure 11: 3-minute scenario—long-term consistency across style transfers in a mountain drive.
Figure 12: 5-minute scenarios—subject-driven and subject-free, indicating model stability and cross-clip regularity at scale.
Contributions and Theoretical Implications
LongVie 2 demonstrates that progressive integration of multi-modal controllability, explicit degradation modeling, and history-informed context can effectively resolve the persistent controllability-coherence-fidelity trilemma in long-range video generation. By building on a pretrained video backbone and optimizing only the control and self-attention layers, the method is parameter-efficient and adaptable to future model scaling.
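A minimal sketch of this parameter-efficient setup is shown below; the substring matching on parameter names ("control", "self_attn") is an assumption about how such layers might be named and would need to be adapted to the actual backbone.

```python
import torch.nn as nn

def mark_trainable(model: nn.Module) -> None:
    """Freeze the pretrained backbone; train only control and self-attention
    parameters (illustrative sketch, not the paper's exact selection rule)."""
    for name, param in model.named_parameters():
        param.requires_grad = ("control" in name) or ("self_attn" in name)

    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {n_train} / {n_total}")
```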
It underscores the need to match the training distribution to the degraded inputs seen during iterative inference and confirms the value of history-aware regularization, aligning with recent findings on self-forcing and error recycling. Methodologically, the approach extends naturally to more diverse conditioning modalities, finer temporal granularity, and higher resolutions.
Practical Impact and Future Directions
The release of LongVGenBench provides a unified standard for holistic benchmarking of world-grounded, controllable long-video generators. LongVie 2's robust controllability, which supports both motion and style transfer, and its stable generation over multi-minute durations establish a practical basis for simulation, VFX, robotics, and virtual environment modeling.
Notable open directions include scaling to higher spatial resolutions, finer scene-level interaction and relighting, unifying latent-world representations, and integrating closed-loop agent-environment feedback for fully interactive “generalist” video world models.
Conclusion
LongVie 2 (arXiv 2512.13604) constitutes a technical advance toward unified, controllable, and coherent world modeling in video. By addressing both core architectural and training-distribution issues and substantiating its claims with strong empirical benchmarks and ablations, the work sets a foundation for future research targeting "video as a world model" at larger scale and in more complex interactive settings.