Lumina-Video presents a diffusion-based framework for video generation that extends the capabilities of Next-DiT by integrating multi-scale processing and explicit motion conditioning. The paper addresses the inherent spatiotemporal complexity of video data by introducing a Multi-scale Next-DiT architecture, which employs multiple patchification strategies to balance computational efficiency and quality.
The core technical contributions can be summarized as follows:
- Multi-scale Next-DiT Architecture: The proposed model integrates several patchify/unpatchify pairs with distinct spatiotemporal patch sizes (e.g., (1,2,2), (2,2,2), and (2,4,4)). All scales share a common DiT backbone, which minimizes parameter overhead and facilitates cross-scale knowledge sharing. This multi-scale patchification decomposes denoising into stages: coarser patches are used at the early, noisier timesteps to capture global structure, while finer patches are used at later timesteps to recover high-frequency detail. The design is motivated by an analysis of loss curves across timesteps, which shows that the advantage of smaller patch sizes is concentrated in the later denoising phases (a minimal sketch of the shared-backbone, multi-scale patchification appears after this list).
- Scale-Aware Timestep Shifting: To further optimize training, a scale-aware timestep shifting strategy is introduced. Each patch size is assigned a different time-shift factor, so timesteps are still sampled from the full trajectory [0, 1] but larger patch sizes are biased toward the early denoising stages and smaller patches toward the later stages. This improves efficiency with little degradation in output quality and provides a flexible inference-time interface that can adapt to computational resource constraints (a sketch of one such shifting function follows the list).
- Explicit Motion Control via Motion Score Conditioning: To give direct control over the dynamics of generated videos, the model conditions the diffusion process on a motion score derived from the magnitude of optical flow computed with UniMatch. The motion score is injected as an additional input alongside the timestep, and it is set asymmetrically for the positive and negative classifier-free guidance samples. Ablation studies show that modulating the difference between the positive and negative motion conditions, rather than their absolute values, effectively tunes the degree of motion while keeping content quality consistent (see the guidance sketch after this list).
- Progressive and Multi-Source Training Strategies: Training follows a four-stage progressive schedule that begins with text-to-image pretraining and transitions into joint text-to-video learning. Spatial resolution and frame rate increase across the stages (from 256 px / 8 FPS to 960 px / 24 FPS), so the model first learns robust global structure and then refines fine-grained temporal detail. The training data mix natural and synthetic sources, with a multi-system-prompt strategy that enables learning from these diverse distributions; fine-tuning on the best-performing subset of system prompts further improves stability and dynamic control (an illustrative stage schedule appears after this list).
- Quantitative and Ablation Evaluations: On the VBench benchmark, Lumina-Video achieves competitive overall scores and performs particularly well in quality, semantic consistency, and motion smoothness. The multi-scale approach not only yields a better trade-off between quality and computational cost but also outperforms single-scale baselines. Ablations indicate that although using the smallest patch size throughout maximizes quality, combining scales gives a significant inference speed-up (with relative time costs of 0.07 and 0.36 for the coarser patch sizes) at only a minor drop in quality metrics. The motion-conditioning ablations show notable gains in dynamic degree as the gap between positive and negative motion scores widens, at the cost of a slight degradation in semantic alignment.
- Extension to Video-to-Audio Synthesis: The framework is extended with Lumina-V2A, which generates temporally synchronized ambient audio for silent videos. Lumina-V2A is a Next-DiT-based video-to-audio model that applies co-attention across the video, text, and audio modalities. Its generation pipeline combines a pre-trained audio VAE, modulation within Next-DiT blocks, and a HiFi-GAN vocoder that reconstructs waveforms from the generated mel-spectrogram representations. The multimodal conditioning modules enforce both semantic alignment and temporal synchronization between the generated audio and the video content (a high-level pipeline sketch appears at the end of this list).
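To make the shared-backbone design concrete, here is a minimal sketch of multi-scale patchification: one patchify/unpatchify pair per patch size, all routed through a single transformer that stands in for the Next-DiT blocks. The module names, hidden dimension, and toy backbone are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of multi-scale patchification around a shared backbone.
import torch
import torch.nn as nn


class MultiScalePatchifier(nn.Module):
    """One patchify/unpatchify pair per spatiotemporal patch size,
    all feeding the same (shared-weight) transformer backbone."""

    def __init__(self, in_channels=16, dim=512,
                 patch_sizes=((1, 2, 2), (2, 2, 2), (2, 4, 4))):
        super().__init__()
        self.patch_sizes = patch_sizes
        # Patch embedding: 3D conv with kernel == stride == patch size.
        self.patch_embeds = nn.ModuleList(
            nn.Conv3d(in_channels, dim, kernel_size=p, stride=p) for p in patch_sizes
        )
        # Unpatchify heads project tokens back to per-patch voxels.
        self.unpatch_heads = nn.ModuleList(
            nn.Linear(dim, in_channels * p[0] * p[1] * p[2]) for p in patch_sizes
        )
        # Shared backbone (toy stand-in for the Next-DiT blocks).
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x, scale_idx):
        # x: (B, C, T, H, W) latent video; scale_idx selects the patch size.
        B, C, T, H, W = x.shape
        pt, ph, pw = self.patch_sizes[scale_idx]
        tokens = self.patch_embeds[scale_idx](x)          # (B, dim, T/pt, H/ph, W/pw)
        gt, gh, gw = tokens.shape[2:]
        tokens = tokens.flatten(2).transpose(1, 2)        # (B, N, dim)
        tokens = self.backbone(tokens)                    # same weights for every scale
        voxels = self.unpatch_heads[scale_idx](tokens)    # (B, N, C*pt*ph*pw)
        voxels = voxels.view(B, gt, gh, gw, C, pt, ph, pw)
        # Reassemble the latent video from per-patch voxels.
        return voxels.permute(0, 4, 1, 5, 2, 6, 3, 7).reshape(B, C, T, H, W)


model = MultiScalePatchifier()
latent = torch.randn(1, 16, 8, 32, 32)
coarse = model(latent, scale_idx=2)   # (2,4,4): fewest tokens, for noisier timesteps
fine = model(latent, scale_idx=0)     # (1,2,2): most tokens, for the final denoising steps
```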
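The scale-aware timestep shifting can be sketched as a per-scale remapping of uniformly sampled timesteps. The shifting function below, s·t / (1 + (s−1)·t), is the form commonly used in flow-matching models, and the per-scale shift factors are hypothetical; the summary only states that larger patch sizes are biased toward the early, noisier stages.

```python
# Hedged sketch of scale-aware timestep shifting.
import torch


def shift_timesteps(t: torch.Tensor, shift: float) -> torch.Tensor:
    """Monotone remapping of uniform t in [0, 1]; shift > 1 pushes samples
    toward t = 1 (the high-noise end under the convention assumed here),
    shift < 1 pushes them toward t = 0."""
    return shift * t / (1.0 + (shift - 1.0) * t)


# Hypothetical shift factor per patch size: coarser patches see noisier
# timesteps more often, finer patches see cleaner (later) ones.
SCALE_SHIFTS = {
    (2, 4, 4): 3.0,   # coarsest: emphasize early, high-noise steps
    (2, 2, 2): 1.0,   # middle: roughly uniform sampling
    (1, 2, 2): 0.5,   # finest: emphasize late, low-noise steps
}


def sample_training_timesteps(batch_size: int, patch_size) -> torch.Tensor:
    t = torch.rand(batch_size)                      # uniform on [0, 1]
    return shift_timesteps(t, SCALE_SHIFTS[patch_size])


t_coarse = sample_training_timesteps(4, (2, 4, 4))  # biased toward noisy timesteps
t_fine = sample_training_timesteps(4, (1, 2, 2))    # biased toward clean timesteps
```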
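The asymmetric motion conditioning can be illustrated with a classifier-free guidance step in which the positive and negative branches receive different motion scores. The function signature and the dummy model below are assumptions for illustration; the grounded point is that widening the gap between the two motion scores raises the dynamic degree of the output.

```python
# Sketch of classifier-free guidance with asymmetric motion-score conditioning.
import torch


def guided_velocity(model, x_t, t, text_emb, null_emb,
                    motion_pos: float, motion_neg: float, cfg_scale: float = 6.0):
    """One guided prediction. model(x, t, text, motion) is assumed to return
    the velocity/noise prediction of a motion-conditioned diffusion model."""
    batch = x_t.shape[0]
    m_pos = torch.full((batch,), motion_pos, device=x_t.device)
    m_neg = torch.full((batch,), motion_neg, device=x_t.device)

    # Positive branch: text prompt + higher motion score.
    v_pos = model(x_t, t, text_emb, m_pos)
    # Negative branch: unconditional prompt + lower motion score.
    v_neg = model(x_t, t, null_emb, m_neg)

    # Standard CFG combination; increasing (motion_pos - motion_neg)
    # pushes generations toward stronger motion.
    return v_neg + cfg_scale * (v_pos - v_neg)


# Usage with a dummy model standing in for the real network:
dummy = lambda x, t, txt, m: torch.zeros_like(x) + m.view(-1, 1, 1, 1, 1)
x = torch.randn(2, 16, 8, 32, 32)
v = guided_velocity(dummy, x, torch.tensor([0.5, 0.5]), None, None,
                    motion_pos=8.0, motion_neg=2.0)
```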
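As a rough summary of the progressive schedule, the structure below records only what the text above states (four stages, text-to-image first, video resolution and FPS rising from 256 px / 8 FPS to 960 px / 24 FPS); settings not given in the summary are left as None rather than guessed.

```python
# Illustrative progressive training schedule; None marks values not stated above.
TRAINING_STAGES = [
    {"stage": 1, "task": "text-to-image",       "resolution": None, "fps": None},
    {"stage": 2, "task": "joint text-to-video", "resolution": 256,  "fps": 8},
    {"stage": 3, "task": "joint text-to-video", "resolution": None, "fps": None},
    {"stage": 4, "task": "joint text-to-video", "resolution": 960,  "fps": 24},
]
```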
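Finally, a high-level sketch of the Lumina-V2A inference flow described above: denoise an audio latent with a Next-DiT model conditioned on video and text features, decode the latent with the audio VAE, and run a HiFi-GAN-style vocoder on the resulting mel-spectrogram. Every component below is a stub, since only the pipeline order is given here.

```python
# Pipeline-order sketch of Lumina-V2A inference; all modules are placeholders.
import torch
import torch.nn as nn


class DenoiserStub(nn.Module):
    """Placeholder for the Next-DiT audio denoiser with video/text/audio co-attention."""
    def forward(self, latent, t, video_feat, text_feat):
        return latent  # a real denoiser would predict and remove noise here


def generate_audio_sketch(video_feat, text_feat, num_steps=25):
    denoiser = DenoiserStub()
    audio_vae_decode = nn.Identity()   # stand-in for the pre-trained audio VAE decoder
    vocoder = nn.Identity()            # stand-in for the HiFi-GAN vocoder

    latent = torch.randn(1, 8, 256)    # hypothetical audio-latent shape
    for step in range(num_steps):
        t = torch.tensor([1.0 - step / num_steps])
        latent = denoiser(latent, t, video_feat, text_feat)
    mel = audio_vae_decode(latent)     # latent -> mel-spectrogram
    return vocoder(mel)                # mel-spectrogram -> waveform


waveform = generate_audio_sketch(torch.randn(1, 16, 512), torch.randn(1, 77, 512))
```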
In summary, Lumina-Video systematically improves video synthesis through a combination of multi-scale hierarchical diffusion, motion-aware conditioning, and advanced training schemes. The detailed ablation studies and extensive benchmarks affirm the effectiveness of its technical innovations in achieving a robust balance between high-fidelity generation and computational efficiency, while also paving the way for further integration of multimodal capabilities.