- The paper introduces a unified model for autoregressive video generation by adapting LLM architectures with minimal modifications.
- It proposes MM-RoPE, a novel spatiotemporal position encoding that partitions embedding channels to balance temporal and spatial frequency representation.
- The AR-DF training paradigm, featuring temporal tube masking, achieves competitive benchmark performance while maintaining resource efficiency.
LumosGen: Advancing Autoregressive Video Generation with Unified LLM Architectures
LumosGen presents a unified approach to autoregressive video generation, leveraging the architectural principles of LLMs with minimal modifications. The work addresses key challenges in extending LLMs to the spatiotemporal domain, proposing novel solutions for position encoding and training dynamics that are critical for high-fidelity, efficient video synthesis.
Unified Model Design and Architectural Minimalism
LumosGen is architected to closely follow the Llama backbone, incorporating RMSNorm, SwiGLU, and QK-Norm for stability and efficiency. The model employs a unified discrete codebook for both language and visual tokens, enabling seamless multimodal processing. This design choice allows for joint training on images and videos of varying aspect ratios without the need for resizing, and supports a range of generative tasks including text-to-image, image-to-video, and text-to-video.
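As a concrete illustration of this recipe, below is a minimal sketch of a Llama-style decoder block combining RMSNorm, QK-Norm on the per-head queries and keys, and a SwiGLU feed-forward network. All class and parameter names are illustrative assumptions rather than the paper's released code, and the rotary position embedding (discussed in the next section) is omitted for brevity.

```python
# Minimal sketch of a Llama-style decoder block with RMSNorm, SwiGLU, and
# QK-Norm, per the description above. Names and shapes are illustrative
# assumptions, not the paper's implementation; RoPE application is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features.
        return x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt() * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DecoderBlock(nn.Module):
    def __init__(self, dim, n_heads, ffn_hidden):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        # QK-Norm: normalize per-head queries and keys before attention.
        self.q_norm, self.k_norm = RMSNorm(self.head_dim), RMSNorm(self.head_dim)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.attn_norm(x)).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        return x + self.ffn(self.ffn_norm(x))
```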
Spatiotemporal Position Encoding: MM-RoPE
A central contribution is the Multi-Modal Rotary Position Embedding (MM-RoPE), which extends the standard RoPE mechanism to better capture the spatiotemporal correlations inherent in video data. The authors identify that naive 1D and even conventional 3D RoPE variants suffer from imbalanced frequency allocation, leading to suboptimal modeling of temporal and spatial dependencies. MM-RoPE addresses this with two mechanisms (sketched in code after this list):
- Distributed Frequency Allocation: Embedding channels are partitioned into multiple meta-groups, each interleaving temporal, height, and width information, ensuring comprehensive frequency coverage for all dimensions.
- Scaled Position Encoding: The positional indices for visual tokens are empirically scaled to match the resolution of the RGB space, balancing the representation power between language and vision modalities.
Empirical ablations demonstrate that MM-RoPE yields faster convergence and lower validation loss compared to prior RoPE variants, with negligible inference overhead (<5% latency increase).
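The two mechanisms above can be illustrated with a short sketch of how rotary frequencies might be interleaved across the temporal, height, and width axes and how visual positions might be scaled. The meta-group count, the axis cycling order, and the scaling constants are assumptions chosen for clarity, not the paper's exact hyperparameters.

```python
# Illustrative sketch of MM-RoPE's distributed frequency allocation: channel
# pairs are split into meta-groups, and within each group the t/h/w axes are
# interleaved so every axis receives both high and low rotary frequencies.
# Group count, axis order, and the position scale are assumed values.
import numpy as np

def mm_rope_freqs(head_dim, n_meta_groups=4, base=10000.0):
    """Return (axis_id, inv_freq) per rotary channel pair (0=t, 1=h, 2=w)."""
    n_pairs = head_dim // 2
    inv_freq = base ** (-np.arange(n_pairs) / n_pairs)  # standard RoPE spectrum
    axes = np.empty(n_pairs, dtype=int)
    for g in range(n_meta_groups):
        lo = g * n_pairs // n_meta_groups
        hi = (g + 1) * n_pairs // n_meta_groups
        axes[lo:hi] = np.arange(hi - lo) % 3             # cycle t, h, w in-group
    return axes, inv_freq

def apply_mm_rope(x, pos_thw, axes, inv_freq, scale=(1.0, 2.0, 2.0)):
    """Rotate x (seq, head_dim) by its (t, h, w) positions pos_thw (seq, 3).

    `scale` stands in for the scaled position encoding of visual tokens."""
    pos = np.asarray(pos_thw, dtype=float) * np.asarray(scale)
    angles = pos[:, axes] * inv_freq                     # (seq, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```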
Autoregressive Discrete Diffusion Forcing (AR-DF)
To address the inefficiencies and loss imbalances in training autoregressive video generators, LumosGen introduces AR-DF, a training and inference paradigm built on two components:
- Temporal Tube Masking: During training, a random mask pattern is generated for the first frame and repeated across all subsequent frames (see the masking sketch below), mitigating spatial information leakage and ensuring that temporal dynamics are genuinely learned.
- Inference-Time Masking: At inference, partial context masking is applied to each generated frame, mirroring the training regime and preserving both frame quality and motion coherence.
This approach is formalized in detailed training and inference algorithms, and ablation studies confirm that AR-DF is essential for preventing artifacts and maintaining temporal consistency in generated videos.
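A minimal sketch of the temporal tube mask described above: a single random spatial pattern is drawn for the first frame and repeated over time, so each token position is either visible or masked in every frame. The tensor shapes, mask ratio, and function name are illustrative assumptions.

```python
# Sketch of temporal tube masking: one random spatial mask is sampled for the
# first frame and repeated over all frames, forming masked "tubes" along time.
# Shapes, mask ratio, and the function name are illustrative assumptions.
import torch

def temporal_tube_mask(n_frames, h_tokens, w_tokens, mask_ratio=0.5, generator=None):
    """Boolean mask (n_frames, h_tokens, w_tokens); True = masked token."""
    n_spatial = h_tokens * w_tokens
    n_masked = int(mask_ratio * n_spatial)
    perm = torch.randperm(n_spatial, generator=generator)
    frame_mask = torch.zeros(n_spatial, dtype=torch.bool)
    frame_mask[perm[:n_masked]] = True                 # mask pattern for frame 0
    # Repeating the same pattern on every frame means a masked position cannot
    # be recovered from the same location in neighboring frames (spatial leakage).
    return frame_mask.view(1, h_tokens, w_tokens).expand(n_frames, -1, -1)
```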
Implementation and Resource Efficiency
LumosGen is trained from scratch on 60M images and 10M videos using only 48 GPUs, a notably modest resource footprint compared to contemporary large-scale video models. The implementation incorporates:
- Flash Attention for memory-efficient training and inference.
- Chunked Cross-Entropy Loss to handle the large codebook size without exceeding GPU memory limits (see the sketch after this list).
- Stage-Wise Training: Initial text-to-image training is followed by joint image-video training, progressively increasing resolution and complexity.
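The chunked cross-entropy item can be sketched as follows: the vocabulary projection and loss are computed over one slice of token positions at a time, so the full sequence-by-vocabulary logit tensor is never materialized. The function signature and default chunk size are assumptions for illustration.

```python
# Sketch of a chunked cross-entropy loss over a large discrete codebook: the
# vocabulary projection and loss are computed per chunk of token positions, so
# the full (sequence x vocab) logit tensor is never held in memory at once.
# The signature and default chunk size are assumptions for illustration.
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, targets, chunk_size=4096):
    """hidden: (n_tokens, dim); lm_head_weight: (vocab, dim); targets: (n_tokens,)."""
    total = hidden.new_zeros(())
    n_tokens = hidden.shape[0]
    for start in range(0, n_tokens, chunk_size):
        end = min(start + chunk_size, n_tokens)
        logits = hidden[start:end] @ lm_head_weight.T   # chunk-local logits only
        total = total + F.cross_entropy(logits, targets[start:end], reduction="sum")
    return total / n_tokens                             # mean over all tokens
```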
Empirical Results
LumosGen achieves competitive or superior results on standard benchmarks:
- GenEval (Text-to-Image): Outperforms diffusion models of similar size and matches the performance of larger autoregressive models such as EMU3, particularly excelling in position and attribute binding metrics.
- VBench-I2V (Image-to-Video): Matches the performance of COSMOS-Video2World, despite using an order of magnitude less data and compute.
- VBench-T2V (Text-to-Video): Comparable to OpenSoraPlan and EMU3, with strong object-centric and color consistency metrics.
Qualitative comparisons show that LumosGen produces videos with natural motion, prompt fidelity, and multi-object coherence, often surpassing both diffusion and autoregressive baselines in visual alignment and temporal stability.
Implications and Future Directions
Practical Implications:
- Unified Multimodal Generation: The architectural minimalism and unified codebook facilitate deployment in scenarios requiring both language and visual generation, reducing the need for separate models or external text encoders.
- Resource Efficiency: The ability to train high-performing video generators on modest hardware democratizes access to advanced generative models.
- Flexible Aspect Ratio Support: The model's robustness to varying aspect ratios broadens its applicability in real-world multimedia applications.
Theoretical Implications:
- Position Encoding in Multimodal Transformers: MM-RoPE provides a principled approach to balancing frequency spectra across modalities and dimensions, which may inform future work in multimodal and spatiotemporal transformers.
- Training Dynamics for Autoregressive Video Models: AR-DF highlights the importance of aligning training and inference masking strategies to prevent shortcut learning and ensure genuine temporal modeling.
Future Developments:
- Scaling Data and Model Size: The authors note that further scaling of data and model capacity is likely to yield additional gains, particularly in complex action and scene understanding.
- Multimodal Knowledge Infusion: Integrating pretrained vision-LLMs or co-training with understanding tasks could enhance grounding and generalization.
- Alignment and Safety: As with all generative models, alignment tuning and robust filtering mechanisms will be necessary for responsible deployment.
Conclusion
LumosGen demonstrates that autoregressive video generation can be effectively realized within the LLM paradigm, provided that spatiotemporal position encoding and training dynamics are carefully addressed. The proposed MM-RoPE and AR-DF mechanisms are both theoretically motivated and empirically validated, offering a practical blueprint for unified, efficient, and high-quality multimodal generative models. The work sets a new standard for resource-efficient, scalable video generation and opens avenues for further research in unified multimodal modeling.