- The paper introduces a unified model for autoregressive video generation by adapting LLM architectures with minimal modifications.
- It proposes MM-RoPE, a novel spatiotemporal position encoding that partitions embedding channels to balance temporal and spatial frequency representation.
- The AR-DF training paradigm, featuring temporal tube masking, achieves competitive benchmark performance while maintaining resource efficiency.
LumosGen: Advancing Autoregressive Video Generation with Unified LLM Architectures
LumosGen presents a unified approach to autoregressive video generation, leveraging the architectural principles of LLMs with minimal modifications. The work addresses key challenges in extending LLMs to the spatiotemporal domain, proposing novel solutions for position encoding and training dynamics that are critical for high-fidelity, efficient video synthesis.
Unified Model Design and Architectural Minimalism
LumosGen is architected to closely follow the Llama backbone, incorporating RMSNorm, SwiGLU, and QK-Norm for stability and efficiency. The model employs a unified discrete codebook for both language and visual tokens, enabling seamless multimodal processing. This design choice allows for joint training on images and videos of varying aspect ratios without the need for resizing, and supports a range of generative tasks including text-to-image, image-to-video, and text-to-video.
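As a concrete illustration of this recipe, below is a minimal sketch of a Llama-style decoder block combining RMSNorm, QK-Norm on the per-head queries and keys, and a SwiGLU feed-forward network. All class and parameter names are illustrative assumptions rather than the paper's released code, and the rotary position embedding (discussed in the next section) is omitted for brevity.

```python
# Minimal sketch of a Llama-style decoder block with RMSNorm, SwiGLU, and
# QK-Norm, per the description above. Names and shapes are illustrative
# assumptions, not the paper's implementation; RoPE application is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the features.
        return x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt() * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DecoderBlock(nn.Module):
    def __init__(self, dim, n_heads, ffn_hidden):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        # QK-Norm: normalize per-head queries and keys before attention.
        self.q_norm, self.k_norm = RMSNorm(self.head_dim), RMSNorm(self.head_dim)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.attn_norm(x)).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        return x + self.ffn(self.ffn_norm(x))
```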
Spatiotemporal Position Encoding: MM-RoPE
A central contribution is the Multi-Modal Rotary Position Embedding (MM-RoPE), which extends the standard RoPE mechanism to better capture the spatiotemporal correlations inherent in video data. The authors identify that naive 1D and even conventional 3D RoPE variants suffer from imbalanced frequency allocation, leading to suboptimal modeling of temporal and spatial dependencies. MM-RoPE addresses this with two mechanisms (sketched in code after this list):
- Distributed Frequency Allocation: Embedding channels are partitioned into multiple meta-groups, each interleaving temporal, height, and width information, ensuring comprehensive frequency coverage for all dimensions.
- Scaled Position Encoding: The positional indices for visual tokens are empirically scaled to match the resolution of the RGB space, balancing the representation power between language and vision modalities.
Empirical ablations demonstrate that MM-RoPE yields faster convergence and lower validation loss compared to prior RoPE variants, with negligible inference overhead (<5% latency increase).
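The two mechanisms above can be illustrated with a short sketch of how rotary frequencies might be interleaved across the temporal, height, and width axes and how visual positions might be scaled. The meta-group count, the axis cycling order, and the scaling constants are assumptions chosen for clarity, not the paper's exact hyperparameters.

```python
# Illustrative sketch of MM-RoPE's distributed frequency allocation: channel
# pairs are split into meta-groups, and within each group the t/h/w axes are
# interleaved so every axis receives both high and low rotary frequencies.
# Group count, axis order, and the position scale are assumed values.
import numpy as np

def mm_rope_freqs(head_dim, n_meta_groups=4, base=10000.0):
    """Return (axis_id, inv_freq) per rotary channel pair (0=t, 1=h, 2=w)."""
    n_pairs = head_dim // 2
    inv_freq = base ** (-np.arange(n_pairs) / n_pairs)  # standard RoPE spectrum
    axes = np.empty(n_pairs, dtype=int)
    for g in range(n_meta_groups):
        lo = g * n_pairs // n_meta_groups
        hi = (g + 1) * n_pairs // n_meta_groups
        axes[lo:hi] = np.arange(hi - lo) % 3             # cycle t, h, w in-group
    return axes, inv_freq

def apply_mm_rope(x, pos_thw, axes, inv_freq, scale=(1.0, 2.0, 2.0)):
    """Rotate x (seq, head_dim) by its (t, h, w) positions pos_thw (seq, 3).

    `scale` stands in for the scaled position encoding of visual tokens."""
    pos = np.asarray(pos_thw, dtype=float) * np.asarray(scale)
    angles = pos[:, axes] * inv_freq                     # (seq, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```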
Autoregressive Discrete Diffusion Forcing (AR-DF)
To address the inefficiencies and loss imbalances in training autoregressive video generators, LumosGen introduces AR-DF, a training and inference paradigm built on two components:
- Temporal Tube Masking: During training, a random mask pattern is generated for the first frame and repeated across all subsequent frames (see the masking sketch below), mitigating spatial information leakage and ensuring that temporal dynamics are genuinely learned.
- Inference-Time Masking: At inference, partial context masking is applied to each generated frame, mirroring the training regime and preserving both frame quality and motion coherence.
This approach is formalized in detailed training and inference algorithms, and ablation studies confirm that AR-DF is essential for preventing artifacts and maintaining temporal consistency in generated videos.
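A minimal sketch of the temporal tube mask described above: a single random spatial pattern is drawn for the first frame and repeated over time, so each token position is either visible or masked in every frame. The tensor shapes, mask ratio, and function name are illustrative assumptions.

```python
# Sketch of temporal tube masking: one random spatial mask is sampled for the
# first frame and repeated over all frames, forming masked "tubes" along time.
# Shapes, mask ratio, and the function name are illustrative assumptions.
import torch

def temporal_tube_mask(n_frames, h_tokens, w_tokens, mask_ratio=0.5, generator=None):
    """Boolean mask (n_frames, h_tokens, w_tokens); True = masked token."""
    n_spatial = h_tokens * w_tokens
    n_masked = int(mask_ratio * n_spatial)
    perm = torch.randperm(n_spatial, generator=generator)
    frame_mask = torch.zeros(n_spatial, dtype=torch.bool)
    frame_mask[perm[:n_masked]] = True                 # mask pattern for frame 0
    # Repeating the same pattern on every frame means a masked position cannot
    # be recovered from the same location in neighboring frames (spatial leakage).
    return frame_mask.view(1, h_tokens, w_tokens).expand(n_frames, -1, -1)
```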
Implementation and Resource Efficiency
LumosGen is trained from scratch on 60M images and 10M videos using only 48 GPUs, a notably modest resource footprint compared to contemporary large-scale video models. The implementation incorporates:
- Flash Attention for memory-efficient training and inference.
- Chunked Cross-Entropy Loss to handle the large codebook size without exceeding GPU memory limits (see the sketch after this list).
- Stage-Wise Training: Initial text-to-image training is followed by joint image-video training, progressively increasing resolution and complexity.
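The chunked cross-entropy item can be sketched as follows: the vocabulary projection and loss are computed over one slice of token positions at a time, so the full sequence-by-vocabulary logit tensor is never materialized. The function signature and default chunk size are assumptions for illustration.

```python
# Sketch of a chunked cross-entropy loss over a large discrete codebook: the
# vocabulary projection and loss are computed per chunk of token positions, so
# the full (sequence x vocab) logit tensor is never held in memory at once.
# The signature and default chunk size are assumptions for illustration.
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, targets, chunk_size=4096):
    """hidden: (n_tokens, dim); lm_head_weight: (vocab, dim); targets: (n_tokens,)."""
    total = hidden.new_zeros(())
    n_tokens = hidden.shape[0]
    for start in range(0, n_tokens, chunk_size):
        end = min(start + chunk_size, n_tokens)
        logits = hidden[start:end] @ lm_head_weight.T   # chunk-local logits only
        total = total + F.cross_entropy(logits, targets[start:end], reduction="sum")
    return total / n_tokens                             # mean over all tokens
```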
Empirical Results
LumosGen achieves competitive or superior results on standard benchmarks:
- GenEval (Text-to-Image): Outperforms diffusion models of similar size and matches the performance of larger autoregressive models such as EMU3, particularly excelling in position and attribute binding metrics.
- VBench-I2V (Image-to-Video): Matches the performance of COSMOS-Video2World, despite using an order of magnitude less data and compute.
- VBench-T2V (Text-to-Video): Comparable to OpenSoraPlan and EMU3, with strong object-centric and color consistency metrics.
Qualitative comparisons show that LumosGen produces videos with natural motion, prompt fidelity, and multi-object coherence, often surpassing both diffusion and autoregressive baselines in visual alignment and temporal stability.
Implications and Future Directions
Practical Implications:
- Unified Multimodal Generation: The architectural minimalism and unified codebook facilitate deployment in scenarios requiring both language and visual generation, reducing the need for separate models or external text encoders.
- Resource Efficiency: The ability to train high-performing video generators on modest hardware democratizes access to advanced generative models.
- Flexible Aspect Ratio Support: The model's robustness to varying aspect ratios broadens its applicability in real-world multimedia applications.
Theoretical Implications:
- Position Encoding in Multimodal Transformers: MM-RoPE provides a principled approach to balancing frequency spectra across modalities and dimensions, which may inform future work in multimodal and spatiotemporal transformers.
- Training Dynamics for Autoregressive Video Models: AR-DF highlights the importance of aligning training and inference masking strategies to prevent shortcut learning and ensure genuine temporal modeling.
Future Developments:
- Scaling Data and Model Size: The authors note that further scaling of data and model capacity is likely to yield additional gains, particularly in complex action and scene understanding.
- Multimodal Knowledge Infusion: Integrating pretrained vision-LLMs or co-training with understanding tasks could enhance grounding and generalization.
- Alignment and Safety: As with all generative models, alignment tuning and robust filtering mechanisms will be necessary for responsible deployment.
Conclusion
LumosGen demonstrates that autoregressive video generation can be effectively realized within the LLM paradigm, provided that spatiotemporal position encoding and training dynamics are carefully addressed. The proposed MM-RoPE and AR-DF mechanisms are both theoretically motivated and empirically validated, offering a practical blueprint for unified, efficient, and high-quality multimodal generative models. The work sets a new standard for resource-efficient, scalable video generation and opens avenues for further research in unified multimodal modeling.