- The paper introduces a unified LLM-based model that seamlessly integrates text and visual tokens for text-to-image, image-to-video, and text-to-video generation.
- The paper's novel MM-RoPE mechanism interleaves temporal, height, and width encodings to enhance spatiotemporal modeling and cross-modal alignment.
- The paper employs AR-DF strategies with temporal tube and partial context masking to optimize training and inference for efficient, high-quality video synthesis.
Lumos-1: A Unified Autoregressive Model for Video Generation
Lumos-1 presents a unified approach to autoregressive video generation, leveraging the architectural principles of LLMs with minimal modifications. The work addresses key challenges in extending LLMs to the spatiotemporal domain, proposing novel mechanisms for position encoding and training dynamics that are critical for high-fidelity, efficient video synthesis.
Model Architecture and Unified Tokenization
Lumos-1 adopts the Llama architecture as its backbone, integrating the language and visual modalities into a single transformer. The model uses a unified discrete token vocabulary for text and visual content, enabling text-to-image, image-to-video, and text-to-video generation within a single framework. This design eliminates the need for external text encoders, reducing system complexity and inference latency.
Implementation Details:
- Tokenization: Text is tokenized with Chameleon's tokenizer; visual content is discretized by a VQ-based tokenizer. The shared vocabulary is partitioned into 65,536 text tokens and 64,000 visual tokens.
- Sequence Formatting: Text and visual tokens are interleaved in one sequence, with text tokens encoding metadata (prompt, resolution, fps, frame count). This supports variable aspect ratios and flexible conditioning; a formatting sketch follows this list.
- Architecture: RMSNorm, SwiGLU, and QK-Norm are incorporated for stability and efficiency. Model sizes range from 0.5B to 3.6B parameters.
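To make the unified sequence concrete, here is a minimal sketch of how such a prompt-plus-video sequence could be assembled. The helper name, the metadata marker strings, the raster-scan flattening order, and the `tokenizer.encode` interface are illustrative assumptions, not the paper's exact format.

```python
import torch

def build_sequence(text_ids, video_codes, tokenizer, fps, text_vocab=65_536):
    """Illustrative sketch of unified sequence packing (not the official format):
    metadata and prompt stay as text tokens, then the VQ indices of each frame
    are flattened in raster-scan order and shifted past the text vocabulary so
    both modalities share one embedding table."""
    T, H, W = video_codes.shape                                    # discrete codes per frame
    meta = tokenizer.encode(f"<res={H}x{W}><fps={fps}><frames={T}>")  # hypothetical markers
    visual = (video_codes.flatten() + text_vocab).tolist()            # offset visual ids
    return torch.tensor(meta + text_ids + visual, dtype=torch.long)
```

Because resolution and frame count are expressed as ordinary text tokens, the same model can, in principle, be conditioned on different aspect ratios and clip lengths without architectural changes.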
Spatiotemporal Position Encoding: MM-RoPE
A central contribution is the Multi-Modal Rotary Position Embedding (MM-RoPE), designed to inject spatiotemporal priors into the transformer attention mechanism. Standard 1D RoPE, effective for text, is insufficient for video due to the need to model temporal and spatial dependencies jointly.
Key Innovations:
- Distributed Frequency Allocation: MM-RoPE interleaves temporal, height, and width positional encodings across the embedding dimensions, so every axis receives a comprehensive frequency spectrum. This contrasts with prior 3D RoPE variants, which often underrepresent the spatial axes (see the sketch after this list).
- Scaling for Modality Balance: The positional indices for visual tokens are empirically scaled to match the range of text tokens, improving cross-modal alignment and preventing underfitting of visual context.
- Plug-and-Play for Unified Models: MM-RoPE preserves the original RoPE for text tokens, maintaining language modeling capacity while enhancing visual generation.
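The core idea of distributed frequency allocation can be illustrated with a short sketch. This is a simplified reading of the mechanism, not the released implementation: the round-robin group size, the function names, and the per-axis scale argument are assumptions.

```python
import torch

def mmrope_frequencies(head_dim: int, base: float = 10_000.0, group: int = 2):
    """Assign rotary channels to (t, h, w) in a round-robin pattern so each
    axis samples low, middle, and high frequencies, instead of receiving one
    contiguous, band-limited chunk of the spectrum."""
    half = head_dim // 2                                             # rotary pairs
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    axis_of = (torch.arange(half) // group) % 3                      # 0=t, 1=h, 2=w
    return inv_freq, axis_of

def mmrope_angles(t, h, w, inv_freq, axis_of, scale=(1.0, 1.0, 1.0)):
    """Per-token rotation angles: each channel rotates by the position of its
    assigned axis; visual indices can be scaled to match the text index range."""
    pos = torch.stack([t * scale[0], h * scale[1], w * scale[2]], dim=-1)  # (..., 3)
    return pos[..., axis_of] * inv_freq                                     # (..., half)
```

In this sketch, giving a text token the same 1D position for all three coordinates reduces the scheme to standard RoPE, which is how the language modeling path can remain untouched.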
Practical Implications:
- MM-RoPE introduces negligible inference overhead (<5% latency increase).
- Ablation studies confirm that distributed frequency allocation is the dominant factor for improved convergence and generation quality.
Autoregressive Discrete Diffusion Forcing (AR-DF)
To address the inefficiency and loss imbalance inherent in next-token prediction for video, Lumos-1 introduces AR-DF, a training and inference strategy inspired by discrete diffusion models.
Training:
- Temporal Tube Masking: Instead of masking tokens independently at random, a spatial mask is sampled for the first frame and repeated across all frames, forming masked tubes through time (sketched after this list). This prevents the model from copying co-located tokens from neighboring frames, so later frames are not trivially predictable from earlier ones.
- Loss Computation: Cross-entropy loss is computed only on the masked tokens, focusing learning on the genuinely uncertain positions rather than the visible context.
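A minimal sketch of this masking and loss scheme is given below. It assumes a frame-major token layout and a fixed scalar mask ratio; the function names and tensor shapes are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def temporal_tube_mask(num_frames, height, width, mask_ratio, device="cpu"):
    """Sample one spatial mask for the first frame and repeat it over time,
    so masked positions form tubes that span every frame."""
    spatial = torch.rand(height, width, device=device) < mask_ratio   # True = masked
    return spatial.unsqueeze(0).expand(num_frames, height, width)     # (T, H, W)

def masked_token_loss(logits, targets, mask):
    """Cross-entropy restricted to masked positions.
    logits: (T, H, W, V); targets and mask: (T, H, W)."""
    return F.cross_entropy(logits[mask], targets[mask])
```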
Inference:
- Partial Context Masking: At each step, a portion of the generated frame is masked (with a tunable ratio), mirroring the training regime. This maintains consistency between training and inference, preventing quality and motion degradation.
- Efficient Decoding: The approach decodes tokens in parallel within a frame (intra-frame bidirectionality) and sequentially across frames (inter-frame causality), balancing quality and efficiency; see the decoding sketch below.
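The following sketch shows how such a decoding loop could look. It is a simplified approximation under several assumptions: the model is called on the full token sequence and returns per-position logits, intra-frame refinement uses a small fixed number of argmax passes, and `mask_id` plus the helper names are hypothetical.

```python
import torch

@torch.no_grad()
def ardf_style_decode(model, prefix, num_frames, tokens_per_frame,
                      mask_ratio=0.5, mask_id=0, refine_steps=4):
    """Frames are generated causally; tokens within a frame are filled in
    parallel; a fraction of each finished frame is re-masked before being
    appended to the context, mirroring the partially masked training inputs."""
    context, frames = prefix, []                                   # prefix: (1, L0) long
    for _ in range(num_frames):
        frame = torch.full((1, tokens_per_frame), mask_id,
                           dtype=torch.long, device=context.device)
        for _ in range(refine_steps):                              # parallel intra-frame fill
            logits = model(torch.cat([context, frame], dim=1))     # (1, L, V) assumed
            frame = logits[:, -tokens_per_frame:].argmax(dim=-1)
        frames.append(frame)
        keep = torch.rand_like(frame, dtype=torch.float) >= mask_ratio
        context = torch.cat(
            [context, torch.where(keep, frame, torch.full_like(frame, mask_id))], dim=1)
    return torch.cat(frames, dim=1)
```

Greedy argmax is used here purely for brevity; temperature sampling or confidence-based progressive unmasking would be the more typical choice within each refinement pass.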
Implementation Considerations:
- Memory Efficiency: Flash attention and chunked cross-entropy are used to manage the large vocabulary and long sequences (a chunked-loss sketch follows this list).
- Stage-wise Training: The model is first trained on text-to-image, then on joint image/video data, and finally on higher resolutions, ensuring stable convergence.
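Chunked cross-entropy is a standard memory-saving pattern; the sketch below shows one generic way to implement it and is not the paper's exact code. The output-projection argument and chunk size are assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, targets, out_proj, chunk_size=4096):
    """Project hidden states to the vocabulary chunk by chunk so the full
    (num_tokens x vocab_size) logits tensor is never materialized at once.
    hidden: (N, D), targets: (N,), out_proj: (V, D) output projection weight."""
    total, count = hidden.new_zeros(()), 0
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]
        logits = h @ out_proj.t()                                  # (chunk, V)
        total = total + F.cross_entropy(logits, t, reduction="sum")
        count += t.numel()
    return total / count
```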
Empirical Results
Lumos-1 demonstrates strong performance across multiple benchmarks:
- Text-to-Image (GenEval): Achieves results on par with or exceeding diffusion models of similar or larger size, particularly in position and attribute binding.
- Image-to-Video (VBench-I2V): Matches or surpasses leading models, despite using significantly less data and compute.
- Text-to-Video (VBench-T2V): Comparable to state-of-the-art diffusion models, with robust object-centric metrics.
Notably, Lumos-1 achieves these results with only 48 GPUs, highlighting the efficiency of the unified architecture and training scheme.
Analysis and Ablations
- MM-RoPE: Increasing the number of meta MM-RoPE groups (i.e., finer frequency slicing) consistently improves spatiotemporal modeling.
- Scaling Factors: Moderate scaling of positional indices (e.g., (4,8,8) for temporal, height, width) suffices; excessive scaling yields diminishing returns.
- AR-DF Mask Ratio: Inference quality is robust for mask ratios between 0.3 and 0.7; outside this range, either context is insufficient or temporal continuity is disrupted.
- Aspect Ratio Robustness: The model generalizes well to unseen aspect ratios due to the unified codebook and flexible sequence formatting.
Implications and Future Directions
Practical Implications:
- Unified Multimodal Generation: Lumos-1's architecture supports flexible conditioning and multi-task generation without architectural changes or external encoders.
- Efficient Training and Inference: The combination of MM-RoPE and AR-DF enables high-quality video synthesis with manageable compute and memory requirements.
- Scalability: The approach is amenable to larger datasets and model sizes, with clear paths for scaling.
Theoretical Implications:
- Position Encoding in Multimodal Transformers: The work highlights the importance of frequency allocation and modality balancing in position encoding for unified models.
- Training Dynamics in Autoregressive Video Models: Temporal tube masking and partial context inference address fundamental challenges in loss balancing and temporal coherence.
Future Developments:
- Scaling Data and Model Size: Expanding the training corpus and model capacity is expected to further improve generalization, especially for complex dynamics and rare scenarios.
- Multimodal Knowledge Infusion: Integrating pretrained vision-language models or co-training with understanding tasks could enhance grounding and compositionality.
- Advanced Position Encoding: More sophisticated scaling or adaptive position encoding strategies may yield further gains in cross-modal alignment.
Conclusion
Lumos-1 establishes a practical and theoretically grounded framework for unified autoregressive video generation. By addressing spatiotemporal position encoding and training-inference consistency, it demonstrates that LLM architectures can be effectively extended to high-quality, efficient video synthesis. The proposed techniques, MM-RoPE and AR-DF, are broadly applicable to future unified multimodal models, with significant implications for both research and deployment in generative AI.