Lumos-1: Unified Autoregressive Video Generation

Updated 14 July 2025
  • Lumos-1 is an autoregressive video generation framework that adapts standard LLM architecture with targeted modifications for video data.
  • It introduces MM‑RoPE to balance temporal, height, and width positional cues, enhancing spatiotemporal dependency modeling.
  • The framework employs AR‑DF and temporal tube masking to equalize frame learning and achieves efficient, high-quality video synthesis on modest hardware.

Lumos-1 refers to an autoregressive video generation framework grounded in the LLM paradigm, with minimal departures from standard LLM architecture. Its principal goal is to achieve scalable, unified video synthesis using techniques compatible with high-throughput, memory-efficient training, while addressing the fundamental challenges of spatiotemporal correlation modeling and token dependency inherent to the video domain (2507.08801).

1. Model Design and Spatiotemporal Positional Encoding

Lumos-1 adheres closely to the LLM architecture (e.g., LLaMA), introducing only targeted modifications to efficiently accommodate video data. The central innovation is the development of MM‑RoPE (“multimodal rotary position embedding”), which generalizes rotary position embeddings from text tokens to visual tokens with explicit 3D (temporal, height, width) structure. MM‑RoPE operates by partitioning the input channel space into smaller “meta” groups, each assigned a balanced set of frequencies across time, height, and width dimensions. Within each group, the rotary embeddings are computed for the corresponding positional index $(\tau_t, \tau_h, \tau_w)$, scaling to support both video and textual modalities in a unified token sequence.

The MM‑RoPE scheme differs from naive 1D or 3D rotary embedding approaches by explicitly calibrating the frequency spectrum to avoid overrepresentation of any single dimension, resulting in improved modeling of long-range spatiotemporal dependencies. The design maintains full compatibility with pretraining on text and enables plug-in extension to visual-textual or multimodal corpora.
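As a concrete illustration of this balanced allocation, the sketch below lays out the rotary channel pairs of a single attention head into repeating meta groups and assigns each pair to the temporal, height, or width axis. The group size, the t/h/w ratio within a group, and all names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def mm_rope_frequency_layout(head_dim: int = 64, base: float = 10000.0):
    """Illustrative channel layout for an MM-RoPE-style embedding (assumed values).

    The rotary channel pairs of one attention head are split into small "meta"
    groups; inside each group the pairs are assigned to the temporal (t),
    height (h), and width (w) axes, so every axis receives a mix of low- and
    high-frequency components instead of monopolising one end of the spectrum.
    """
    n_pairs = head_dim // 2
    # Standard RoPE-style frequency spectrum over all channel pairs.
    freqs = base ** (-torch.arange(n_pairs, dtype=torch.float32) / n_pairs)

    # Assumed repeating axis pattern inside each meta group of 8 channel pairs;
    # the exact ratio used by Lumos-1 may differ.
    pattern = ["t", "h", "w", "h", "w", "t", "h", "w"]
    axis_of_pair = [pattern[i % len(pattern)] for i in range(n_pairs)]

    return {
        ax: freqs[torch.tensor([i for i, a in enumerate(axis_of_pair) if a == ax])]
        for ax in ("t", "h", "w")
    }

if __name__ == "__main__":
    for ax, f in mm_rope_frequency_layout().items():
        print(ax, "pairs:", len(f), "freq range:", float(f.min()), "to", float(f.max()))
```

Because each axis draws its frequencies from across the whole spectrum rather than from one contiguous band, no single dimension monopolizes the low-frequency (long-range) components.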

2. Token Dependency: Intra-Frame Bidirectionality and Inter-Frame Causality

A central challenge in autoregressive video generation is the need to respect the distinct statistical properties of spatial (within-frame) and temporal (across-frame) dependencies. Lumos-1 addresses this by enforcing:

  • Intra-frame bidirectionality: All tokens within a given video frame attend to each other in a fully bidirectional (non-causal) manner, reflecting the spatial coherence of image modeling.
  • Inter-frame temporal causality: Tokens of a given frame attend only to tokens from earlier frames (not future ones), ensuring strict temporal causality during sequence generation.

This hybrid dependency pattern is implemented in the attention mask, so that the model can simultaneously produce spatially consistent frames and temporally coherent video sequences. Such a strategy preserves the high-fidelity image generation benefits of bidirectionality while ensuring that video progression respects real-world causality.
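A minimal sketch of this hybrid attention mask over the video portion of the token sequence is shown below (text prefix tokens are omitted); the function name and toy shapes are assumptions for illustration.

```python
import torch

def lumos_attention_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask over video tokens: True means attention is allowed.
    Tokens attend bidirectionally within their own frame and causally to
    tokens of earlier frames, never to future frames."""
    n = num_frames * tokens_per_frame
    frame_idx = torch.arange(n) // tokens_per_frame   # frame id of each token
    q_frames = frame_idx.unsqueeze(1)                 # (n, 1) query frame ids
    k_frames = frame_idx.unsqueeze(0)                 # (1, n) key frame ids
    return k_frames <= q_frames                       # same frame or any earlier frame

# Example: 3 frames of 4 tokens each -> a 12x12 block lower-triangular mask.
print(lumos_attention_mask(num_frames=3, tokens_per_frame=4).int())
```

The resulting mask is block lower-triangular: diagonal blocks are fully populated (intra-frame bidirectionality), while blocks above the diagonal are zero (inter-frame causality).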

3. Autoregressive Discrete Diffusion Forcing (AR-DF) and Temporal Tube Masking

A noted problem in prior autoregressive video generation systems is frame-wise loss imbalance: later video frames, conditioned on more context, become easier to predict than earlier frames, and spatial redundancy makes it trivial for the model to copy information across frames, reducing effective learning of temporal propagation.

Lumos-1 introduces AR‑DF (“Autoregressive Discrete Diffusion Forcing”), a masking-based objective directly inspired by discrete diffusion. During training, a temporal tube mask is sampled: a spatial mask (shared across all frames) is constructed using a Bernoulli distribution (with masking ratio ρ), and the masked visual tokens in each frame are replaced by special [MASK] tokens. During inference, a compatible masking policy is adopted, partially revealing generated frames with masking ratio ρ_inf.

Formally, for each location $i$ in every frame $t$:

$$M_i \sim \operatorname{Bernoulli}(1-\rho)$$

$$\tilde{X}_v^{(t)} = M \odot X_v^{(t)} + (1 - M) \odot [\text{MASK}]$$

The input sequence is then

$$X_\text{text},\; \tilde{X}_v^{(1)}, \ldots, \tilde{X}_v^{(T)}$$

The cross-entropy loss is computed only on unmasked positions:

$$L(\hat{X}, X, M) = \sum_{i:\, M_i = 1} \operatorname{CE}(\hat{X}_i, X_i)$$

AR‑DF, particularly via temporal tube masking, equalizes the training difficulty across frames, discourages shortcut learning (leaking spatial details), and forces the model to genuinely learn both spatial structure and temporal propagation.
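A minimal sketch of the tube masking and loss computation defined above is given below, using stand-in tensors in place of the tokenizer output and model logits; the vocabulary size, the [MASK] token id, and the masking ratio are assumed values.

```python
import torch
import torch.nn.functional as F

def temporal_tube_mask(height: int, width: int, rho: float) -> torch.Tensor:
    """Sample one spatial keep-mask M_i ~ Bernoulli(1 - rho); True means the token
    is kept, False means it is replaced by [MASK]. The same spatial mask is
    shared by every frame (the temporal "tube")."""
    return torch.bernoulli(torch.full((height, width), 1.0 - rho)).bool()

def apply_tube_mask(video_tokens, keep, mask_token_id):
    """video_tokens: (T, H, W) integer token ids. Broadcast the spatial mask over
    the temporal axis and substitute [MASK] wherever it is False."""
    keep_t = keep.unsqueeze(0).expand_as(video_tokens)
    return torch.where(keep_t, video_tokens,
                       torch.full_like(video_tokens, mask_token_id))

def ar_df_loss(logits, targets, keep):
    """Cross-entropy summed over positions with M_i = 1, matching the loss above.
    logits: (T, H, W, V); targets: (T, H, W)."""
    keep_t = keep.unsqueeze(0).expand_as(targets)
    return F.cross_entropy(logits[keep_t], targets[keep_t], reduction="sum")

# Toy usage with an assumed vocabulary size, [MASK] id, and masking ratio.
T, H, W, V = 4, 8, 8, 1024
tokens = torch.randint(0, V, (T, H, W))
keep = temporal_tube_mask(H, W, rho=0.3)
masked_input = apply_tube_mask(tokens, keep, mask_token_id=V)  # [MASK] id assumed = V
logits = torch.randn(T, H, W, V)                               # stand-in for model output
print(ar_df_loss(logits, tokens, keep).item())
```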

4. Memory-Efficient Training and Implementation Considerations

Recognizing the computational demands of video generation, Lumos-1 incorporates several strategies to enhance memory efficiency:

  • Flash attention: for scalable long-sequence attention computation.
  • Chunked cross-entropy loss: to enable practical training with large codebooks and long sequence lengths (see the sketch below).
  • Efficient distributed pretraining: the model was trained with only 48 GPUs, substantially fewer than prior systems of comparable capacity.

These design choices permit wide adoption in scenarios with constrained hardware, and facilitate rapid exploration of architectural variants.
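As an illustration of the chunked cross-entropy idea listed above, the sketch below evaluates the output projection and the loss one slice of positions at a time, so the full positions-by-vocabulary logits matrix is never materialized; the sizes, names, and chunking granularity are assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, targets, out_proj, chunk_size: int = 1024):
    """Cross-entropy over a long token sequence without materialising the full
    (N, vocab) logits matrix: only one (chunk_size, vocab) slice exists at a time.

    hidden:   (N, d) final hidden states
    targets:  (N,)  ground-truth token ids
    out_proj: (vocab, d) output projection weight
    """
    total = hidden.new_zeros(())
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]
        logits = h @ out_proj.t()                 # (chunk, vocab), freed after this step
        total = total + F.cross_entropy(logits, t, reduction="sum")
    return total / targets.numel()

# Toy usage with assumed sizes (a large visual codebook, a long token sequence).
N, d, vocab = 8192, 128, 8192
hidden = torch.randn(N, d)
out_proj = torch.randn(vocab, d)
targets = torch.randint(0, vocab, (N,))
print(chunked_cross_entropy(hidden, targets, out_proj).item())
```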

5. Empirical Performance and Benchmarking

Lumos-1 achieves results comparable to leading autoregressive video generation models. Benchmarks reported in the source include:

  • Text-to-image (GenEval): Performance parallels that of state-of-the-art models such as SD-XL and FLUX.
  • Text-to-video and image-to-video (VBench-T2V, VBench-I2V): Comparable to EMU3, COSMOS-Video2World, and OpenSoraPlan.
  • Hardware efficiency: training and inference efficiency is emphasized, with competitive results obtained using a modest computational budget.

This suggests that the architectural and objective design choices in Lumos-1 yield competitive visual quality and temporal consistency without reliance on computational overprovisioning.

6. Technical Details: MM‑RoPE and Attention Mechanisms

At the core of Lumos-1 is MM‑RoPE. For a given query or key vector $x_m$ at position $m$, with rotation matrix $R_{\Theta, m}$,

$$f_{q,k}(x_m, m) = R_{\Theta, m} \, W_{q,k} \, x_m$$

where $R_{\Theta, m}$ is a block-diagonal matrix of $2 \times 2$ rotations:

$$R_{\Theta, m} = \operatorname{diag}\!\left(R_{\theta_1, m}, \ldots, R_{\theta_{d/2}, m}\right), \qquad R_{\theta, m} = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix}$$

In the multimodal (image/video) case, attention is computed as the sum of rotations applied independently to the channel partitions associated with temporal, height, and width indices.
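One way such per-axis rotations over channel partitions can be realized is sketched below; the partition sizes, frequency choices, and function names are assumptions rather than Lumos-1's exact implementation.

```python
import torch

def rope_rotate(x, pos, freqs):
    """Apply the 2x2 rotations R_{theta, m} to consecutive channel pairs of x.
    x: (..., 2 * len(freqs)); pos: position index m for this axis."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    angles = pos * freqs                      # m * theta_j for each channel pair
    cos, sin = torch.cos(angles), torch.sin(angles)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def mm_rope_apply(x, t, h, w, freqs_t, freqs_h, freqs_w):
    """Rotate the temporal, height, and width channel partitions of a query/key
    vector by their own positional indices (partition sizes are assumed)."""
    d_t, d_h = 2 * len(freqs_t), 2 * len(freqs_h)
    xt, xh, xw = x[..., :d_t], x[..., d_t:d_t + d_h], x[..., d_t + d_h:]
    return torch.cat([rope_rotate(xt, t, freqs_t),
                      rope_rotate(xh, h, freqs_h),
                      rope_rotate(xw, w, freqs_w)], dim=-1)

# Toy usage: a 64-dim head split 16/24/24 across (t, h, w); frequencies assumed.
freqs_t = 10000.0 ** (-torch.arange(8) / 8)
freqs_h = 10000.0 ** (-torch.arange(12) / 12)
freqs_w = 10000.0 ** (-torch.arange(12) / 12)
q = torch.randn(64)
q_rot = mm_rope_apply(q, t=3, h=5, w=7,
                      freqs_t=freqs_t, freqs_h=freqs_h, freqs_w=freqs_w)
print(q_rot.shape)
```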

The temporal tube mask is implemented by sampling a single binary mask per spatial location and applying it uniformly over all frames, strengthening the spatial-temporal learning signal. The AR‑DF loss clarifies which positions are to be predicted in each sequence.

7. Applications and Future Directions

Lumos-1’s design allows direct adaptation to a wide variety of video synthesis scenarios. The model is directly applicable to text-to-video, image-to-video, or hybrid video generation tasks, and its unified architecture supports further extension to additional modalities or interaction scenarios (for example, video editing with text prompts).

Future research directions identified in the source include further scaling, improved multimodal pretraining, and deployment for interactive or real-time video synthesis. The MM‑RoPE formulation may be generalized to broader multimodal modeling tasks requiring spatiotemporal structure beyond video.


Lumos-1 brings together LLM principles and autoregressive video generation, with distinctive contributions in multimodal positional encoding, dependency structure, robust masking objectives, and memory efficiency. These advances provide a clear foundation for current and future research in unified, scalable video generation and multimodal modeling (2507.08801).

References

1. Lumos-1: Unified Autoregressive Video Generation. arXiv:2507.08801.