- The paper introduces a unified LLM-based model that seamlessly integrates text and visual tokens for text-to-image, image-to-video, and text-to-video generation.
- The paper's novel MM-RoPE mechanism interleaves temporal, height, and width encodings to enhance spatiotemporal modeling and cross-modal alignment.
- The paper employs AR-DF strategies with temporal tube and partial context masking to optimize training and inference for efficient, high-quality video synthesis.
Lumos-1: A Unified Autoregressive Model for Video Generation
Lumos-1 presents a unified approach to autoregressive video generation, leveraging the architectural principles of LLMs with minimal modifications. The work addresses key challenges in extending LLMs to the spatiotemporal domain, proposing novel mechanisms for position encoding and training dynamics that are critical for high-fidelity, efficient video synthesis.
Model Architecture and Unified Tokenization
Lumos-1 adopts the Llama architecture as its backbone, integrating the language and visual modalities into a single transformer. The model uses a unified discrete token vocabulary for text and visual content, enabling text-to-image, image-to-video, and text-to-video generation within a single framework. This design eliminates the need for external text encoders, reducing system complexity and inference latency.
Implementation Details:
- Tokenization: Text is tokenized with Chameleon's tokenizer; visual content is discretized by a VQ-based tokenizer. The shared vocabulary is partitioned into 65,536 text tokens and 64,000 visual tokens.
- Sequence Formatting: Text and visual tokens are interleaved in one sequence, with text tokens encoding metadata (prompt, resolution, fps, frame count). This supports variable aspect ratios and flexible conditioning; a formatting sketch follows this list.
- Architecture: RMSNorm, SwiGLU, and QK-Norm are incorporated for stability and efficiency. Model sizes range from 0.5B to 3.6B parameters.
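To make the unified sequence concrete, here is a minimal sketch of how such a prompt-plus-video sequence could be assembled. The helper name, the metadata marker strings, the raster-scan flattening order, and the `tokenizer.encode` interface are illustrative assumptions, not the paper's exact format.

```python
import torch

def build_sequence(text_ids, video_codes, tokenizer, fps, text_vocab=65_536):
    """Illustrative sketch of unified sequence packing (not the official format):
    metadata and prompt stay as text tokens, then the VQ indices of each frame
    are flattened in raster-scan order and shifted past the text vocabulary so
    both modalities share one embedding table."""
    T, H, W = video_codes.shape                                    # discrete codes per frame
    meta = tokenizer.encode(f"<res={H}x{W}><fps={fps}><frames={T}>")  # hypothetical markers
    visual = (video_codes.flatten() + text_vocab).tolist()            # offset visual ids
    return torch.tensor(meta + text_ids + visual, dtype=torch.long)
```

Because resolution and frame count are expressed as ordinary text tokens, the same model can, in principle, be conditioned on different aspect ratios and clip lengths without architectural changes.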
Spatiotemporal Position Encoding: MM-RoPE
A central contribution is the Multi-Modal Rotary Position Embedding (MM-RoPE), designed to inject spatiotemporal priors into the transformer attention mechanism. Standard 1D RoPE, effective for text, is insufficient for video due to the need to model temporal and spatial dependencies jointly.
Key Innovations:
- Distributed Frequency Allocation: MM-RoPE interleaves temporal, height, and width positional encodings across the embedding dimensions, so every axis receives a comprehensive frequency spectrum. This contrasts with prior 3D RoPE variants, which often underrepresent the spatial axes (see the sketch after this list).
- Scaling for Modality Balance: The positional indices for visual tokens are empirically scaled to match the range of text tokens, improving cross-modal alignment and preventing underfitting of visual context.
- Plug-and-Play for Unified Models: MM-RoPE preserves the original RoPE for text tokens, maintaining language modeling capacity while enhancing visual generation.
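The core idea of distributed frequency allocation can be illustrated with a short sketch. This is a simplified reading of the mechanism, not the released implementation: the round-robin group size, the function names, and the per-axis scale argument are assumptions.

```python
import torch

def mmrope_frequencies(head_dim: int, base: float = 10_000.0, group: int = 2):
    """Assign rotary channels to (t, h, w) in a round-robin pattern so each
    axis samples low, middle, and high frequencies, instead of receiving one
    contiguous, band-limited chunk of the spectrum."""
    half = head_dim // 2                                             # rotary pairs
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    axis_of = (torch.arange(half) // group) % 3                      # 0=t, 1=h, 2=w
    return inv_freq, axis_of

def mmrope_angles(t, h, w, inv_freq, axis_of, scale=(1.0, 1.0, 1.0)):
    """Per-token rotation angles: each channel rotates by the position of its
    assigned axis; visual indices can be scaled to match the text index range."""
    pos = torch.stack([t * scale[0], h * scale[1], w * scale[2]], dim=-1)  # (..., 3)
    return pos[..., axis_of] * inv_freq                                     # (..., half)
```

In this sketch, giving a text token the same 1D position for all three coordinates reduces the scheme to standard RoPE, which is how the language modeling path can remain untouched.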
Practical Implications:
- MM-RoPE introduces negligible inference overhead (<5% latency increase).
- Ablation studies confirm that distributed frequency allocation is the dominant factor for improved convergence and generation quality.
Autoregressive Discrete Diffusion Forcing (AR-DF)
To address the inefficiency and loss imbalance inherent in next-token prediction for video, Lumos-1 introduces AR-DF, a training and inference strategy inspired by discrete diffusion models.
Training:
- Temporal Tube Masking: Instead of masking tokens independently at random, a spatial mask is sampled for the first frame and repeated across all frames, forming masked tubes through time (sketched after this list). This prevents the model from copying co-located tokens from neighboring frames, so later frames are not trivially predictable from earlier ones.
- Loss Computation: Cross-entropy loss is computed only on the masked tokens, focusing learning on the genuinely uncertain positions rather than the visible context.
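A minimal sketch of this masking and loss scheme is given below. It assumes a frame-major token layout and a fixed scalar mask ratio; the function names and tensor shapes are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def temporal_tube_mask(num_frames, height, width, mask_ratio, device="cpu"):
    """Sample one spatial mask for the first frame and repeat it over time,
    so masked positions form tubes that span every frame."""
    spatial = torch.rand(height, width, device=device) < mask_ratio   # True = masked
    return spatial.unsqueeze(0).expand(num_frames, height, width)     # (T, H, W)

def masked_token_loss(logits, targets, mask):
    """Cross-entropy restricted to masked positions.
    logits: (T, H, W, V); targets and mask: (T, H, W)."""
    return F.cross_entropy(logits[mask], targets[mask])
```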
Inference:
- Partial Context Masking: At each step, a portion of the generated frame is masked (with a tunable ratio), mirroring the training regime. This maintains consistency between training and inference, preventing quality and motion degradation.
- Efficient Decoding: The approach decodes tokens in parallel within a frame (intra-frame bidirectionality) and sequentially across frames (inter-frame causality), balancing quality and efficiency; see the decoding sketch below.
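The following sketch shows how such a decoding loop could look. It is a simplified approximation under several assumptions: the model is called on the full token sequence and returns per-position logits, intra-frame refinement uses a small fixed number of argmax passes, and `mask_id` plus the helper names are hypothetical.

```python
import torch

@torch.no_grad()
def ardf_style_decode(model, prefix, num_frames, tokens_per_frame,
                      mask_ratio=0.5, mask_id=0, refine_steps=4):
    """Frames are generated causally; tokens within a frame are filled in
    parallel; a fraction of each finished frame is re-masked before being
    appended to the context, mirroring the partially masked training inputs."""
    context, frames = prefix, []                                   # prefix: (1, L0) long
    for _ in range(num_frames):
        frame = torch.full((1, tokens_per_frame), mask_id,
                           dtype=torch.long, device=context.device)
        for _ in range(refine_steps):                              # parallel intra-frame fill
            logits = model(torch.cat([context, frame], dim=1))     # (1, L, V) assumed
            frame = logits[:, -tokens_per_frame:].argmax(dim=-1)
        frames.append(frame)
        keep = torch.rand_like(frame, dtype=torch.float) >= mask_ratio
        context = torch.cat(
            [context, torch.where(keep, frame, torch.full_like(frame, mask_id))], dim=1)
    return torch.cat(frames, dim=1)
```

Greedy argmax is used here purely for brevity; temperature sampling or confidence-based progressive unmasking would be the more typical choice within each refinement pass.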
Implementation Considerations:
- Memory Efficiency: Flash attention and chunked cross-entropy are used to manage the large vocabulary and long sequences (a chunked-loss sketch follows this list).
- Stage-wise Training: The model is first trained on text-to-image, then on joint image/video data, and finally on higher resolutions, ensuring stable convergence.
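Chunked cross-entropy is a standard memory-saving pattern; the sketch below shows one generic way to implement it and is not the paper's exact code. The output-projection argument and chunk size are assumptions.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, targets, out_proj, chunk_size=4096):
    """Project hidden states to the vocabulary chunk by chunk so the full
    (num_tokens x vocab_size) logits tensor is never materialized at once.
    hidden: (N, D), targets: (N,), out_proj: (V, D) output projection weight."""
    total, count = hidden.new_zeros(()), 0
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]
        logits = h @ out_proj.t()                                  # (chunk, V)
        total = total + F.cross_entropy(logits, t, reduction="sum")
        count += t.numel()
    return total / count
```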
Empirical Results
Lumos-1 demonstrates strong performance across multiple benchmarks:
- Text-to-Image (GenEval): Achieves results on par with or exceeding diffusion models of similar or larger size, particularly in position and attribute binding.
- Image-to-Video (VBench-I2V): Matches or surpasses leading models, despite using significantly less data and compute.
- Text-to-Video (VBench-T2V): Comparable to state-of-the-art diffusion models, with robust object-centric metrics.
Notably, Lumos-1 achieves these results with only 48 GPUs, highlighting the efficiency of the unified architecture and training scheme.
Analysis and Ablations
- MM-RoPE: Increasing the number of meta MM-RoPE groups (i.e., finer frequency slicing) consistently improves spatiotemporal modeling.
- Scaling Factors: Moderate scaling of positional indices (e.g., (4,8,8) for temporal, height, width) suffices; excessive scaling yields diminishing returns.
- AR-DF Mask Ratio: Inference quality is robust for mask ratios between 0.3 and 0.7; outside this range, either context is insufficient or temporal continuity is disrupted.
- Aspect Ratio Robustness: The model generalizes well to unseen aspect ratios due to the unified codebook and flexible sequence formatting.
Implications and Future Directions
Practical Implications:
- Unified Multimodal Generation: Lumos-1's architecture supports flexible conditioning and multi-task generation without architectural changes or external encoders.
- Efficient Training and Inference: The combination of MM-RoPE and AR-DF enables high-quality video synthesis with manageable compute and memory requirements.
- Scalability: The approach is amenable to larger datasets and model sizes, with clear paths for scaling.
Theoretical Implications:
- Position Encoding in Multimodal Transformers: The work highlights the importance of frequency allocation and modality balancing in position encoding for unified models.
- Training Dynamics in Autoregressive Video Models: Temporal tube masking and partial context inference address fundamental challenges in loss balancing and temporal coherence.
Future Developments:
- Scaling Data and Model Size: Expanding the training corpus and model capacity is expected to further improve generalization, especially for complex dynamics and rare scenarios.
- Multimodal Knowledge Infusion: Integrating pretrained vision-language models or co-training with understanding tasks could enhance grounding and compositionality.
- Advanced Position Encoding: More sophisticated scaling or adaptive position encoding strategies may yield further gains in cross-modal alignment.
Conclusion
Lumos-1 establishes a practical and theoretically grounded framework for unified autoregressive video generation. By addressing spatiotemporal position encoding and training-inference consistency, it demonstrates that LLM architectures can be effectively extended to high-quality, efficient video synthesis. The proposed techniques, MM-RoPE and AR-DF, are broadly applicable to future unified multimodal models, with significant implications for both research and deployment in generative AI.