MAGI-1: Autoregressive Video Generation at Scale (2505.13211v1)

Published 19 May 2025 in cs.CV and cs.AI

Abstract: We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at https://github.com/SandAI-org/MAGI-1 and https://github.com/SandAI-org/MagiAttention. The product can be accessed at https://sand.ai.

Summary

  • The paper introduces Magi-1’s novel chunk-wise autoregressive diffusion framework that efficiently predicts fixed-length video chunks with causal temporal modeling.
  • It employs a Transformer-based VAE and a modified Diffusion Transformer with block-causal and parallel attention mechanisms to optimize inference speed and stability.
  • The system unifies T2V, I2V, and video continuation through multi-task training and advanced infrastructure, achieving improved fidelity and controllability.

Magi-1 (2505.13211) is presented as a large-scale, autoregressive video generation model that operates by predicting a sequence of video chunks. It is designed to overcome limitations in existing large-scale video diffusion models, particularly concerning causal temporal modeling, streaming generation, and efficient inference for long videos. The paper details Magi-1's architecture, training methodology, inference strategies, data pipeline, and infrastructure, demonstrating its practical application in high-fidelity, controllable video synthesis across tasks like image-to-video (I2V), text-to-video (T2V), and video continuation.

Core Approach: Chunk-wise Autoregressive Denoising

Unlike models that generate an entire video sequence simultaneously, Magi-1 generates video in fixed-length chunks (e.g., 24 frames). The model autoregressively predicts the next chunk conditioned on all previously generated chunks. This process leverages a diffusion-based denoising objective but applies it chunk-wise, with noise levels that increase monotonically from earlier to later chunks at any point in time. This design naturally supports causal temporal modeling and streaming generation, where processing of the next chunk can begin before the current one is fully denoised, enabling a pipelined approach (as illustrated in Figure 1).
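
To make the pipelined schedule concrete, here is a minimal, purely illustrative simulation (not the paper's implementation): a fixed number of chunks are denoised concurrently, later chunks always carry more remaining noise than earlier ones, and each chunk is streamed out as soon as it finishes. The step count and pipeline depth are assumed values for illustration.

```python
# Toy sketch of MAGI-1's pipelined, chunk-wise denoising schedule (assumptions,
# not the official implementation): chunks entering later are always noisier,
# so finished chunks can stream out while later chunks are still being denoised.

NUM_DENOISE_STEPS = 8   # steps to fully denoise one chunk (assumed)
PIPELINE_DEPTH = 4      # chunks denoised concurrently (assumed)

def stream_chunks(num_chunks):
    """Yield chunk indices in the order they finish denoising."""
    remaining = {}          # chunk index -> remaining denoising steps
    next_chunk = 0
    while remaining or next_chunk < num_chunks:
        # Admit a new, fully noisy chunk if the pipeline has room.
        if next_chunk < num_chunks and len(remaining) < PIPELINE_DEPTH:
            remaining[next_chunk] = NUM_DENOISE_STEPS
            next_chunk += 1
        # One shared denoising step: every active chunk gets one step cleaner,
        # so the oldest chunk in the pipeline always stays the least noisy.
        for i in list(remaining):
            remaining[i] -= 1
            if remaining[i] == 0:
                del remaining[i]
                yield i     # stream this chunk out immediately

print(list(stream_chunks(6)))   # chunks finish strictly in order: [0, 1, 2, 3, 4, 5]
```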

Model Architecture

Magi-1 operates in a compressed latent space obtained via a Transformer-based Variational Auto-Encoder (VAE).

  • Transformer-based VAE: To enhance efficiency compared to traditional convolutional VAEs, Magi-1 uses a transformer architecture for its VAE (Figure 2). The encoder downsamples input video/image frames using a 3D convolution and transformer blocks, projecting to a latent space. The decoder mirrors this structure, using pixel shuffling and 3D convolution to reconstruct the pixel space. Training involves two stages: fixed-resolution short clips, followed by variable spatial resolutions and aspect ratios with joint image/video data. Losses include L1, KL divergence, LPIPS, and GAN loss. Inference uses a sliding window approach for arbitrary resolutions. This VAE achieves fast decoding times (Table 1) despite a larger parameter count, crucial for efficient inference.
  • Auto-Regressive Denoising Model: Built on the Diffusion Transformer (DiT) architecture (Figure 3), Magi-1 incorporates several modifications tailored for autoregressive video generation:
    • Block-Causal Attention: Full attention within each chunk and causal attention across chunks are enforced, using a learnable 3D RoPE positional embedding and a custom Flexible-Flash-Attention kernel built on FlashAttention-3 to handle the flexible mask patterns (a toy mask construction is sketched after this list).
    • Parallel Attention Block: Spatial-temporal self-attention and cross-attention are parallelized to reduce Tensor Parallel communication overhead.
    • QK-Norm and GQA: Normalizing queries and keys improves stability, and Grouped-Query Attention (GQA) reduces memory.
    • Sandwich Normalization in FFN: LayerNorm is added before and after Feed-Forward Networks (FFNs) for stability, especially in larger models.
    • SwiGLU: Used in the 24B model's FFN for improved performance.
    • Softcap Modulation: Applies a Softcap to timestep-based scaling factors from AdaLN to prevent instability in large models.
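
As a minimal sketch of the block-causal attention pattern described above (the paper implements it with a custom Flexible-Flash-Attention kernel rather than a dense mask), the snippet below builds the boolean mask in which tokens attend fully within their own chunk and causally to all earlier chunks:

```python
import torch

def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Boolean attention mask: full attention inside each chunk, causal
    attention across chunks (a token may attend to every token in its own
    chunk and in all earlier chunks, never to later chunks)."""
    n = num_chunks * tokens_per_chunk
    chunk_id = torch.arange(n) // tokens_per_chunk       # chunk index of each token
    # query i may attend to key j  iff  chunk_id[j] <= chunk_id[i]
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)   # (n, n), True = attend

mask = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
print(mask.int())
# Would be applied as e.g. attn_scores.masked_fill(~mask, float('-inf')) before softmax.
```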

Training and Distillation

  • Training Objective: Flow-matching is used, training the model to predict the velocity field towards the clean data. The key difference from bidirectional models is that noise timesteps $t_i$ increase monotonically across chunks ($t_i < t_j$ for $i < j$), and chunk $i$ is conditioned on the preceding chunks $j < i$ (a toy sketch of this objective follows this list).
  • Training Recipes: Training is done in three stages (Table 5) with increasing resolution and video length (up to 720p and 16s). Image and video data are used jointly. A specific Logit-Normal based timestep sampler (Figure 4) is used, biased towards lower noise levels ($t < 0.3$). Clean chunks from training data are handled specially: they are not conditioned on text, they are injected with a small amount of noise (5%) to mitigate exposure bias, and loss is applied only to noisy chunks (though clean chunks participate in attention).
  • Multi-Task Training: The autoregressive framework unifies T2V, I2V, and video continuation by simply adjusting the proportion of clean chunks in the input data during training. I2V is treated as continuation with only the first frame of the first chunk being clean (Figure 5). Chunk-wise text control is enabled by providing different text conditions for each chunk, supported by an auto-regressive captioning strategy (Table 4).
  • Distillation: A shortcut model (2410.12557) is used to reduce inference steps (e.g., from 64 to 8). This involves training the model to predict the velocity field conditioned on the desired step size $s$. Distillation targets are bootstrapped, and classifier-free guidance (CFG) distillation (2007.12598) is incorporated.
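
Below is a minimal sketch of the chunk-wise flow-matching objective under assumed conventions ($x_t = (1 - t)\,x_0 + t\,\epsilon$ with velocity target $\epsilon - x_0$); the `model(x_t, t, text_emb)` interface and tensor shapes are hypothetical stand-ins for the block-causal DiT:

```python
import torch
import torch.nn.functional as F

def chunkwise_flow_matching_loss(model, x0, text_emb):
    """Chunk-wise flow matching: per-chunk timesteps are sorted ascending so
    noise increases monotonically across chunks (t_i < t_j for i < j), and the
    (assumed) block-causal model conditions each chunk on the cleaner, earlier ones.

    x0: clean latents of shape (batch, num_chunks, tokens_per_chunk, dim).
    """
    b, c, _, _ = x0.shape
    t = torch.rand(b, c).sort(dim=1).values     # one timestep per chunk, ascending
    t_ = t.view(b, c, 1, 1)                     # broadcast over tokens and channels
    noise = torch.randn_like(x0)
    x_t = (1.0 - t_) * x0 + t_ * noise          # later chunks receive more noise
    v_target = noise - x0                       # flow-matching velocity target
    v_pred = model(x_t, t, text_emb)            # hypothetical block-causal DiT call
    return F.mse_loss(v_pred, v_target)
```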

Inference Strategies

  • Diffusion Guidance: An extended CFG formulation (Equation 5) incorporates guidance from both the text condition $c_{\text{text}}$ and the preceding chunks $x_{<i}$. A higher weight $w_{\text{prev}} > 1$ for the temporal context guidance helps maintain consistency between chunks and mitigate flickering artifacts (Figure 6); a hedged sketch of this composition follows this list.
  • Inference Timestep Sampler: A tuned power transformation based on the training sampler improves visual quality (Figure 7).
  • Fine-Grained Control: Guidance strengths ($w_{\text{prev}}$, $w_{\text{text}}$) are dynamically adjusted during denoising (e.g., reduced for $t > 0.3$) to prevent artifacts like saturation in long videos (Figure 8). A similar approach is applied to the distilled model.
  • KV Cache: Leveraging the autoregressive nature, KV cache stores features of processed chunks, avoiding recomputation. Constraining the KV range (e.g., to 8) limits the context length, enabling long video generation with constant peak memory. Dynamically adjusting the KV range at different denoising stages allows for controllable shot transitions, preserving identity or layout (Figure 9).
  • Prompt-Enhancement (PE): For tasks like I2V, an MLLM analyzes input images and predicts temporal evolution to create detailed prompts. This process is distilled into a smaller, efficient model (~7B) for lightweight deployment.
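
A hedged sketch of how the two guidance signals could be composed, in the spirit of the extended CFG described above; the exact decomposition in Equation 5 may differ, and `model(x_t, t, text, context)` is an assumed interface where `None` drops a condition:

```python
def guided_velocity(model, x_t, t, text_emb, prev_chunks, w_text, w_prev):
    """Compose temporal-context guidance (previous chunks x_{<i}) with text
    guidance (c_text). w_prev > 1 strengthens chunk-to-chunk consistency;
    w_text controls adherence to the (chunk-wise) text prompt."""
    v_uncond = model(x_t, t, None, None)            # no text, no temporal context
    v_prev = model(x_t, t, None, prev_chunks)       # temporal context only
    v_full = model(x_t, t, text_emb, prev_chunks)   # text + temporal context
    return (v_uncond
            + w_prev * (v_prev - v_uncond)          # guidance from preceding chunks
            + w_text * (v_full - v_prev))           # text guidance on top of context
```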

Data Pipeline

A large-scale data processing system curates training data from raw videos and images. The pipeline (Figure 4) involves:

  1. Shot Cutting: Using PySceneDetect.
  2. Initial Filters: Removing low-quality or duplicate data (Video Quality Assessment, Aesthetics, Over/Underexposed, Motion Strength (using RAFT (2008.00219) and saliency (1904.02861)), Camera Movement Stability, Slides Movement, Border/Text/Logo/Corner Face Detection, Transition Detection).
  3. De-duplication: Using CLIP (2103.00020) and DINOv2 (2304.07193) embedding similarity (a toy de-duplication sketch follows this list).
  4. MLLM as Advanced Filter: Further filtering complex bad cases.
  5. Captioning: Using an MLLM to generate Highly Descriptive Captions for images/videos (Table 4) and Auto-Regressive Captions (second-by-second descriptions focusing on changes) for video (Table 4) to enable chunk-wise control.
  6. Data Adjustment: Multi-stage adjustment (increasing resolution/duration/quality, Table 5) and dynamic distribution adjustment based on monitoring model performance during training (Section 3.3).
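
As an illustration of the de-duplication step (step 3), the sketch below greedily keeps a clip only if its CLIP or DINOv2 embedding is not too similar to any already-kept clip; the threshold and the greedy keep-first policy are assumptions, not the paper's exact procedure:

```python
import torch

def deduplicate(embeddings: torch.Tensor, threshold: float = 0.95) -> list:
    """Greedy similarity-based de-duplication.
    embeddings: (num_clips, dim) tensor, assumed L2-normalized (CLIP or DINOv2).
    Returns the indices of clips to keep."""
    kept = []
    for i in range(embeddings.shape[0]):
        if kept:
            sims = embeddings[kept] @ embeddings[i]   # cosine similarity to kept clips
            if sims.max().item() >= threshold:
                continue                              # near-duplicate of a kept clip: drop
        kept.append(i)
    return kept
```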

Infrastructure

Efficient distributed training and inference infrastructure were developed:

  • Training Infrastructure: Uses DP, CP, and TP. Addresses challenges of variable-length video data and complex attention masks.
    • Distributed Packing and Padding (PnP): Online PnP uses a greedy algorithm that extends first-fit-decreasing (FFD) bin packing (0707.3186) for efficient batching of variable-length sequences in a distributed setting, achieving high capacity utilization (99%); the classic FFD heuristic is sketched after this list.
    • MagiAttention: Scalable distributed attention for ultra-long contexts (up to 4M tokens) and heterogeneous masks. Features Flex-Flash-Attention (FFA) with AttnSlice formulation (Figure 11) for mask flexibility, a greedy dispatch solver (Algorithm 1) for computation load balancing (Figure 13), Zero-Redundant Communication primitives (group-cast/reduce, Figure 15) built on all-to-all-v, and Adaptive Multi-Stage Overlap (Algorithm 2, Figure 16) to hide communication latency. Benchmarks (Figures 17-22) show linear scalability.
    • Future Framework Design: A blueprint based on PyTorch DTensor (2410.06511) and Parallel Plan is proposed to decouple modeling from parallelization and enable high-precision alignment with non-distributed baselines (Table 9).
  • Inference Infrastructure: Tailored for real-time streaming (H100/H800) and cost-effective deployment (RTX 4090).
    • Real-Time Streaming: Optimizes Time to First Chunk (TTFC) and Time Per Output Chunk (TPOC). Uses a Multi-Model Heterogeneous Serving Pipeline (co-locating T5/Magi-1 and VAE). TPOC is optimized using W8A8 SmoothQuant (2306.10251) quantization (30% speedup, Appendix A.1) and Multi-Node Parallel Inference (Ulysses-based CP with aggressive overlap, Appendix A.2). TTFC is optimized with CUDA Graphs and VAE decoding acceleration (tile-based, torch.compile). Achieves 2.3s TTFC and <1s TPOC on 24 H100 GPUs (Table 6).
    • Cost-effective on RTX 4090: Addresses memory insufficiency using Quantization, KV-offload (storing KV cache in CPU memory), and Hybrid Parallelism (PP for weights, CP for activations). Introduces Context Shuffle Overlap (CSO, Figure 23, Appendix A.3) to improve MFU under low PCIe bandwidth by scattering chunks across GPUs for finer-grained overlap. Achieves 19GB peak memory usage and 66% MFU for the 24B model on 8x RTX 4090s.
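
For reference, the classic first-fit-decreasing heuristic that the PnP strategy builds on (see the Distributed Packing and Padding bullet above) can be sketched as follows; the distributed, online extension described in the paper is not reproduced here:

```python
def first_fit_decreasing(lengths, capacity):
    """First-fit-decreasing (FFD) bin packing: sort variable-length sequences by
    length (descending) and place each into the first micro-batch ("bin") with
    enough remaining token capacity, opening a new bin when none fits."""
    bins = []   # each bin: [remaining_capacity, [packed sequence lengths]]
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if b[0] >= length:
                b[0] -= length
                b[1].append(length)
                break
        else:
            bins.append([capacity - length, [length]])
    return [b[1] for b in bins]

# Example: pack variable-length token sequences into 4096-token micro-batches.
print(first_fit_decreasing([3000, 2500, 1200, 900, 400], capacity=4096))
# -> [[3000, 900], [2500, 1200], [400]]
```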

Evaluation

Evaluation covers perceptual quality and physical understanding:

  • Perceptual Evaluation:
    • In-house Human Evaluation: Uses a hierarchical metric system (Overall, Motion Quality, Instruction Following, Visual Quality) and a diverse 100-sample I2V dataset. Double-blind paired comparison against models like Kling (2405.04233), Hailuo [minimax2024hailuo], Wan (2503.20314), HunyuanVideo (2412.03603). Magi-1 performs strongly overall, particularly in instruction following and motion quality, while having room for improvement in visual quality (Figure 7).
    • VBench-I2V (2411.13503): Automated evaluation focusing on I2V. Magi-1 (with/without 2x VAE upsampling) achieves top scores (Table 7), excelling in dynamic degree and motion smoothness while maintaining visual quality, addressing a common trade-off.
  • Physical Evaluation:
    • Physics-IQ Benchmark (2501.09038): Evaluates understanding of physical dynamics by predicting future frames from a prefix video. Magi-1 significantly outperforms other models in V2V prediction, even when conditioned only on an image (Table 8). The autoregressive nature is credited for better causal modeling. Case studies (Figure 10) show strong primary dynamics but limitations with complex secondary effects. Performance improves with increased historical context (KV range) in V2V (Figure 11).

Limitations and Future Work

The main limitation identified is the tightly coupled architecture where a single transformer handles both high-level context fusion and low-level denoising. This leads to potential inference bottlenecks, optimization conflicts, and limited fine-grained controllability. Future work suggests exploring a decoupled design that separates semantic reasoning from visual synthesis, potentially bridging video generation and understanding towards a "world model".

Contributions

Magi-1 contributes a scalable chunk-wise autoregressive diffusion framework, unifying T2V, I2V, and video continuation under causal constraints. It enables real-time streaming and long-horizon generation through efficient inference techniques and is supported by a novel data pipeline and distributed infrastructure, including the MagiAttention mechanism. Empirical results establish it as a strong performer in fidelity, controllability, and physical plausibility.
