MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
Abstract: We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model, which contains most of the parameters, generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion denoising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
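The three masking patterns named in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `frame_mask`, the task labels, and the choice to mask the second half of the clip for video expansion are all assumptions made for illustration (the abstract only says "masking half the frames"). A value of `True` means the frame is masked and would be generated; `False` means it is kept as a reference frame.

```python
from typing import List


def frame_mask(num_frames: int, task: str) -> List[bool]:
    """Return a per-frame mask; True marks a frame to be generated.

    Hypothetical helper: the task names and exact patterns are
    illustrative assumptions, not part of the MarDini release.
    """
    if num_frames < 3:
        raise ValueError("need at least 3 frames for these illustrative tasks")
    if task == "interpolation":
        # Keep the first and last frames as references; mask the middle.
        return [0 < i < num_frames - 1 for i in range(num_frames)]
    if task == "image_to_video":
        # Keep only the first frame; mask from the second frame onward.
        return [i >= 1 for i in range(num_frames)]
    if task == "expansion":
        # Assumption: keep the first half, mask the second half.
        return [i >= num_frames // 2 for i in range(num_frames)]
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for task in ("interpolation", "image_to_video", "expansion"):
        print(task, frame_mask(6, task))
```

Because the tasks differ only in which frames are masked, one model conditioned on such a mask could in principle serve all of them, which is the flexibility the abstract describes.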