
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale (2410.20280v1)

Published 26 Oct 2024 in cs.CV and cs.AI

Abstract: We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.

MarDini: Leveraging Masked Autoregressive Diffusion for Advanced Video Generation

MarDini introduces a family of video generation models that combines masked auto-regression (MAR) with diffusion models (DMs) to tackle the challenges of video generation at scale. In this design, MAR handles the temporal planning of the video task, while a diffusion model refines the spatial detail of each frame. MarDini's core contribution is an efficient asymmetric architecture that balances computational cost against model capacity.

Framework and Key Components

The MarDini framework consists of two functional units: a MAR-based planning model and a DM-based generation model. The asymmetry is deliberate: most of the parameters sit in the MAR planner, which performs long-range temporal reasoning at low resolution, while high-resolution spatial refinement is delegated to a much smaller diffusion model. This split keeps computation efficient and lets a single model handle tasks such as video interpolation, image-to-video generation, and video expansion.
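The split can be made concrete with a minimal sketch. The module names, depths, and token shapes below are illustrative assumptions rather than the paper's actual configuration; the point is only that the planner carries most of the parameters and operates on low-resolution tokens, while the diffusion de-noiser is shallow and consumes the planner's output as conditioning.

```python
# Minimal sketch of the asymmetric planner/generator split (assumed shapes and sizes).
import torch
import torch.nn as nn

class PlanningModel(nn.Module):
    """Heavy MAR-based planner: operates on low-resolution frame tokens."""
    def __init__(self, dim=512, depth=12, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # most parameters live here

    def forward(self, lowres_tokens):            # (B, T*h*w, dim)
        return self.encoder(lowres_tokens)        # one planning signal per low-res token

class GenerationModel(nn.Module):
    """Lightweight diffusion de-noiser: refines high-resolution frame tokens."""
    def __init__(self, dim=512, depth=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)  # far fewer parameters

    def forward(self, noisy_hires_tokens, planning_signals):
        # cross-attend to the planner's output while de-noising
        return self.decoder(noisy_hires_tokens, planning_signals)

planner, generator = PlanningModel(), GenerationModel()
lowres = torch.randn(1, 4 * 8 * 8, 512)     # 4 frames at an 8x8 latent grid (toy numbers)
hires = torch.randn(1, 4 * 16 * 16, 512)    # the same frames at a 16x16 latent grid
signals = planner(lowres)
denoised = generator(hires, signals)
print(denoised.shape)                        # torch.Size([1, 1024, 512])
```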

The MAR component uses bi-directional attention to emulate autoregressive conditioning over continuous spatio-temporal inputs. It produces planning signals that guide the DM, which specializes in de-noising and refining high-resolution video frames. This division of labour sidesteps the difficulties that high-dimensional video data poses for conventional, language-model-style autoregressive approaches, and varying the masking strategy lets the same model serve several generative tasks.
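As a toy illustration of these masking strategies (not taken from the paper's code), the helper below builds the boolean frame masks corresponding to the three tasks named in the abstract; True marks frames the model must generate, False marks frames it conditions on.

```python
# Illustrative frame-masking patterns for the three tasks (assumed helper, not official code).
import torch

def make_mask(num_frames: int, task: str) -> torch.Tensor:
    """Return a boolean mask; True marks frames to be generated (masked)."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if task == "interpolation":          # keep first and last frames, generate the middle
        mask[1:-1] = True
    elif task == "image_to_video":       # keep only the first frame
        mask[1:] = True
    elif task == "expansion":            # keep the first half, generate the second half
        mask[num_frames // 2:] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

for task in ("interpolation", "image_to_video", "expansion"):
    print(task, make_mask(8, task).int().tolist())
```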

Empirical Observations and Model Performance

MarDini sets a new state of the art for video interpolation on VIDIM-Bench, with substantial improvements in Fréchet Video Distance (FVD) over competing models. This demonstrates its ability to handle long-range frame prediction with strong temporal coherence and visual fidelity. The model is also competitive in image-to-video generation, as shown by its results on the VBench benchmark.
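For readers unfamiliar with FVD, it is the Fréchet distance between Gaussian fits of real and generated video features (typically extracted with an I3D network). The sketch below computes that distance on pre-extracted feature matrices; the feature extractor and the exact evaluation protocol are omitted, and the random arrays are placeholders.

```python
# Fréchet distance underlying FVD, on pre-extracted features (toy stand-ins below).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):         # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2.0 * covmean))

real = np.random.randn(256, 400)         # placeholder "real" video features
fake = np.random.randn(256, 400) + 0.1   # placeholder "generated" video features
print(frechet_distance(real, fake))
```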

MarDini's adaptability and scalability also show in its ability to do without the extensive image-based pre-training that is a conventional prerequisite for video generative models. By gradually increasing task difficulty during training, it progresses from video interpolation to full video generation within a single, unified training recipe, without an upstream pre-training stage.
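One way to realize such a progression is a curriculum over the masking ratio, starting with few masked frames (easy interpolation) and ending with mostly masked frames (close to full generation). The linear ramp below is purely speculative, since the exact schedule is not specified in this summary, but it shows the general pattern.

```python
# Speculative masking-ratio curriculum: harder tasks (more masked frames) later in training.
import torch

def sample_mask(num_frames: int, step: int, total_steps: int) -> torch.Tensor:
    """Mask a growing fraction of frames as training progresses (illustrative schedule)."""
    ratio = 0.25 + 0.7 * min(step / total_steps, 1.0)   # 25% masked early, up to 95% late
    num_masked = max(1, int(round(ratio * num_frames)))
    idx = torch.randperm(num_frames)[:num_masked]
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[idx] = True
    return mask

for step in (0, 5000, 10000):
    print(step, sample_mask(8, step, 10000).int().tolist())
```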

Architectural Insights and Computational Efficiency

A standout feature of MarDini is its flexibility: rather than being fixed to one generation task, the same model supports a range of them. Its hierarchical and autoregressive conditioning lets it grow extended video sequences from minimal inputs, for instance by chaining video expansion and interpolation steps.
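One way to read "grow extended video sequences from minimal inputs" is as repeated expansion, where each round conditions on the most recently generated frames. The loop below is an assumed illustration of that pattern with a trivial stand-in model so it runs end to end; it is not the released inference code.

```python
# Illustrative autoregressive expansion loop (assumed structure, toy stand-in model).
import torch

def expand_video(model, frames: torch.Tensor, rounds: int, new_per_round: int):
    """frames: (T, C, H, W). Appends `new_per_round` frames per round."""
    for _ in range(rounds):
        context = frames[-new_per_round:]                      # condition on the newest frames
        placeholder = torch.zeros_like(context)                # masked slots to fill
        generated = model(torch.cat([context, placeholder]))   # hypothetical model call
        frames = torch.cat([frames, generated[-new_per_round:]])
    return frames

toy_model = lambda x: x                                        # stand-in so the loop executes
video = expand_video(toy_model, torch.randn(8, 3, 64, 64), rounds=2, new_per_round=4)
print(video.shape)                                             # torch.Size([16, 3, 64, 64])
```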

MarDini's computational efficiency stems from this asymmetric design. Because the planning phase runs at low resolution, the expensive spatio-temporal attention can be applied at scale there, reducing inference latency without compromising quality. The same allocation eases memory pressure, making the model well suited to high-resolution video generation.
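To see why low-resolution planning matters, note that full spatio-temporal attention costs grow quadratically with the number of tokens, which itself grows with spatial resolution. The latent sizes below are assumed numbers used only to show the scale of the gap.

```python
# Back-of-the-envelope attention cost at two latent resolutions (assumed sizes).
def attention_flops(frames: int, height: int, width: int, dim: int = 1024) -> float:
    tokens = frames * height * width
    return 2.0 * tokens ** 2 * dim       # QK^T plus attention-weighted V, roughly

low = attention_flops(16, 8, 8)           # planner: 16 frames on an 8x8 latent grid
high = attention_flops(16, 32, 32)        # the same clip on a 32x32 latent grid
print(f"low-res: {low:.2e} FLOPs, high-res: {high:.2e} FLOPs, ratio: {high / low:.0f}x")
```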

Future Directions and Implications

The combination of MAR with DMs in MarDini opens new directions for scalable video generation. Future work could integrate additional conditioning signals, such as text or motion guidance, to widen its application scope and improve robustness. Given the model's efficient use of compute and its scalable design, future iterations could also push toward real-time video generation across multimedia platforms.

In conclusion, MarDini represents a promising direction in video generative modeling, offering strong temporal coherence and spatial detail while avoiding the rigidity of fixed-task, fixed-resolution generation. It embodies a diffusion-based approach to autoregressive video generation that combines flexibility, efficiency, and scalability, making it a notable contribution to generative modeling.

Authors (15)
  1. Haozhe Liu
  2. Shikun Liu
  3. Zijian Zhou
  4. Mengmeng Xu
  5. Yanping Xie
  6. Xiao Han
  7. Juan C. Pérez
  8. Ding Liu
  9. Kumara Kahatapitiya
  10. Menglin Jia
  11. Jui-Chieh Wu
  12. Sen He
  13. Tao Xiang
  14. Jürgen Schmidhuber
  15. Juan-Manuel Pérez-Rúa