
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale (2410.20280v1)

Published 26 Oct 2024 in cs.CV and cs.AI

Abstract: We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.


MarDini: Leveraging Masked Autoregressive Diffusion for Advanced Video Generation

MarDini is a family of video generation models that combines masked auto-regression (MAR) with diffusion models (DMs) to make video generation tractable at scale. MAR handles temporal planning across frames, while diffusion handles spatial generation within each frame. The core design choice is an asymmetric architecture: most parameters sit in a planning model that works on low-resolution input, and a lightweight generation model produces the high-resolution frames.

Framework and Key Components

The MarDini framework has two functional units: a MAR-based planning model and a DM-based generation model. The asymmetry is deliberate. Most of the parameters go to the planning model, which handles long-range temporal modeling at low resolution, while high-resolution spatial refinement is left to a smaller, lightweight diffusion model. This keeps computation manageable and lets a single model handle video interpolation, image-to-video generation, and video expansion.

The MAR component uses bi-directional attention to approximate autoregressive behavior on continuous spatio-temporal data. It produces a planning signal for each masked frame, and these signals guide the DM, which generates high-resolution frames through diffusion de-noising. Splitting the work this way sidesteps the difficulty that autoregressive models of the kind used in natural language processing face with high-dimensional video data. Because any number of frames can be masked at any position, changing the masking strategy is enough to switch between generative tasks.
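The three masking strategies named in the abstract can be sketched as per-frame boolean masks. This is a minimal illustration, not the authors' code: the function name `frame_mask`, the task labels, and the choice to keep the first half of the clip for expansion are all assumptions made for the example.

```python
from typing import List


def frame_mask(num_frames: int, task: str) -> List[bool]:
    """Return one flag per frame: True means the frame is masked (to be generated).

    Hypothetical helper; the task names and the exact frames kept are
    illustrative assumptions based on the examples given in the abstract.
    """
    if num_frames < 3:
        raise ValueError("need at least 3 frames")
    if task == "interpolation":
        # Keep the first and last frames, mask the middle frames.
        return [0 < i < num_frames - 1 for i in range(num_frames)]
    if task == "image_to_video":
        # Keep only the first frame, mask from the second frame onward.
        return [i >= 1 for i in range(num_frames)]
    if task == "expansion":
        # Keep the first half of the clip, mask the second half.
        return [i >= num_frames // 2 for i in range(num_frames)]
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for task in ("interpolation", "image_to_video", "expansion"):
        print(task, frame_mask(6, task))
```

Only the mask changes between tasks; the same model would consume the unmasked frames as conditioning in every case.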

Empirical Observations and Model Performance

On video interpolation, MarDini sets a new state of the art on VIDIM-Bench, improving the Fréchet Video Distance (FVD) over competing models. This indicates that the model handles long-range frame prediction with strong temporal coherence and visual fidelity. It is also competitive in image-to-video generation: evaluated on VBench, it produces videos on par with much more expensive image-to-video models within a few inference steps.

MarDini can be trained without the extensive image-based pre-training that video generative models conventionally require. Instead, training increases task difficulty incrementally, moving from video interpolation toward full video generation within one model, so no separate upstream pre-training stage is needed.
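One way to picture an incremental difficulty schedule is to grow the number of masked frames as training proceeds. The sketch below is purely illustrative: the source does not state the schedule MarDini uses, so the linear ramp, the function name `masked_frames_at_step`, and its parameters are assumptions.

```python
def masked_frames_at_step(step: int, total_steps: int, num_frames: int,
                          min_masked: int = 1) -> int:
    """Number of frames to mask at a given training step.

    Assumed linear ramp (not taken from the paper): start with few masked
    frames, which resembles interpolation, and end with every frame masked,
    which corresponds to generating the whole video.
    """
    if total_steps <= 1:
        return num_frames
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return min_masked + round(progress * (num_frames - min_masked))


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 99):
        print(step, masked_frames_at_step(step, total_steps=100, num_frames=8))
```

Under this assumption, early steps see an easy task with most frames given, and the task hardens until no frames are given at all.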

Architectural Insights and Computational Efficiency

A standout feature of MarDini is its flexibility: one model covers a range of tasks rather than a single fixed one. Using hierarchical and autoregressive conditioning, MarDini generates extended video sequences from minimal inputs and supports both video interpolation and video expansion.

MarDini's computational efficiency comes mainly from its asymmetric design. Because the planning model operates at low resolution, expensive spatio-temporal attention becomes feasible at scale, which reduces inference latency without compromising performance. Concentrating computation at low resolution also eases memory constraints, making MarDini well suited to high-resolution video generation.
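A rough calculation shows why resolution matters so much for spatio-temporal attention. The sketch assumes full self-attention over patch tokens, whose cost grows with the square of the token count; the frame count, resolutions, and patch size are hypothetical values chosen for the arithmetic, not MarDini's actual configuration.

```python
def num_tokens(frames: int, height: int, width: int, patch: int) -> int:
    """Token count when each frame is cut into patch x patch tiles."""
    return frames * (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Full self-attention scores every token against every token."""
    return tokens * tokens


def cost_ratio(frames: int, high_res: int, low_res: int, patch: int) -> float:
    """How many times more token pairs the high-resolution input needs."""
    high = attention_pairs(num_tokens(frames, high_res, high_res, patch))
    low = attention_pairs(num_tokens(frames, low_res, low_res, patch))
    return high / low


if __name__ == "__main__":
    # Hypothetical setting: 16 frames, 512x512 versus 256x256, 16x16 patches.
    print(num_tokens(16, 512, 512, 16))   # 16384 tokens
    print(num_tokens(16, 256, 256, 16))   # 4096 tokens
    print(cost_ratio(16, 512, 256, 16))   # 16.0
```

Halving each spatial side cuts the token count by 4 and the number of attention pairs by 16, which is the kind of saving that makes it practical to spend most parameters on the low-resolution planning model.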

Future Directions and Implications

Combining MAR with DMs opens directions for further research on scalable video generation. Future work could integrate additional conditioning signals, such as text or motion guidance, which could widen the range of applications and improve robustness. Given MarDini's efficient use of computation and its scalable design, later iterations might also advance real-time video generation.

In conclusion, MarDini is a promising direction in video generative modeling. It brings diffusion into a masked autoregressive framework to deliver temporal coherence and spatial detail while remaining flexible, efficient, and scalable.


Authors

Haozhe Liu
Shikun Liu
Zijian Zhou
Mengmeng Xu
Yanping Xie
Xiao Han
Juan C. Pérez
Ding Liu
Kumara Kahatapitiya
Menglin Jia
Jui-Chieh Wu
Sen He
Tao Xiang
Jürgen Schmidhuber
Juan-Manuel Pérez-Rúa
