- The paper introduces a novel integration of diffusion models with recurrent memory-augmented latent transformers for long video generation.
- The methodology incorporates recurrent attention and noise augmentation strategies to maintain long-term stability and frame quality.
- Experiments demonstrate significant FVD improvements on UCF-101 and Kinetics-600, validating the method's practical advantages for long video synthesis.
The paper introduces MALT Diffusion, a diffusion-based approach to generating videos of arbitrary length with Memory-Augmented Latent Transformers. Long video generation has traditionally been difficult because of the high computational and memory cost of modeling long temporal sequences. This work addresses these challenges by focusing on two aspects: long-term contextual understanding and long-term stability.
Key Contributions:
- Recurrent Attention for Memory Encoding: MALT Diffusion introduces recurrent attention layers that encode multiple past video segments into compact latent memory vectors. Maintaining and updating this memory over time lets the model condition on much longer temporal contexts than fit in a single attention window. With this memory augmentation, MALT performs segment-level autoregressive generation and can therefore handle long videos efficiently (a minimal sketch of this mechanism follows this list).
- Training Techniques for Stability: The paper details training strategies aimed at minimizing quality degradation over long sequences. A notable one is noise augmentation of the memory vectors during training, which makes the model robust to the imperfect memory it must condition on at inference time. This mitigates the error accumulation typical of autoregressive models and helps maintain frame quality across extended videos (see the second sketch after this list).
- Strong Performance on Long Video Benchmarks: MALT's effectiveness is demonstrated on standard long-video benchmarks. On UCF-101, it achieves a Fréchet Video Distance (FVD, lower is better) of 220.4 for 128-frame generation, well ahead of the prior state-of-the-art score of 648.4. On Kinetics-600 long video prediction, it improves the previous best FVD from 799 to 392.
- Text-to-Video Generation Capabilities: MALT Diffusion also applies to text-to-video generation. It produces consistent, high-quality video from text prompts, with stable quality in videos up to two minutes long at 8 frames per second (roughly 960 frames).
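To make the memory mechanism concrete, below is a minimal PyTorch sketch of recurrent memory encoding and segment-level autoregressive rollout. The names, dimensions, and cross-attention wiring are assumptions for illustration (including the hypothetical `sample_segment` callable standing in for a conditional diffusion sampler); the paper's exact MALT architecture may differ.

```python
import torch
import torch.nn as nn


class RecurrentMemoryEncoder(nn.Module):
    """Compresses a segment's latent tokens, together with the memory carried
    over from earlier segments, into a fixed-size set of memory vectors."""

    def __init__(self, dim: int = 512, num_memory: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned queries that read from the current segment and the old memory.
        self.memory_queries = nn.Parameter(torch.randn(num_memory, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment_tokens: torch.Tensor, prev_memory: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (B, N, dim) latent tokens of the current segment
        # prev_memory:    (B, M, dim) compact memory of all previous segments
        b = segment_tokens.size(0)
        queries = self.memory_queries.unsqueeze(0).expand(b, -1, -1) + prev_memory
        context = torch.cat([prev_memory, segment_tokens], dim=1)
        attended, _ = self.cross_attn(queries, context, context)
        return self.norm(queries + attended)  # updated memory: (B, M, dim)


def generate_long_video(sample_segment, encoder, num_segments, batch=1, num_memory=64, dim=512):
    """Segment-level autoregressive rollout: each segment is generated while
    conditioning only on the compact memory, never on all previous frames."""
    memory = torch.zeros(batch, num_memory, dim)
    segments = []
    for _ in range(num_segments):
        tokens = sample_segment(memory)   # (B, N, dim); hypothetical diffusion sampler
        memory = encoder(tokens, memory)  # fold the new segment into the memory
        segments.append(tokens)
    return segments
```

Because the memory has a fixed size, per-segment compute stays constant regardless of how long the rollout gets, which is what allows generation far beyond a single attention window.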
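The stability technique can be sketched in the same style. The corruption scheme below (per-example Gaussian noise with a random scale) is an assumption about how memory noise augmentation might be implemented; the paper may use a different noise schedule or additionally condition on the noise level.

```python
import torch


def corrupt_memory(memory: torch.Tensor, max_noise_std: float = 0.5) -> torch.Tensor:
    """Perturb memory vectors with Gaussian noise of a random per-example scale,
    so the denoiser learns to tolerate the imperfect memory it will see when
    conditioning on its own generations at inference time."""
    noise_std = torch.rand(memory.size(0), 1, 1, device=memory.device) * max_noise_std
    return memory + noise_std * torch.randn_like(memory)


# During training, the current segment is denoised against a deliberately
# corrupted memory of the (ground-truth) preceding segments, e.g.
# (denoiser and diffusion_loss are hypothetical stand-ins):
#   memory = encoder(prev_segment_tokens, memory)
#   loss = diffusion_loss(denoiser, current_segment, cond=corrupt_memory(memory))
```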
Theoretical and Practical Implications:
MALT Diffusion marks a notable shift in handling video data by integrating latent diffusion models with memory-augmented architectures. The use of recurrent memory vectors is a conceptual step that could influence long-sequence modeling in other domains, such as climate simulation or other long temporal datasets.
Practically, the method generates longer and more coherent video sequences than existing models, benefiting multimedia content creation and, potentially, virtual reality. Handling long contexts without prohibitive computational cost also makes such models more practical to deploy in real-world applications.
Future Directions:
Future research could further optimize the attention mechanisms to improve the scalability of memory-augmented models. Extending the framework to other generative tasks, such as 3D video generation, and integrating it with large language models (LLMs) to handle complex, narrative-driven sequences are also promising directions for the AI community.
Overall, MALT Diffusion is a compelling step forward in video synthesis, offering a robust framework for generating high-quality, lengthy video sequences while maintaining consistency and minimizing quality degradation.