LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity (2412.09856v2)

Published 13 Dec 2024 in cs.CV, cs.AI, cs.LG, and eess.IV

Abstract: Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15$\times$ (11.5$\times$) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our project website: https://lineargen.github.io/.

Summary

  • The paper introduces LinGen, which replaces quadratic self-attention with a linear-complexity MATE block to achieve up to 15× FLOPs reduction and 11.5× lower latency.
  • Its dual-path design features an MA-branch with Rotary Major Scan and a TE-branch with Temporal Swin Attention to ensure spatial detail and temporal consistency.
  • The framework achieves video quality comparable to state-of-the-art models through progressive, hybrid training, paving the way for accessible high-resolution video generation.

LinGen: A Linear Complexity Approach to High-Resolution Minute-Length Text-to-Video Generation

The paper presents LinGen, a framework for text-to-video generation that targets high-resolution, minute-length videos at linear computational complexity. This departs from traditional diffusion-transformer models, whose cost grows quadratically with the number of pixels and becomes prohibitive for long videos. By replacing self-attention blocks with a linear-complexity module named MATE, LinGen generates detailed, temporally consistent videos at a scale that was previously infeasible on a single GPU.
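
To make the scaling argument concrete, the sketch below compares the asymptotic FLOP counts of self-attention against a generic linear-complexity block. The per-second token count, hidden dimension, and state size are illustrative assumptions, not figures from the paper.

```python
# Illustrative only: why self-attention cost explodes for minute-length video.
# Token counts and constants here are assumptions, not the paper's numbers.

def attention_flops(n_tokens: int, dim: int = 1024) -> float:
    """Self-attention cost grows as O(n^2 * d) in sequence length."""
    return 2 * n_tokens**2 * dim

def linear_block_flops(n_tokens: int, dim: int = 1024, state: int = 128) -> float:
    """A linear-complexity block (e.g., an SSM) grows as O(n * d * s)."""
    return 2 * n_tokens * dim * state

for seconds in (10, 60):
    # Assume ~3k spatiotemporal tokens per second after patchification (hypothetical).
    n = 3_000 * seconds
    ratio = attention_flops(n) / linear_block_flops(n)
    print(f"{seconds:>3}s video, {n:,} tokens: attention/linear FLOPs ~ {ratio:,.0f}x")
```

The point of the exercise is that the gap between the two curves widens with sequence length, so the longer the video, the larger the saving from a linear-complexity block.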

The MATE block is central to LinGen's design. Its MA-branch combines a bidirectional Mamba2 block with the Rotary Major Scan (RMS) token-rearrangement method and review tokens, targeting correlations from short to long range; RMS keeps spatially and temporally adjacent tokens close together in the scanned sequence, which is crucial for preserving visual consistency across frames. Its TE-branch is a TEmporal Swin Attention (TESA) block that models temporal correlations between adjacent and medium-range tokens. Together, the two branches address Mamba's adjacency-preservation issue and significantly improve the consistency of generated videos without quadratic overhead. A minimal sketch of this dual-branch layout appears below.
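
The following PyTorch sketch shows only the dual-branch structure described above; the internals are stand-ins (a bidirectional GRU in place of Mamba2, fixed-window attention in place of TESA, and no Rotary Major Scan or review tokens), so this is an illustration under stated assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MATEBlockSketch(nn.Module):
    """Sketch of a MATE-style block: a linear-cost global branch (MA) plus a
    windowed local-attention branch (TE), combined residually (assumed)."""

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # MA-branch stand-in: bidirectional recurrence, O(n) in sequence length.
        self.ma = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        # TE-branch stand-in: attention restricted to fixed-size windows, so the
        # cost stays linear in the number of windows.
        self.te = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), tokens flattened over (time, height, width).
        b, n, d = x.shape
        h = self.norm(x)
        ma_out, _ = self.ma(h)  # global, linear-cost branch
        # Split tokens into non-overlapping windows and attend within each.
        w = self.window
        pad = (-n) % w
        hp = nn.functional.pad(h, (0, 0, 0, pad))
        hp = hp.reshape(b * ((n + pad) // w), w, d)
        te_out, _ = self.te(hp, hp, hp)
        te_out = te_out.reshape(b, n + pad, d)[:, :n]
        # Combine branches and add back residually (an assumption of this sketch).
        return x + ma_out + te_out

# Usage: a batch of 2 videos, 256 tokens each, 64-dim embeddings.
block = MATEBlockSketch(dim=64)
out = block(torch.randn(2, 256, 64))
print(out.shape)  # torch.Size([2, 256, 64])
```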

Empirical results in the paper underscore LinGen's efficiency alongside competitive video quality. Relative to a DiT baseline, the model reduces FLOPs by up to 15× and latency by up to 11.5× while winning 75.6% of quality comparisons, enabling minute-length generation on a single GPU. In both automatic metrics and human evaluations, LinGen-4B's videos are judged comparable to state-of-the-art models such as Runway Gen-3, LumaLabs, and Kling (50.5%, 52.1%, and 49.1% win rates, respectively), demonstrating a balanced gain in visual quality and computational efficiency.

The implications of LinGen's success are twofold. Practically, it makes long-video generation substantially more accessible, opening the door to real-time or near-real-time applications on commodity hardware. Theoretically, it challenges the assumption that only quadratic-complexity attention can deliver high-quality generation in extended settings such as video.

LinGen's training relies on a progressive scheme, hybrid training that mixes text-to-video and text-to-image tasks, and quality tuning on curated high-quality datasets. Together these choices underpin its robust performance across dimensions of video quality and efficiency; the progressive schedule, in particular, lets the model adapt as token sequences grow with increasing video length and resolution (see the illustrative schedule below).
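
A hedged sketch of what such a progressive, hybrid schedule might look like as configuration; the stage resolutions, clip durations, and image/video mixing ratios are assumptions for illustration, not the paper's recipe.

```python
# Hypothetical progressive training schedule: stages of increasing resolution
# and clip length, with text-to-image batches mixed in (hybrid training).
stages = [
    {"resolution": 256, "seconds": 4,  "image_batch_frac": 0.5},
    {"resolution": 512, "seconds": 17, "image_batch_frac": 0.25},
    {"resolution": 512, "seconds": 68, "image_batch_frac": 0.1},
]

for i, s in enumerate(stages, 1):
    print(f"stage {i}: {s['resolution']}p, {s['seconds']}s clips, "
          f"{s['image_batch_frac']:.0%} of batches text-to-image")
```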

This research points toward extending LinGen's capabilities, potentially to hour-long video generation. It also invites broader work on architectures that keep generative tasks computationally feasible; the MATE block itself could serve as a template for other tasks that require large-scale sequence processing at manageable cost.

In conclusion, LinGen addresses significant challenges in text-to-video generation and demonstrates notable scalability in generating high-resolution, minute-length videos through linear complexity. It presents a compelling case for re-evaluating complexity assumptions and architectures in video generation, driving advancements in both practical applications and theoretical foundations of generative model designs.
