- The paper introduces LinGen, which replaces quadratic self-attention with a linear-complexity MATE block to achieve up to 15× FLOPs reduction and 11.5× lower latency.
- Its dual-path design features an MA-branch with Rotary Major Scan and a TE-branch with Temporal Swin Attention to ensure spatial detail and temporal consistency.
- The framework achieves video quality comparable to state-of-the-art models through progressive, hybrid training, paving the way for accessible high-resolution video generation.
LinGen: A Linear Complexity Approach to High-Resolution Minute-Length Text-to-Video Generation
The paper presents LinGen, a novel framework for text-to-video generation that targets high-resolution, minute-length videos while keeping computational complexity linear in the number of pixels. This departs from traditional diffusion transformers, whose self-attention cost grows quadratically with the number of video tokens and becomes prohibitive for long videos. By replacing self-attention blocks with a linear-complexity module named MATE, LinGen generates detailed, temporally consistent videos at a length previously infeasible.
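The scaling argument can be made concrete with a back-of-envelope FLOP count. The sketch below is illustrative, not the paper's exact accounting: the token count, hidden dimension, and state size are assumed values, and constants are dropped.

```python
# Self-attention scales as O(N^2 * d) in sequence length N, while a
# linear-complexity state-space block scales roughly as O(N * d * s)
# for a fixed state size s, so the gap widens as videos get longer.

def attention_flops(n_tokens: int, dim: int) -> int:
    # QK^T and attention-weighted V: two N x N x d matmuls (constants dropped)
    return 2 * n_tokens * n_tokens * dim

def linear_block_flops(n_tokens: int, dim: int, state: int = 128) -> int:
    # per-token state update and readout (constants dropped)
    return 2 * n_tokens * dim * state

# Assumed token count for a minute-length clip after patchification
# (hypothetical shape: 1,500 latent frames x 32 x 32 spatial tokens).
n = 1500 * 32 * 32
d = 1024

ratio = attention_flops(n, d) / linear_block_flops(n, d)
print(f"tokens: {n:,}, attention/linear FLOP ratio: {ratio:,.0f}x")
```

The ratio reduces to N divided by the state size, so it grows linearly with video length: doubling the clip length doubles attention's relative cost.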
The MATE block is central to LinGen's design. Its MA-branch combines a bidirectional Mamba2 block with the Rotary Major Scan (RMS) and review tokens; its TE-branch consists of the TEmporal Swin Attention (TESA) block. The MA-branch captures correlations at diverse ranges: RMS rotates the major order in which the token grid is flattened across layers, so different axes take turns being adjacent in the 1D sequence, while review tokens supply global context. Because any 1D scan still breaks some spatial or temporal adjacency, the TE-branch compensates: TESA preserves adjacency among neighboring tokens across frames and thereby assures temporal coherence. Together, these components significantly enhance video consistency without quadratic-complexity overhead.
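The rotating-scan idea can be sketched as a token rearrangement. The snippet below is a minimal illustration, assuming a simple rotation among temporal-, height-, and width-major orders; the paper's exact scan orders and block wiring may differ.

```python
import numpy as np

# Sketch of the Rotary Major Scan idea: flatten a (T, H, W) token grid in a
# different major order at each layer, so the 1D sequence fed to the
# bidirectional Mamba2 block keeps different axes adjacent in turn.

def rotary_major_scan(tokens: np.ndarray, layer_idx: int) -> np.ndarray:
    """tokens: (T, H, W, C) -> (T*H*W, C), flattened in a rotating major order."""
    orders = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]  # assumed rotation of axis orders
    perm = orders[layer_idx % len(orders)]
    t = np.transpose(tokens, perm + (3,))
    return t.reshape(-1, tokens.shape[-1])

grid = np.arange(2 * 3 * 4).reshape(2, 3, 4, 1)  # tiny (T=2, H=3, W=4) grid
seq0 = rotary_major_scan(grid, 0)  # width varies fastest: spatial neighbors adjacent
seq1 = rotary_major_scan(grid, 1)  # time varies fastest: temporal neighbors adjacent
print(seq0[:4, 0])  # first four tokens under the layer-0 scan
```

Under the layer-0 scan, tokens 0, 1, 2, 3 (one image row) sit next to each other; under the layer-1 scan, the two temporal copies of each spatial position (e.g. 0 and 12) become neighbors instead, which is exactly the adjacency trade-off the TE-branch then patches up.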
Empirical evidence in the paper underscores LinGen's efficiency alongside competitive video quality. The model reduces FLOPs by up to 15× and latency by up to 11.5× relative to a standard DiT baseline, enabling generation on a single GPU. Notably, the quality of LinGen's videos, assessed through both automatic metrics and human evaluation, is comparable to leading state-of-the-art models such as Runway Gen3 and Kling, demonstrating gains in computational efficiency without sacrificing visual quality.
The implications of LinGen's success are manifold. Practically, it makes video generation substantially more accessible, enabling real-time or near-real-time applications on commodity hardware. Theoretically, it challenges the prevailing assumption that only quadratic-complexity mechanisms can achieve high-quality generation for extended tasks like video.
LinGen's training relies on a progressive scheme, hybrid training that mixes text-to-video and text-to-image tasks, and quality tuning on curated high-quality datasets. These choices collectively contribute to robust performance across multiple dimensions of video quality and efficiency. The progressive scheme, in particular, lets the model adapt to the longer token sequences that come with growing video length and resolution.
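The spirit of such a progressive schedule can be sketched as a staged config. Every concrete value below (stage order, resolutions, frame counts, patch size) is a hypothetical illustration, not the paper's recipe; only the overall shape, short/low-resolution first, long/high-resolution last, follows the description above.

```python
# Hypothetical progressive-training schedule: each stage increases the
# token count per sample, easing the model into long sequences.
stages = [
    {"task": "text-to-image", "resolution": 256, "frames": 1},
    {"task": "text-to-video", "resolution": 256, "frames": 64},
    {"task": "text-to-video", "resolution": 512, "frames": 256},
    {"task": "quality-tune",  "resolution": 512, "frames": 1024},  # curated data
]

def tokens_per_sample(stage: dict, patch: int = 16) -> int:
    # assumed patch size; token count grows with resolution and frame count
    return (stage["resolution"] // patch) ** 2 * stage["frames"]

for s in stages:
    print(f'{s["task"]:>13}: {tokens_per_sample(s):,} tokens per sample')
```

The point of the staging is visible in the numbers: the final stage processes sequences thousands of times longer than the image-only warm-up, which is exactly where a linear-complexity backbone pays off.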
This research points towards further exploration of LinGen's capabilities, potentially extending to hour-long video generation. Such development invites broader discussion of how to optimize architectures for various generative tasks while keeping their computational requirements feasible. The MATE block itself could serve as a template for other tasks that require large-scale sequence processing at manageable computational cost.
In conclusion, LinGen addresses significant challenges in text-to-video generation and demonstrates notable scalability in generating high-resolution, minute-length videos through linear complexity. It presents a compelling case for re-evaluating complexity assumptions and architectures in video generation, driving advancements in both practical applications and theoretical foundations of generative model designs.