Overview of Video-Infinity: Distributed Long Video Generation
This paper introduces Video-Infinity, a framework designed to address the key challenges of generating long-form videos with diffusion models. It targets two primary obstacles: the substantial memory footprint and the long processing time of generating long videos on a single GPU. By distributing inference across multiple GPUs, Video-Infinity generates lengthy videos efficiently.
Challenges in Long Video Generation
The rise of diffusion models has marked a significant milestone in video generation, with models such as DDPM and LDM demonstrating impressive results. However, most of these models are confined to generating short clips (16-24 frames), primarily because memory demands and computational costs grow steeply as sequences get longer.
Existing methods to extend video length include autoregressive, hierarchical, and short-to-long strategies. However, these approaches face several limitations, such as lack of end-to-end integration, high computational load, and difficulties in maintaining global consistency across video segments.
The Video-Infinity Framework
Video-Infinity overcomes these challenges with a distributed inference pipeline that parallelizes video generation across multiple GPUs, built on two mechanisms: Clip parallelism and Dual-scope attention.
Clip Parallelism
Clip parallelism distributes the workload efficiently across multiple GPUs. It segments the input video latent into smaller, manageable clips, each processed on a different GPU, and keeps context information synchronized between devices while minimizing communication overhead. Each GPU processes its clip in three communication stages (see the sketch after this list):
- Broadcasting global context information to all devices.
- Exchanging local context information with neighboring GPUs.
- Final synchronization to ensure all devices have the required context for processing.
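The three-stage exchange maps naturally onto standard collective and point-to-point primitives. The following is a minimal sketch using torch.distributed, assuming one rank per GPU and one clip of the video latent per rank; `denoise_clip`, `global_ctx`, and the boundary-frame slicing are hypothetical placeholders rather than the paper's actual implementation.

```python
# A minimal sketch of the three-stage communication pattern, assuming one
# torch.distributed rank per GPU and one clip of the video latent per rank.
# `denoise_clip`, `global_ctx`, and the boundary-frame slicing are
# illustrative placeholders, not the paper's actual API.
import torch
import torch.distributed as dist

def clip_parallel_step(clip_latent, global_ctx, denoise_clip):
    rank, world = dist.get_rank(), dist.get_world_size()

    # Stage 1: broadcast global context from rank 0 so every device shares
    # the same long-range information.
    dist.broadcast(global_ctx, src=0)

    # Stage 2: exchange boundary (local) context with neighboring ranks so
    # adjacent clips stay temporally consistent. Zeros act as padding at the
    # two ends of the sequence.
    left_ctx = torch.zeros_like(clip_latent[:, :1])
    right_ctx = torch.zeros_like(clip_latent[:, -1:])
    reqs = []
    if rank > 0:
        reqs.append(dist.isend(clip_latent[:, :1].contiguous(), dst=rank - 1))
        reqs.append(dist.irecv(left_ctx, src=rank - 1))
    if rank < world - 1:
        reqs.append(dist.isend(clip_latent[:, -1:].contiguous(), dst=rank + 1))
        reqs.append(dist.irecv(right_ctx, src=rank + 1))
    for req in reqs:
        req.wait()

    # Stage 3: synchronize so every device has the context it needs before
    # running its denoising step.
    dist.barrier()
    return denoise_clip(clip_latent, global_ctx, left_ctx, right_ctx)
```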
Dual-Scope Attention
The Dual-scope attention mechanism optimizes the temporal self-attention in video diffusion models by balancing local and global contexts. This is achieved through:
- Local Context: Gathering key and value pairs from neighboring frames, ensuring the continuity of immediate temporal information.
- Global Context: Incorporating key and value pairs from frames sampled across the video to maintain long-range temporal coherence.
By interleaving local and global contexts, Dual-scope attention enhances the coherence of extended video segments without requiring additional model training.
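In code, the mechanism amounts to selecting which key/value frames each query frame attends to. The sketch below illustrates the idea for a single query frame over per-frame features of shape (frames, dim); `local_window` and `global_stride` are assumed hyperparameter names used for illustration, not the paper's notation.

```python
# A minimal sketch of dual-scope temporal attention for one query frame.
# `local_window` and `global_stride` are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def dual_scope_attention(q, k, v, frame_idx, local_window=4, global_stride=16):
    # q, k, v: per-frame queries, keys, values with shape (frames, dim).
    n = k.shape[0]

    # Local context: keys/values from frames neighboring the query frame,
    # preserving immediate temporal continuity.
    lo, hi = max(0, frame_idx - local_window), min(n, frame_idx + local_window + 1)
    local_idx = torch.arange(lo, hi)

    # Global context: keys/values sampled uniformly across the whole video,
    # preserving long-range coherence.
    global_idx = torch.arange(0, n, global_stride)

    # Interleave the two scopes by merging their key/value sets.
    idx = torch.unique(torch.cat([local_idx, global_idx]))
    k_mix, v_mix = k[idx], v[idx]

    # Standard scaled dot-product attention for the query frame.
    scores = q[frame_idx] @ k_mix.T / k_mix.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_mix
```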
Experimental Results and Comparative Analysis
The efficacy of Video-Infinity was demonstrated on a setup of 8 Nvidia 6000 Ada GPUs. This arrangement generated videos of up to 2,300 frames (approximately 95 seconds at 24 fps) in about 5 minutes, a significant improvement over previous methods.
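The quoted duration follows directly from the frame count and the playback rate:

$$\frac{2300\ \text{frames}}{24\ \text{frames/s}} \approx 95.8\ \text{s}$$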
Performance Metrics
Video-Infinity was evaluated against existing methods, including FreeNoise, Streaming T2V, and OpenSora V1.1. The results highlighted its superior performance in terms of both capacity and efficiency:
- Maximum Frame Capability: Video-Infinity generates videos of up to 2,300 frames, surpassing the other methods by a substantial margin.
- Generation Speed: It was 100 times faster than Streaming T2V for generating 1024-frame videos.
- Video Quality: Evaluated using VBench metrics, Video-Infinity maintained high-quality scores across various dimensions, including subject consistency, motion smoothness, and dynamic degree.
Theoretical and Practical Implications
The introduction of Video-Infinity has significant implications for both theoretical advancements and practical applications in AI-driven video generation:
- Scalability: The distributed approach enables the generation of much longer videos without compromising quality, paving the way for scalable video generation technologies.
- Efficiency: By effectively distributing computational loads and optimizing communication strategies, this method reduces processing times, making long video generation more feasible for real-world applications.
Future Directions
The paper opens several avenues for future research:
- Enhanced Synchronization Techniques: Further improvements in inter-GPU communication protocols could minimize latency and increase efficiency.
- Adaptation to Different Architectures: Adapting the principles of Clip parallelism and Dual-scope attention to other model architectures could broaden their applicability.
- Handling Scene Transitions: Developing mechanisms to handle smooth transitions between distinct scenes within long videos could enhance the overall utility of the method.
In conclusion, Video-Infinity presents a substantial advance in long video generation, addressing critical limitations of existing methods through distributed processing techniques. Its contributions lay the groundwork for future developments in efficient, scalable AI-driven video generation.