VideoTetris: Towards Compositional Text-to-Video Generation
Introduction
The paper introduces VideoTetris, a novel framework that extends text-to-video (T2V) generation models to complex compositional and long video generation scenarios. Current state-of-the-art diffusion models for T2V generation struggle with scenes that compose multiple objects or change the number of objects over time, often producing outputs that fail to follow complex input text prompts. VideoTetris addresses these limitations through a combination of spatio-temporal compositional diffusion and enhanced data preprocessing techniques.
Core Contributions
The paper presents multiple key contributions to the field of text-to-video generation:
- Spatio-Temporal Compositional Diffusion:
  - The authors propose a novel diffusion method that manipulates the cross-attention mechanisms of denoising networks both spatially and temporally. This method enables the precise composition of complex textual semantics in generated videos.
- Enhanced Video Data Preprocessing:
  - An improved preprocessing pipeline is designed to enhance training data in terms of motion dynamics and prompt understanding. This includes methods to filter and recaption video data, ensuring that the training set contains high-quality, semantically rich video-text pairs.
- Reference Frame Attention Mechanism:
  - A new regularization technique is introduced to maintain content consistency across frames in auto-regressive video generation. The Reference Frame Attention mechanism ensures that multiple objects maintain their appearance and positions consistently throughout generated videos.
Detailed Insights
Spatio-Temporal Compositional Diffusion
The proposed Spatio-Temporal Compositional Diffusion technique localizes sub-objects within video frames by decomposing the input prompt both temporally and spatially. The method computes cross-attention values for each sub-object and composes the resulting attention maps so the sub-objects are integrated naturally within the denoising process. The approach is computationally efficient and does not require retraining the underlying diffusion model.
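The composition step can be pictured as region- and time-masked cross-attention: each sub-prompt attends only within its assigned spatial region and active frame range, and the per-region outputs are blended back together. The sketch below is a minimal illustration under assumed tensor shapes; the `cross_attention` helper, the mask and frame-range format of `sub_prompts`, and the normalization are simplifications rather than the paper's exact implementation.

```python
import torch

def cross_attention(q, k, v):
    # Standard scaled dot-product attention: q (N, d), k/v (M, d) -> (N, d).
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
    return attn @ v

def compose_cross_attention(q, sub_prompts, frame_idx):
    """Blend per-sub-prompt cross-attention outputs with spatial masks.

    q           : (H * W, dim) latent tokens of one video frame.
    sub_prompts : list of dicts with text keys/values ("k", "v"), a binary
                  spatial mask of shape (H, W) at latent resolution, and an
                  active frame range "frames" = (start, end).
    """
    out = torch.zeros_like(q)
    weight = torch.zeros_like(q[:, :1])
    for sp in sub_prompts:
        start, end = sp["frames"]                     # temporal decomposition
        if not (start <= frame_idx < end):
            continue                                  # sub-prompt inactive at this frame
        mask = sp["mask"].reshape(-1, 1).to(q.dtype)  # spatial decomposition
        out = out + mask * cross_attention(q, sp["k"], sp["v"])
        weight = weight + mask
    return out / weight.clamp(min=1.0)                # average overlapping regions
```

Because the routine only reweights standard cross-attention outputs according to where and when each sub-prompt applies, it can be dropped into a pretrained denoising network, which matches the training-free nature of the method described above.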
Enhanced Video Data Preprocessing Pipeline
The preprocessing pipeline filters for suitable motion dynamics and enriches prompt semantics to ensure high-quality training data. Specifically, the authors use optical-flow metrics to filter video data, selecting clips that exhibit an appropriate level of motion. Additionally, multimodal LLMs are employed to recaption video descriptions, enriching the semantic content of the training prompts.
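A minimal sketch of how such optical-flow filtering might look, assuming OpenCV's Farneback flow as the motion estimator; the sampling stride and magnitude thresholds are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def mean_flow_magnitude(video_path, stride=8, max_pairs=16):
    """Average optical-flow magnitude over sampled frame pairs of a clip."""
    cap = cv2.VideoCapture(video_path)
    mags, prev, idx = [], None, 0
    while len(mags) < max_pairs:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                # Dense Farneback optical flow between two sampled frames.
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                mags.append(np.linalg.norm(flow, axis=-1).mean())
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(mags)) if mags else 0.0

def keep_video(video_path, low=0.5, high=8.0):
    """Keep clips with moderate motion: not static, not chaotic."""
    m = mean_flow_magnitude(video_path)
    return low <= m <= high
```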
Reference Frame Attention Mechanism
To address the challenge of maintaining object consistency across frames, the paper introduces the Reference Frame Attention mechanism. This technique aligns reference images with latent frame features, providing a cohesive representation throughout the video, so that objects added or removed during long video generation retain a consistent appearance and position.
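A minimal sketch of a reference-frame attention layer, assuming the reference frame is encoded into the same latent token space as the generated frames; the module name, dimensions, and residual wiring are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ReferenceFrameAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, ref_feats):
        """frame_feats: (B, N, D) latent tokens of the frame being denoised.
        ref_feats:   (B, M, D) latent tokens of the reference frame(s)."""
        # Queries come from the current frame; keys/values from the reference,
        # pulling the reference appearance into each newly generated frame.
        attended, _ = self.attn(self.norm(frame_feats), ref_feats, ref_feats)
        return frame_feats + attended  # residual keeps the frame's own content

# Usage: inject reference appearance into an auto-regressively generated frame.
layer = ReferenceFrameAttention(dim=320)
frame = torch.randn(1, 64 * 64, 320)
ref = torch.randn(1, 64 * 64, 320)
out = layer(frame, ref)
```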
Experimental Results
Short Compositional Video Generation
VideoTetris achieves superior performance in generating videos adhering to complex compositional prompts. The framework surpasses state-of-the-art models such as ModelScope, VideoCrafter2, and AnimateDiff in terms of both VBLIP-VQA and VUnidet scores. The generated videos exhibit clear and distinctive objects that align accurately with the input prompts, setting a new benchmark for compositional text-to-video generation.
Long Progressive Compositional Video Generation
In generating long videos with progressive compositional prompts, VideoTetris demonstrates strong capabilities in object integration and motion dynamics. The framework consistently outperforms competing models like FreeNoise and StreamingT2V, achieving higher VBLIP-VQA and VUnidet scores while maintaining robust content consistency.
Implications and Future Directions
The practical implications of VideoTetris are profound. By enabling accurate and coherent compositional video generation, this framework can significantly enhance applications in autonomous media, creative content generation, and educational technologies. Theoretically, the proposed spatio-temporal compositional diffusion and enhanced data preprocessing techniques provide valuable insights for future research in generative models.
Future work could explore more generalized methods for long video generation, potentially leveraging more efficient training techniques to overcome current computational limitations. Additionally, integrating compositional generation methods guided by various input conditions could broaden the application scope of such models.
Conclusion
VideoTetris introduces a robust and effective framework for compositional text-to-video generation, addressing significant gaps in current state-of-the-art models. The innovative spatio-temporal compositional diffusion method, enhanced data preprocessing pipeline, and Reference Frame Attention mechanism collectively contribute to generating high-quality, semantically accurate videos. This work lays a strong foundation for future advancements in the field, providing both practical tools and theoretical frameworks to enhance the capabilities of text-to-video diffusion models.