UniVG: Towards UNIfied-modal Video Generation (2401.09084v1)

Published 17 Jan 2024 in cs.CV

Abstract: Diffusion-based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io.

Authors (5)
  1. Ludan Ruan
  2. Lei Tian
  3. Chuanwei Huang
  4. Xu Zhang
  5. Xinyan Xiao

Summary

  • The paper introduces UniVG, a novel approach that supports multi-task video generation by unifying text and image conditions under high- and low-freedom settings.
  • It employs a Multi-condition Cross Attention module for high-freedom tasks and Biased Gaussian Noise for low-freedom tasks to maintain precise adherence to input conditions.
  • Quantitative measures and human evaluations demonstrate UniVG's strong performance and frame consistency, rivaling leading systems like Gen2.

Overview of Unified-modal Video Generation

The pursuit of sophisticated video generation systems has led to remarkable advancements, particularly with diffusion-based generative models. Current systems primarily handle singular objectives, such as text-to-video or image-to-video generation. This limited scope fails to serve users who need flexibility in how they specify conditions: they may supply only text, only an image, or both together. To resolve this, the authors introduce Unified-modal Video Generation (UniVG), which supports multi-task video creation by accommodating a diverse range of input conditions across both text and image modalities.

Categorizing Video Generation Tasks

The core innovation of UniVG lies in its categorization of video generation tasks into high-freedom and low-freedom categories according to their degree of generative freedom. High-freedom tasks come with loosely defined input conditions, granting the model a broad canvas on which to render videos. In contrast, low-freedom tasks operate within strict constraints, often at the pixel level, and demand precise adherence to the input conditions.

For high-freedom tasks, UniVG employs a Multi-condition Cross Attention module that aligns the generated video with the semantics of the input text and images while leaving the model substantial creative latitude. For low-freedom tasks, it replaces pure random Gaussian noise with Biased Gaussian Noise, which anchors the denoising process to the content of the input conditions and thus better preserves them during generation.
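
To make these two mechanisms concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's implementation: the module names, dimensions, and the exact bias schedule are invented for clarity. The cross-attention simply lets video tokens attend to concatenated text and image embeddings, and the biased noise is obtained by forward-diffusing the condition latent instead of sampling pure N(0, I).

```python
import torch
import torch.nn as nn


class MultiConditionCrossAttention(nn.Module):
    """Illustrative cross-attention over concatenated text and image
    embeddings (hypothetical layer, not the paper's exact module)."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, video_tokens, text_emb, image_emb):
        # Concatenate both condition streams so a single attention pass can
        # draw on either modality, or both, depending on what the user gave.
        cond = torch.cat([text_emb, image_emb], dim=1)
        out, _ = self.attn(query=video_tokens, key=cond, value=cond)
        return video_tokens + out  # residual connection


def biased_gaussian_noise(cond_latent, t, alphas_cumprod):
    """Assumed form of the biased starting noise for low-freedom tasks:
    forward-diffuse the condition latent to step t so denoising starts from
    noise whose mean is shifted toward the content to be preserved."""
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(cond_latent)
    return a_t.sqrt() * cond_latent + (1.0 - a_t).sqrt() * noise
```

In this reading, low-freedom tasks such as image animation or super-resolution would begin sampling from biased_gaussian_noise rather than from pure noise, while high-freedom text- or image-to-video generation would rely mainly on the cross-attention path.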

Advancements in System Performance

UniVG achieves the lowest FVD on the MSR-VTT benchmark among the compared methods, surpasses current open-source methods in human evaluations, and is on par with Gen2, a leading closed-source system. It accommodates flexible conditioning, ranging from text- or image-driven generation to tighter tasks such as image animation and super-resolution, and it is particularly strong in frame consistency, which contributes to coherent, visually appealing videos.
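
Frame consistency is commonly approximated as the mean CLIP similarity between consecutive frames. The helper below sketches that proxy; it is an assumption for illustration, not necessarily the protocol used in the paper, and the model name is only a placeholder.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection


def clip_frame_consistency(frames, model_name="openai/clip-vit-base-patch32"):
    """Mean cosine similarity of CLIP image embeddings between consecutive
    frames; `frames` is assumed to be a list of PIL images from one video."""
    processor = CLIPImageProcessor.from_pretrained(model_name)
    model = CLIPVisionModelWithProjection.from_pretrained(model_name).eval()
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        embeds = model(**inputs).image_embeds  # (num_frames, dim)
    embeds = F.normalize(embeds, dim=-1)
    return (embeds[:-1] * embeds[1:]).sum(dim=-1).mean().item()
```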

Moreover, the UniVG framework allows the influence of the text and image inputs to be scaled independently, yielding a spectrum of videos that ranges from predominantly text-driven to closely image-aligned, a testament to the system's adaptability.
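
One plausible way to realize such independent scaling is classifier-free guidance with a separate scale for each condition, as used in other multi-condition diffusion models. The function below is a sketch of that idea under this assumption; the paper may combine conditions differently.

```python
def multi_condition_guidance(eps_uncond, eps_text, eps_image,
                             w_text: float, w_image: float):
    """Blend noise predictions so text and image influence can be tuned
    independently. eps_uncond, eps_text, and eps_image are the denoiser's
    outputs with no condition, text only, and image only, respectively."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_image * (eps_image - eps_uncond))
```

Raising w_text while lowering w_image yields mostly text-driven results, and the reverse keeps the output close to the reference image.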

Future Directions and Conclusion

This approach not only unifies several video generation tasks within a single system but also suggests potential applications to other constrained generation tasks. While the current iteration already generates high-quality videos that align well with their conditions, future work may enhance the dynamics of the generated motion and extend the method to related domains. UniVG offers a promising toolkit for video generation across a wide array of input conditions, broadening the horizon for both creators and AI in video production.
