I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models (2311.04145v1)

Published 7 Nov 2023 in cs.CV

Abstract: Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280$\times$720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at \url{https://i2vgen-xl.github.io}.

Citations (137)

Summary

  • The paper introduces a cascaded diffusion framework that produces semantically coherent, low-resolution videos from static images.
  • It utilizes dual encoders—a fixed CLIP encoder and a learnable content encoder—to extract both high-level semantics and fine details for accurate video synthesis.
  • The refinement stage enhances resolution and temporal consistency, demonstrating superior performance on benchmarks with dynamic motion and structural fidelity.

Introduction

The field of video synthesis has grown rapidly thanks to advances in diffusion models, which have shown strong results in generating realistic images. Nonetheless, the transition from static images to dynamic video presents additional challenges, particularly in achieving semantic accuracy, clarity, and spatio-temporal continuity. The newly proposed I2VGen-XL model tackles these obstacles through a cascaded diffusion approach.

The I2VGen-XL Framework

I2VGen-XL operates in two main stages. The first is the base stage, which ensures the semantic coherence of generated low-resolution videos while preserving the content and structure of the input images. This is achieved through two hierarchical encoders: a fixed CLIP encoder that extracts high-level semantics and a learnable content encoder that captures fine details. Both feed into a video diffusion model dedicated to this stage, as sketched below.
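
A minimal sketch of what such a dual-encoder conditioning path could look like, assuming a frozen CLIP image encoder for semantics and a small learnable content encoder for details; all class names, shapes, and the stand-in CLIP module are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Learnable encoder that extracts fine-detail features from the input image."""
    def __init__(self, in_ch=3, dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, dim),
        )

    def forward(self, x):
        return self.net(x)                           # (B, dim)

class BaseStageConditioner(nn.Module):
    """Combines frozen CLIP semantics with learnable detail features."""
    def __init__(self, clip_image_encoder, dim=1024):
        super().__init__()
        self.clip = clip_image_encoder               # frozen: high-level semantics
        for p in self.clip.parameters():
            p.requires_grad_(False)
        self.content = ContentEncoder(dim=dim)       # learnable: fine details

    def forward(self, image):
        with torch.no_grad():
            sem = self.clip(image)                   # (B, dim) semantic features
        det = self.content(image)                    # (B, dim) detail features
        # The base-stage video diffusion UNet attends to both feature streams.
        return torch.stack([sem, det], dim=1)        # (B, 2, dim)

# Usage with a tiny stand-in for CLIP (a real run would load an actual CLIP model):
clip_stub = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1024))
tokens = BaseStageConditioner(clip_stub)(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                  # torch.Size([2, 2, 1024])
```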

The second stage is the refinement stage, which increases the video resolution to 1280×720 and corrects details and artifacts present in the base-stage output. It uses a separate video diffusion model that is optimized for the first 600 (high-noise) denoising steps and takes a brief text description as additional input for further refinement. This noise-then-denoise pass yields higher-definition output with improved spatial and temporal coherence, as illustrated in the sketch below.
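
The sketch below illustrates that idea: upsample the base-stage video, re-noise it to an intermediate step, and denoise from there at the higher resolution. The refine_unet call signature, the simple DDPM-style schedule, and the DDIM-style update are hypothetical stand-ins for illustration, not the released model or code:

```python
import torch
import torch.nn.functional as F

def refine_video(base_video, text_emb, refine_unet,
                 start_step=600, total_steps=1000, size=(720, 1280)):
    b, c, t, h, w = base_video.shape
    # 1) Upsample each frame of the base-stage output to the target resolution.
    frames = base_video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    frames = F.interpolate(frames, size=size, mode="bilinear", align_corners=False)
    video = frames.reshape(b, t, c, *size).permute(0, 2, 1, 3, 4)

    # Simple DDPM-style noise schedule (for illustration only).
    betas = torch.linspace(1e-4, 0.02, total_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    # 2) Re-noise to an intermediate step; the refinement model is optimized
    #    for these early, high-noise denoising steps.
    a_bar = alphas_cumprod[start_step]
    x = a_bar.sqrt() * video + (1 - a_bar).sqrt() * torch.randn_like(video)

    # 3) Denoise from start_step back to 0, conditioned on a short text prompt.
    for step in reversed(range(start_step)):
        eps = refine_unet(x, step, text_emb)                 # predicted noise
        a_bar = alphas_cumprod[step]
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # clean-video estimate
        prev = alphas_cumprod[step - 1] if step > 0 else torch.tensor(1.0)
        x = prev.sqrt() * x0 + (1 - prev).sqrt() * eps       # DDIM-style update
    return x

# Toy usage with a dummy denoiser and tiny tensors (for illustration only):
dummy_unet = lambda x, t, emb: torch.zeros_like(x)
out = refine_video(torch.randn(1, 3, 2, 45, 80), None, dummy_unet,
                   start_step=5, total_steps=10, size=(90, 160))
```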

Training and Evaluation

For training I2VGen-XL, a dataset of around 35 million single-shot text-video pairs and 6 billion text-image pairs was compiled. In the base model, the spatial components of a 3D UNet are initialized with pre-trained parameters from an image diffusion model, giving the system strong spatial generation capabilities (a minimal initialization sketch follows). The refinement model inherits these weights and is further trained on high-resolution samples to focus on enhancing spatio-temporal detail.
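
A common recipe for this kind of spatial initialization is to copy matching 2D weights from the image model into the video model and leave the temporal layers at their fresh initialization. The sketch below assumes spatial layers share parameter names and shapes with the pre-trained image model; the function and its key-matching rule are illustrative, not the released training code:

```python
import torch

def load_spatial_weights(video_unet, image_state_dict):
    """Copy 2D spatial weights into the video model; keep temporal layers as initialized."""
    video_state = video_unet.state_dict()
    loaded, skipped = 0, 0
    for name, w2d in image_state_dict.items():
        if name in video_state and video_state[name].shape == w2d.shape:
            video_state[name] = w2d            # spatial conv / attention weights
            loaded += 1
        else:
            skipped += 1                       # temporal or mismatched layers
    video_unet.load_state_dict(video_state)
    print(f"initialized {loaded} tensors from the image model, skipped {skipped}")

# Toy usage: a "video" layer initialized from a matching "image" layer.
img = torch.nn.Linear(4, 4)
vid = torch.nn.Linear(4, 4)
load_spatial_weights(vid, img.state_dict())    # initialized 2 tensors, skipped 0
```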

Extensive evaluation benchmarks I2VGen-XL against current top methods and shows that it generates videos with more dynamic, realistic motion while better preserving the content and structure of the input images. Fine-tuning the refinement model on the initial, high-noise denoising steps is central to its gains in video quality.

Limitations and Conclusion

I2VGen-XL marks a significant step forward in synthetic video generation, but it is not without limitations. These include difficulty generating stable human body movements, limited ability to produce long videos with coherent storylines, and an incomplete understanding of user intent from text prompts, which remains a critical area for improvement.

In conclusion, I2VGen-XL points toward a bright future for video synthesis, bridging the gap between image and video generation and advancing the ease and quality of content creation. The public release of the source code and models makes I2VGen-XL accessible and invites the wider community to explore and build on it.
