AtomoVideo: High Fidelity Image-to-Video Generation (2403.01800v2)

Published 4 Mar 2024 in cs.CV

Abstract: Recently, video generation has developed rapidly, building on superior text-to-image generation techniques. In this work, we propose a high fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation. Furthermore, due to the design of adapter training, our approach can be well combined with existing personalized models and controllable modules. Through quantitative and qualitative evaluation, AtomoVideo achieves superior results compared to popular methods; more examples can be found on our project website: https://atomo-video.github.io/.

AtomoVideo: An Approach to High Fidelity Image-to-Video Generation

Overview of AtomoVideo

AtomoVideo is a framework for image-to-video (I2V) generation that keeps the generated video highly faithful to the input image. The framework uses multi-granularity image injection to balance motion intensity against temporal consistency. The architecture also supports long video sequence prediction and integration with personalized models for customized video generation. In comparisons against existing methods, AtomoVideo demonstrates superior performance across several metrics.

Image-to-Video Generation Framework

AtomoVideo takes a distinctive approach to the main challenges in I2V generation. Key elements of its methodology include:

  • Multi-Granularity Image Injection: This technique permits the incorporation of both high-level semantic cues and low-level image details, enhancing the fidelity and consistency of the generated video with respect to the source image.
  • Iterative Generation for Long Videos: By predicting successive video frames from preceding ones, AtomoVideo can generate long video sequences (a minimal sketch of this loop follows this list).
  • Adaptable Architecture: The framework's design enables seamless integration with pre-existing text-to-image (T2I) and controllable generative models, allowing for extensive customization.
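
The iterative-generation idea can be made concrete with a short sketch. Here `generate_clip` is a hypothetical stand-in for one AtomoVideo-style sampling pass; the real model's conditioning interface, clip length, and number of conditioning frames are assumptions, not the paper's exact settings.

```python
import torch

def generate_long_video(generate_clip, first_frame, num_clips=4, clip_len=16, cond_len=1):
    """Iteratively extend a video: each new clip is conditioned on the last
    `cond_len` frames of the previously generated clip.

    generate_clip(cond_frames) -> Tensor[clip_len, C, H, W] is a placeholder
    for one sampling pass of an image-to-video model.
    """
    video = [first_frame.unsqueeze(0)]       # [1, C, H, W] seed frame
    cond = video[0]
    for _ in range(num_clips):
        clip = generate_clip(cond)           # [clip_len, C, H, W]
        video.append(clip)
        cond = clip[-cond_len:]              # condition the next clip on the tail
    return torch.cat(video, dim=0)           # [1 + num_clips * clip_len, C, H, W]

# Toy usage with a dummy sampler that just repeats the last conditioning frame.
dummy = lambda cond: cond[-1:].repeat(16, 1, 1, 1)
frames = generate_long_video(dummy, torch.zeros(3, 64, 64))
print(frames.shape)  # torch.Size([65, 3, 64, 64])
```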

Methodological Insights

The paper describes the technical underpinnings of AtomoVideo in three parts:

  1. Overall Pipeline:
    • Builds on pre-trained T2I models.
    • Adds 1D temporal convolution and temporal attention modules to model motion across frames.
    • Employs an image condition latent and a binary frame mask to enrich the input (see the sketch after this list).
  2. Image Injection Strategy:
    • Injects both VAE-encoded low-level image detail and high-level semantic cues.
    • Ensures robust fidelity to the input image through this multi-granularity injection.
  3. Video Frame Prediction:
    • Enables long video generation through iterative prediction.
    • Uses training strategies that ensure quick convergence and stability.
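
A minimal sketch of the two ingredients named above: a 1D temporal attention block operating over the frame axis, and a conditioning input built by concatenating the noisy video latents with a VAE-encoded image latent and a binary frame mask. The tensor layout, channel counts, and module design are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at every
    spatial location. A generic stand-in for the paper's temporal modules."""
    def __init__(self, channels, num_heads=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                    # x: [B, F, C, H, W]
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = (tokens + out).reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return out

def build_unet_input(noisy_latents, image_latent, cond_mask):
    """Channel-wise concatenation of the noisy video latents, the VAE-encoded
    image latent (zeroed on frames that are not conditioned), and a binary
    mask marking which frames are given. Shapes and layout are assumptions."""
    # noisy_latents: [B, F, 4, H, W]; image_latent: [B, 4, H, W]; cond_mask: [B, F]
    b, f, c, h, w = noisy_latents.shape
    mask = cond_mask.view(b, f, 1, 1, 1).expand(b, f, 1, h, w)
    img = image_latent.unsqueeze(1).expand(b, f, c, h, w) * mask
    return torch.cat([noisy_latents, img, mask], dim=2)      # [B, F, 9, H, W]

x = build_unet_input(torch.randn(1, 16, 4, 32, 32),          # noisy video latents
                     torch.randn(1, 4, 32, 32),              # image condition latent
                     torch.tensor([[1.] + [0.] * 15]))       # only frame 0 is given
print(x.shape, TemporalAttention(channels=9)(x).shape)
```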

Evaluation and Performance

Quantitative and qualitative assessments show that AtomoVideo generates videos that are consistent with the input image while exhibiting fluid motion and high quality. In particular, the model demonstrates:

  • Improved Image Consistency and Temporal Consistency scores over its contemporaries.
  • Superior motion effects, reflected in its RAFT-based motion scores (a rough metric sketch follows this list).
  • Competitive video quality, reflected in its DOVER scores.
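
A RAFT-based motion score can be approximated by averaging optical-flow magnitude between consecutive frames, as in the rough sketch below using torchvision's pretrained RAFT model. The exact aggregation used in the paper (which frames and statistics enter the score) is an assumption here, and the call downloads pretrained weights.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

def motion_intensity(frames):
    """Mean optical-flow magnitude between consecutive frames, a rough proxy
    for a RAFT-based motion score. `frames`: float tensor [T, 3, H, W] in [0, 1],
    with H and W divisible by 8."""
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    transforms = weights.transforms()
    prev, nxt = frames[:-1], frames[1:]
    prev, nxt = transforms(prev, nxt)        # normalize as the weights expect
    with torch.no_grad():
        flow = model(prev, nxt)[-1]          # final refinement: [T-1, 2, H, W]
    return flow.norm(dim=1).mean().item()    # average per-pixel displacement

# Example with random frames; real use would load a generated video clip.
print(motion_intensity(torch.rand(8, 3, 256, 256)))
```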

Implications and Future Directions

AtomoVideo sets a new benchmark in I2V generation with its methodological innovations and performance metrics. Its ability to produce high-quality videos from still images has profound implications for content creation in digital media, gaming, and virtual reality. Looking ahead, there’s potential for:

  • Enhanced controllability in video generation.
  • Integration with more advanced T2I models for even higher fidelity and quality.
  • Exploration of application-specific customizations for diverse industry needs.

Conclusion

This paper introduces AtomoVideo, a framework that significantly advances the fidelity and quality of image-to-video generation. Through innovative strategies for image injection and iterative long video generation, AtomoVideo achieves exceptional performance. The framework not only paves the way for highly realistic video content creation from static images but also offers promising avenues for further research and development in controllable and high-quality video generation technologies.

Authors (7)
  1. Litong Gong
  2. Yiran Zhu
  3. Weijie Li
  4. Xiaoyang Kang
  5. Biao Wang
  6. Tiezheng Ge
  7. Bo Zheng