AtomoVideo: High Fidelity Image-to-Video Generation (2403.01800v2)

Published 4 Mar 2024 in cs.CV

Abstract: Recently, video generation has developed rapidly, building on superior text-to-image generation techniques. In this work, we propose a high fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation. Furthermore, due to the design of adapter training, our approach can be well combined with existing personalized models and controllable modules. Through quantitative and qualitative evaluation, AtomoVideo achieves superior results compared to popular methods; more examples can be found on our project website: https://atomo-video.github.io/.

AtomoVideo: An Approach to High Fidelity Image-to-Video Generation

Overview of AtomoVideo

AtomoVideo is a framework for image-to-video (I2V) generation that keeps the generated video highly faithful to the input image. The framework uses multi-granularity image injection to balance motion intensity against temporal consistency. The architecture also supports long video sequence prediction and integration with personalized models for customized video generation. In comparisons against existing methods, AtomoVideo demonstrates superior performance across several metrics.

Image-to-Video Generation Framework

AtomoVideo takes a distinctive approach to the main challenges in I2V generation. Key elements of its methodology include:

  • Multi-Granularity Image Injection: This technique permits the incorporation of both high-level semantic cues and low-level image details, enhancing the fidelity and consistency of the generated video with respect to the source image.
  • Iterative Generation for Long Videos: By predicting successive video frames from preceding ones, AtomoVideo can generate long video sequences (a minimal sketch of this loop follows this list).
  • Adaptable Architecture: The framework's design enables seamless integration with pre-existing text-to-image (T2I) and controllable generative models, allowing for extensive customization.
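
The iterative-generation idea can be made concrete with a short sketch. Here `generate_clip` is a hypothetical stand-in for one AtomoVideo-style sampling pass; the real model's conditioning interface, clip length, and number of conditioning frames are assumptions, not the paper's exact settings.

```python
import torch

def generate_long_video(generate_clip, first_frame, num_clips=4, clip_len=16, cond_len=1):
    """Iteratively extend a video: each new clip is conditioned on the last
    `cond_len` frames of the previously generated clip.

    generate_clip(cond_frames) -> Tensor[clip_len, C, H, W] is a placeholder
    for one sampling pass of an image-to-video model.
    """
    video = [first_frame.unsqueeze(0)]       # [1, C, H, W] seed frame
    cond = video[0]
    for _ in range(num_clips):
        clip = generate_clip(cond)           # [clip_len, C, H, W]
        video.append(clip)
        cond = clip[-cond_len:]              # condition the next clip on the tail
    return torch.cat(video, dim=0)           # [1 + num_clips * clip_len, C, H, W]

# Toy usage with a dummy sampler that just repeats the last conditioning frame.
dummy = lambda cond: cond[-1:].repeat(16, 1, 1, 1)
frames = generate_long_video(dummy, torch.zeros(3, 64, 64))
print(frames.shape)  # torch.Size([65, 3, 64, 64])
```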

Methodological Insights

The paper describes the technical underpinnings of AtomoVideo in three parts:

  1. Overall Pipeline:
    • Builds on pre-trained T2I models.
    • Adds 1D temporal convolution and temporal attention modules to model motion across frames.
    • Employs an image condition latent and a binary frame mask to enrich the input (see the sketch after this list).
  2. Image Injection Strategy:
    • Injects both VAE-encoded low-level image detail and high-level semantic cues.
    • Ensures robust fidelity to the input image through this multi-granularity injection.
  3. Video Frame Prediction:
    • Enables long video generation through iterative prediction.
    • Uses training strategies that ensure quick convergence and stability.
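
A minimal sketch of the two ingredients named above: a 1D temporal attention block operating over the frame axis, and a conditioning input built by concatenating the noisy video latents with a VAE-encoded image latent and a binary frame mask. The tensor layout, channel counts, and module design are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at every
    spatial location. A generic stand-in for the paper's temporal modules."""
    def __init__(self, channels, num_heads=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                    # x: [B, F, C, H, W]
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = (tokens + out).reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return out

def build_unet_input(noisy_latents, image_latent, cond_mask):
    """Channel-wise concatenation of the noisy video latents, the VAE-encoded
    image latent (zeroed on frames that are not conditioned), and a binary
    mask marking which frames are given. Shapes and layout are assumptions."""
    # noisy_latents: [B, F, 4, H, W]; image_latent: [B, 4, H, W]; cond_mask: [B, F]
    b, f, c, h, w = noisy_latents.shape
    mask = cond_mask.view(b, f, 1, 1, 1).expand(b, f, 1, h, w)
    img = image_latent.unsqueeze(1).expand(b, f, c, h, w) * mask
    return torch.cat([noisy_latents, img, mask], dim=2)      # [B, F, 9, H, W]

x = build_unet_input(torch.randn(1, 16, 4, 32, 32),          # noisy video latents
                     torch.randn(1, 4, 32, 32),              # image condition latent
                     torch.tensor([[1.] + [0.] * 15]))       # only frame 0 is given
print(x.shape, TemporalAttention(channels=9)(x).shape)
```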

Evaluation and Performance

Quantitative and qualitative assessments show that AtomoVideo generates videos that are consistent with the input image while exhibiting fluid motion and high quality. In particular, the model demonstrates:

  • Improved Image Consistency and Temporal Consistency scores over its contemporaries.
  • Superior motion effects, reflected in its RAFT-based motion scores (a rough metric sketch follows this list).
  • Competitive video quality, reflected in its DOVER scores.
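
A RAFT-based motion score can be approximated by averaging optical-flow magnitude between consecutive frames, as in the rough sketch below using torchvision's pretrained RAFT model. The exact aggregation used in the paper (which frames and statistics enter the score) is an assumption here, and the call downloads pretrained weights.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

def motion_intensity(frames):
    """Mean optical-flow magnitude between consecutive frames, a rough proxy
    for a RAFT-based motion score. `frames`: float tensor [T, 3, H, W] in [0, 1],
    with H and W divisible by 8."""
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    transforms = weights.transforms()
    prev, nxt = frames[:-1], frames[1:]
    prev, nxt = transforms(prev, nxt)        # normalize as the weights expect
    with torch.no_grad():
        flow = model(prev, nxt)[-1]          # final refinement: [T-1, 2, H, W]
    return flow.norm(dim=1).mean().item()    # average per-pixel displacement

# Example with random frames; real use would load a generated video clip.
print(motion_intensity(torch.rand(8, 3, 256, 256)))
```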

Implications and Future Directions

AtomoVideo sets a new benchmark in I2V generation with its methodological innovations and performance metrics. Its ability to produce high-quality videos from still images has profound implications for content creation in digital media, gaming, and virtual reality. Looking ahead, there’s potential for:

  • Enhanced controllability in video generation.
  • Integration with more advanced T2I models for even higher fidelity and quality.
  • Exploration of application-specific customizations for diverse industry needs.

Conclusion

This paper introduces AtomoVideo, a framework that significantly advances the fidelity and quality of image-to-video generation. Through innovative strategies for image injection and iterative long video generation, AtomoVideo achieves exceptional performance. The framework not only paves the way for highly realistic video content creation from static images but also offers promising avenues for further research and development in controllable and high-quality video generation technologies.

Authors (7)
  1. Litong Gong
  2. Yiran Zhu
  3. Weijie Li
  4. Xiaoyang Kang
  5. Biao Wang
  6. Tiezheng Ge
  7. Bo Zheng