
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models (2312.13964v3)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motion to these personalized images via text poses significant challenges: preserving distinct styles and high-fidelity details while achieving motion controllability through text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and remaining compatible with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the condition module, which takes the condition frame and inter-frame affinity as input and transfers appearance information, guided by the affinity hint, for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment and allows a stronger focus on aligning with motion-related guidance.
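The condition module described above lends itself to a compact sketch. The PyTorch snippet below is a minimal illustration, assuming a simple convolutional fusion of the noisy frame latents, the condition-frame latent, and a per-frame affinity hint; the class name `ConditionModule`, the layer choices, and the shapes are assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the condition module, in PyTorch. All shapes, layer
# choices, and names here are illustrative assumptions, not the paper's
# exact implementation.
import torch
import torch.nn as nn

class ConditionModule(nn.Module):
    """Fuses the condition-frame latent and an inter-frame affinity hint
    into each frame's noisy latent before it enters the diffusion U-Net."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # Per-frame input: noisy latent (C) + condition latent (C) + affinity (1).
        self.proj = nn.Conv2d(2 * latent_channels + 1, latent_channels,
                              kernel_size=3, padding=1)

    def forward(self, noisy_latents, cond_latent, affinity):
        # noisy_latents: (B, F, C, H, W) latents for F frames
        # cond_latent:   (B, C, H, W) latent of the condition image
        # affinity:      (B, F) scalar affinity hint per frame
        b, f, c, h, w = noisy_latents.shape
        cond = cond_latent.unsqueeze(1).expand(b, f, c, h, w)
        aff = affinity.view(b, f, 1, 1, 1).expand(b, f, 1, h, w)
        x = torch.cat([noisy_latents, cond, aff], dim=2)   # (B, F, 2C+1, H, W)
        x = x.reshape(b * f, 2 * c + 1, h, w)
        return self.proj(x).reshape(b, f, c, h, w)
```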


Summary

  • The paper introduces PIA, a method that integrates temporal alignment layers and an innovative condition module to animate personalized images guided by text.
  • It achieves superior motion controllability and precise image alignment, as validated by extensive evaluations on the AnimateBench benchmark.
  • PIA’s plug-and-play design enables flexible motion magnitude control and style transfer, broadening its applications in personalized content creation.

Introduction to Personalized Image Animation

The ability to generate personalized images using text-to-image (T2I) models has significantly enhanced creative content production. These models enable users to create images that reflect their unique styles and interests. The natural next step in the evolution of T2I models is animating these static images: the objective is not just to add motion, but to control that motion and preserve high-fidelity details through text prompts.

Research Breakthrough

The newly introduced Personalized Image Animator (PIA) demonstrates outstanding performance in animating personalized images with realistic motion. PIA builds on the base text-to-image model to preserve the distinct style of the original images. It incorporates well-trained temporal alignment layers and an innovative condition module that transfers appearance details from the reference image, guided by an inter-frame affinity hint. Because the condition module handles appearance alignment, the rest of the network can focus on motion, significantly enhancing motion controllability; earlier approaches struggled to align with the input image and follow text-driven motion at the same time.
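The plug-and-play idea can be sketched as follows: the spatial blocks of any personalized T2I U-Net stay frozen, while shared, pretrained temporal alignment layers are interleaved between them. The PyTorch sketch below is a hedged illustration; `TemporalAttention` and the block wiring are assumptions standing in for PIA's actual modules, which are trained once on video data and reused across personalized models.

```python
# A hedged sketch of the plug-and-play assembly: frozen spatial blocks from a
# personalized T2I model, interleaved with shared temporal layers. Names and
# wiring are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied per spatial location."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (B, F, C, H, W) -> attend across F for every (H, W) position.
        b, f, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        out, _ = self.attn(self.norm(seq), self.norm(seq), self.norm(seq))
        out = (seq + out).reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return out

def animate(personalized_unet_blocks, temporal_layers, latents):
    # Spatial blocks come from any personalized T2I model (kept frozen);
    # temporal layers are the shared, pretrained alignment modules.
    x = latents                                            # (B, F, C, H, W)
    for spatial, temporal in zip(personalized_unet_blocks, temporal_layers):
        b, f = x.shape[:2]
        x = spatial(x.flatten(0, 1)).unflatten(0, (b, f))  # per-frame 2D block
        x = temporal(x)                                    # align across frames
    return x
```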

Evaluation and Results

To evaluate and benchmark PIA's effectiveness, the authors developed AnimateBench, a comprehensive benchmark comprising a diverse set of personalized T2I models, curated images, and motion-related prompts. Extensive tests show that PIA surpasses alternative methods in motion controllability while faithfully reproducing the appearance of the original image, even across varied text inputs. Qualitative and quantitative evaluations on AnimateBench provide robust evidence of PIA's animation capabilities.
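While the paper's exact evaluation protocol is not reproduced here, image and text alignment on a benchmark like AnimateBench is commonly measured with CLIP similarity. The snippet below is an assumed sketch using the open-source CLIP from Hugging Face transformers; the checkpoint choice and the averaging scheme are illustrative, not necessarily the paper's.

```python
# An assumed sketch of CLIP-based alignment scoring: frames vs. the condition
# image (image alignment) and frames vs. the prompt (text alignment).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def alignment_scores(frames, condition_image, prompt):
    """frames: list of PIL images; returns (image_align, text_align)."""
    inputs = processor(text=[prompt], images=frames + [condition_image],
                       return_tensors="pt", padding=True)
    img_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    frame_feats, cond_feat = img_feats[:-1], img_feats[-1:]
    image_align = (frame_feats @ cond_feat.T).mean().item()  # frames vs. input
    text_align = (frame_feats @ txt_feats.T).mean().item()   # frames vs. prompt
    return image_align, text_align
```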

Conclusion and Applications

PIA offers a powerful and flexible solution for personalized image animation. Its strong image alignment, text-driven motion controllability, and seamless integration with various personalized T2I models give users an engaging and customizable animation experience. Moreover, PIA enables intriguing applications such as motion control by text prompt, motion magnitude control, and even style transfer, obtained by applying a personalized T2I model of a different domain style to the given image. By mitigating the trade-off between appearance consistency and motion controllability found in previous animation attempts, PIA reshapes the possibilities within the personalized content community.
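Motion magnitude control plausibly works by adjusting the inter-frame affinity hint at inference time: the lower the affinity assigned to later frames, the more they may deviate from the condition image. The sketch below shows one assumed parameterization of such a schedule; the exact schedule PIA uses is not specified here.

```python
# An assumed sketch of motion-magnitude control via the inter-frame affinity
# hint. The linear decay and the `magnitude` knob are illustrative choices.
import torch

def affinity_schedule(num_frames: int, magnitude: float) -> torch.Tensor:
    """Monotonically decaying affinity per frame; `magnitude` in (0, 1]
    controls how quickly frames may drift from the condition image."""
    t = torch.arange(num_frames, dtype=torch.float32) / max(num_frames - 1, 1)
    return 1.0 - magnitude * t  # frame 0 keeps affinity 1.0 (the input frame)

small_motion = affinity_schedule(16, magnitude=0.2)  # stays near the image
large_motion = affinity_schedule(16, magnitude=0.8)  # allows bigger changes
```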
