
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation (2311.17117v3)

Published 28 Nov 2023 in cs.CV

Abstract: Character animation aims to generate character videos from still images using driving signals. Diffusion models have become the mainstream in visual generation research owing to their robust generative capabilities, yet challenges persist in image-to-video synthesis, especially in character animation, where temporally maintaining consistency with a character's detailed appearance remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve the consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements, and we employ an effective temporal modeling approach to ensure smooth transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

Authors (6)
  1. Li Hu (27 papers)
  2. Xin Gao (208 papers)
  3. Peng Zhang (642 papers)
  4. Ke Sun (136 papers)
  5. Bang Zhang (33 papers)
  6. Liefeng Bo (84 papers)
Citations (202)

Summary

Overview

Animate Anyone advances character animation with a method that transforms still images into animated videos. The method focuses on keeping the detailed features of a character's appearance accurate and consistent across the generated video sequence, and it leverages diffusion models, which currently lead high-quality image and video generation.

Methodology

At the core of this approach are three key components that ensure consistency, controllability, and continuity (illustrative sketches follow the list):

  • ReferenceNet: captures spatial detail features from the reference image, allowing the system to maintain a consistent appearance for the character throughout the animation.
  • Pose Guider: a lightweight control module that directs the character's movements according to a provided sequence of poses.
  • Temporal Layer: models relationships between frames so that motion stays smooth and continuous across the video.
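
Of these, ReferenceNet's spatial-attention merge is the key consistency mechanism: features extracted from the reference image are concatenated with the denoising UNet's features along the spatial axis before self-attention, and only the target-frame half of the output is kept. The PyTorch sketch below illustrates this pattern under the assumption that both feature maps are flattened to (batch, h*w, dim); the class and argument names are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class ReferenceSpatialAttention(nn.Module):
    """Sketch of the ReferenceNet-style spatial-attention merge.
    In the paper, this pattern replaces the self-attention inside
    each transformer block of the denoising UNet."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, h*w, dim) features of the frame being denoised
        # ref: (batch, h*w, dim) ReferenceNet features of the reference image
        n = x.shape[1]
        # Concatenating along the spatial axis lets self-attention copy
        # fine appearance detail from the reference into the frame.
        merged = self.norm(torch.cat([x, ref], dim=1))
        out, _ = self.attn(merged, merged, merged, need_weights=False)
        # Keep only the target-frame positions; the reference tokens
        # serve as keys/values but are not emitted.
        return x + out[:, :n]
```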

These components operate within a network whose base design and pretrained weights derive from Stable Diffusion. The base is then modified to handle multi-frame inputs and to enhance detail preservation through the spatial attention mechanism sketched above.
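
To handle multi-frame inputs, the temporal layer applies self-attention along the frame axis, independently at each spatial location, in the spirit of AnimateDiff-style motion modules. A minimal sketch, with illustrative names and a residual connection so that a zero-initialized layer leaves the pretrained spatial behavior untouched:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of a temporal layer: attention across frames only,
    applied per spatial position. Names and sizes are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, h*w, dim) features for a short clip
        b, f, n, d = x.shape
        # Fold spatial positions into the batch so attention mixes
        # information across time, never across space.
        t = self.norm(x.permute(0, 2, 1, 3).reshape(b * n, f, d))
        out, _ = self.attn(t, t, t, need_weights=False)
        out = out.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x + out
```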

Training and Evaluation

The training of the model employs a two-stage process:

  1. Initial Training: individual video frames are used as input, without temporal information, to train the model to generate high-quality images consistent with a given reference image and target pose.
  2. Temporal Layer Training: the model is then trained on video clips, with the weights from the first stage fixed, so that the temporal layer learns smooth frame-to-frame continuity (see the sketch after this list).
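
A schematic of this two-stage schedule is below. The `model` wrapper, its `diffusion_loss` method, and the `temporal_layers` attribute are hypothetical stand-ins for the usual noise-prediction objective and module registry, not the authors' API.

```python
def train_stage1(model, frame_loader, optimizer):
    """Stage 1: single frames, no temporal information. The model
    learns reference- and pose-conditioned image generation."""
    for ref_image, target_frame, pose in frame_loader:
        loss = model.diffusion_loss(target_frame, ref=ref_image, pose=pose)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_stage2(model, clip_loader, optimizer):
    """Stage 2: freeze the weights learned in stage 1 and optimize
    only the temporal layers on short video clips."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.temporal_layers.parameters():
        p.requires_grad = True
    for ref_image, clip, pose_sequence in clip_loader:
        loss = model.diffusion_loss(clip, ref=ref_image, pose=pose_sequence)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```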

The method's effectiveness is demonstrated across varied character video clips, including full-body humans and cartoon characters. It also outperforms competing methods on specific benchmarks such as fashion video synthesis and human dance generation: applied to two external datasets, each with its own challenges, it achieves state-of-the-art results.

Limitations and Conclusion

Despite its successes, the method has limitations. It can produce unstable results for quickly moving parts such as hands, and it may struggle to generate parts of the character that are not visible from the reference image's viewpoint. Moreover, because it relies on a denoising diffusion probabilistic model, it is less efficient at inference time than non-diffusion-based approaches.

In summary, this character animation method, named Animate Anyone, introduces a robust framework for producing controllable, consistent, and continuous image-to-video synthesis. It holds promise as a foundation for numerous creative applications in the domain of image-to-video tasks.
