Overview
Character animation takes a step forward with a method that transforms still character images into animated videos. The method focuses on preserving the detailed features of a character's appearance accurately and consistently across the generated video sequence. It builds on diffusion models, which currently set the standard for high-quality image and video generation.
Methodology
At the core of this approach are key components that ensure consistency, control, and continuity:
- ReferenceNet: This network captures the spatial details from a reference image, allowing the system to maintain a consistent appearance for the character throughout the animation.
- Pose Guider: This acts as a control feature, efficiently directing the character's movements in accordance with a provided sequence of poses.
- Temporal Layer: This layer models relationships across video frames so that motion stays smooth and temporally continuous. (A rough sketch of the pose conditioning and this frame-axis attention follows the list.)
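To make the last two bullets concrete, here is a minimal, hypothetical PyTorch sketch rather than the authors' code: a small convolutional pose encoder whose output can be added to the noise latent, and a temporal layer implemented as self-attention over the frame axis. Class names, channel counts, and the residual placement are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PoseGuider(nn.Module):
    """Lightweight pose encoder (assumed design): downsamples a pose image
    to the latent resolution so its features can be added to the noise latent."""

    def __init__(self, pose_channels=3, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(pose_channels, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 3, padding=1),
        )

    def forward(self, pose_image):
        # (b, 3, H, W) -> (b, latent_channels, H/8, W/8)
        return self.encoder(pose_image)


class TemporalAttention(nn.Module):
    """Temporal layer sketch: self-attention along the frame axis, applied
    independently at every spatial location, with a residual connection."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x, num_frames):
        # x holds a whole clip: (batch * frames, channels, h, w)
        bf, c, h, w = x.shape
        b = bf // num_frames
        # move frames onto the sequence axis: (batch * h * w, frames, channels)
        tokens = x.view(b, num_frames, c, h * w).permute(0, 3, 1, 2)
        tokens = tokens.reshape(b * h * w, num_frames, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended  # residual keeps per-frame content intact
        # restore the original (batch * frames, channels, h, w) layout
        tokens = tokens.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return tokens.reshape(bf, c, h, w)
```

In a full model, a temporal block like this would typically be interleaved with the spatial blocks of the denoising network; exactly where it is inserted is an implementation detail not covered here.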
These components operate within a network that inherits its base architecture and pretrained weights from Stable Diffusion. The base model is then modified to handle multi-frame inputs and to preserve fine appearance details through spatial attention, as sketched below.
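The detail-preserving spatial attention can be pictured as follows: features produced by the reference branch are concatenated with the denoising features along the spatial token axis, self-attention runs over the combined sequence, and only the denoising half of the result is kept. The sketch below is a hedged illustration under those assumptions, with hypothetical names; it is not the released implementation.

```python
import torch
import torch.nn as nn


class ReferenceSpatialAttention(nn.Module):
    """Fuses reference-image features into the denoising stream by
    concatenating both feature maps along the spatial token axis."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, denoise_feat, ref_feat):
        # both inputs: (b, c, h, w) -> token sequences of shape (b, h * w, c)
        b, c, h, w = denoise_feat.shape
        x = denoise_feat.flatten(2).transpose(1, 2)
        r = ref_feat.flatten(2).transpose(1, 2)
        # attend over denoising and reference tokens jointly
        combined = self.norm(torch.cat([x, r], dim=1))   # (b, 2 * h * w, c)
        attended, _ = self.attn(combined, combined, combined)
        # keep only the denoising half, with a residual connection
        x = x + attended[:, : h * w]
        return x.transpose(1, 2).reshape(b, c, h, w)
```

Because queries, keys, and values all come from the concatenated sequence, the denoising tokens can attend directly to the reference tokens, which is how fine appearance details can propagate into every generated frame.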
Training and Evaluation
The model is trained in two stages:
- Initial Training: Individual video frames are used as input, without temporal information, to fine-tune the model's ability to generate high-quality images that are consistent with a given reference image and target pose.
- Temporal Layer Training: Video clips are used as input and the temporal layer is trained on them, teaching the model to produce smooth frame-to-frame transitions. (A skeletal two-stage training loop is sketched below.)
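A skeletal view of this two-stage schedule, under the assumption that the second stage updates only the temporal layers while the rest of the network stays frozen, might look like the following. Every name here (model.encode, model.temporal_layers, the scheduler's add_noise interface, the data loaders) is a hypothetical placeholder rather than the published training code, and each optimizer is assumed to hold only the parameters trainable in its stage.

```python
import torch
import torch.nn.functional as F

NUM_TRAIN_TIMESTEPS = 1000  # assumed DDPM-style noise schedule length


def train_stage_one(model, image_loader, optimizer, scheduler, device="cuda"):
    """Stage 1: single frames, temporal layers left out of the update."""
    model.temporal_layers.requires_grad_(False)               # assumed attribute
    for reference, frame, pose in image_loader:
        latents = model.encode(frame.to(device))              # assumed VAE wrapper
        noise = torch.randn_like(latents)
        t = torch.randint(0, NUM_TRAIN_TIMESTEPS, (latents.shape[0],), device=device)
        noisy = scheduler.add_noise(latents, noise, t)        # assumed interface
        pred = model(noisy, t, reference.to(device), pose.to(device))
        loss = F.mse_loss(pred, noise)                        # standard noise-prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


def train_stage_two(model, video_loader, optimizer, scheduler, device="cuda"):
    """Stage 2: short clips; only the temporal layers are updated, so the
    image-level behaviour learned in stage 1 is preserved."""
    model.requires_grad_(False)
    model.temporal_layers.requires_grad_(True)
    for reference, frames, poses in video_loader:             # frames: (b, f, 3, H, W)
        latents = model.encode(frames.flatten(0, 1).to(device))
        noise = torch.randn_like(latents)
        t = torch.randint(0, NUM_TRAIN_TIMESTEPS, (latents.shape[0],), device=device)
        noisy = scheduler.add_noise(latents, noise, t)
        pred = model(noisy, t, reference.to(device), poses.flatten(0, 1).to(device),
                     num_frames=frames.shape[1])
        loss = F.mse_loss(pred, noise)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```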
The effectiveness of the method is demonstrated on a variety of character videos, including full-body humans and cartoon characters. On two benchmark tasks, fashion video synthesis and human dance generation, each with its own challenges, it outperforms competing methods and achieves state-of-the-art results.
Limitations and Conclusion
Despite its successes, the method has limitations. It can produce less stable results for fast-moving parts such as hands, and it may struggle to generate parts of the character that are not visible in the reference image. Moreover, because it relies on a denoising diffusion probabilistic model with iterative sampling, it is less efficient than non-diffusion-based approaches.
In summary, this character animation method, named Animate Anyone, provides a robust framework for controllable, consistent, and temporally continuous image-to-video synthesis, and it holds promise as a foundation for a wide range of creative image-to-video applications.