Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation (2406.01900v3)

Published 4 Jun 2024 in cs.CV

Abstract: We present Follow-Your-Emoji, a diffusion-based framework for portrait animation, which animates a reference portrait with target landmark sequences. The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity. To address these challenges, Follow-Your-Emoji equips the powerful Stable Diffusion model with two well-designed technologies. Specifically, we first adopt a new explicit motion signal, namely the expression-aware landmark, to guide the animation process. We discover this landmark can not only ensure accurate motion alignment between the reference portrait and target motion during inference but also increase the ability to portray exaggerated expressions (i.e., large pupil movements) and avoid identity leakage. Then, we propose a facial fine-grained loss to improve the model's ability to perceive subtle expressions and reconstruct the reference portrait's appearance by using both expression and facial masks. Accordingly, our method demonstrates significant performance in controlling the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals. By leveraging a simple and effective progressive generation strategy, we extend our model to stable long-term animation, thus increasing its potential application value. To address the lack of a benchmark for this field, we introduce EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. We show extensive evaluations on EmojiBench to verify the superiority of Follow-Your-Emoji.

Citations (20)

Summary

  • The paper introduces expression-aware landmarks and a fine-grained loss to animate portraits precisely while preserving the subject's identity.
  • It employs a progressive generation strategy that interpolates keyframes to ensure temporal stability in extended animations.
  • Extensive experiments on EmojiBench show improvements in metrics such as SSIM and LPIPS, confirming superior animation quality.

Fine-Controllable and Expressive Freestyle Portrait Animation

The paper introduces "Follow-Your-Emoji," a novel framework for portrait animation that leverages a diffusion-based model to achieve precise and expressive control over animated portraits. The approach is particularly noteworthy for its ability to animate a variety of portrait styles, ranging from realistic human images to cartoons, sculptures, and even animal depictions. Traditional challenges in portrait animation, such as preserving the identity of the reference portrait while correctly transferring target expressions and ensuring temporal consistency, are addressed through several key advancements.

Key Contributions

Expression-Aware Landmarks

A significant contribution of this work is the introduction of "expression-aware landmarks." These landmarks are obtained by projecting 3D facial keypoints from MediaPipe onto the 2D plane, ensuring robust motion alignment between the reference portrait and the target expressions. Specifically, the process excludes less reliable facial contours and incorporates pupil points to better capture intricate facial movements. This strategy significantly mitigates identity leakage, an issue where the generated animation diverges from the original identity of the reference portrait.
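
As a rough illustration, the sketch below extracts such landmarks with MediaPipe's FaceMesh, dropping face-oval (contour) points and keeping the refined iris/pupil points. The use of `FACEMESH_FACE_OVAL` as the contour set and the exact filtering are assumptions for illustration; the paper's precise landmark selection and projection may differ.

```python
# A minimal sketch, not the paper's implementation: extract 2D landmarks
# from MediaPipe FaceMesh, dropping face-contour points and keeping the
# refined iris/pupil points (indices 468-477 when refine_landmarks=True).
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh

# Face-oval (contour) indices; the paper reports excluding contours as
# less reliable. FACEMESH_FACE_OVAL is a set of landmark-index pairs.
CONTOUR_IDS = {i for pair in mp_face_mesh.FACEMESH_FACE_OVAL for i in pair}

def expression_aware_landmarks(image_bgr):
    """Return (N, 2) pixel coordinates with contour points removed."""
    h, w = image_bgr.shape[:2]
    rgb = np.ascontiguousarray(image_bgr[:, :, ::-1])  # MediaPipe expects RGB
    with mp_face_mesh.FaceMesh(static_image_mode=True,
                               max_num_faces=1,
                               refine_landmarks=True) as fm:
        result = fm.process(rgb)
    if not result.multi_face_landmarks:
        return None  # no face detected
    landmarks = result.multi_face_landmarks[0].landmark
    points = [(lm.x * w, lm.y * h)                 # project normalized 3D
              for i, lm in enumerate(landmarks)    # keypoints onto the 2D plane
              if i not in CONTOUR_IDS]             # drop unreliable contours
    return np.asarray(points, dtype=np.float32)
```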

Facial Fine-Grained Loss

The paper also presents a novel facial fine-grained (FFG) loss. This loss function enhances the model's sensitivity to subtle expression changes and the detailed appearance of the reference portrait. By employing facial and expression masks—delineating regions around key facial landmarks and overall facial contours—this loss function ensures that the model focuses on these critical areas during the training process. Consequently, the generated animations exhibit high fidelity in terms of both identity preservation and expression accuracy.
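
A minimal sketch of how such a mask-weighted reconstruction loss could look in PyTorch follows; the L1 base term, the fixed weights, and the mask construction are assumptions for illustration, not the paper's exact formulation.

```python
# A minimal sketch, assuming an L1 base term and fixed weights; the paper's
# exact FFG formulation and mask construction may differ.
import torch
import torch.nn.functional as F

def facial_fine_grained_loss(pred, target, face_mask, expr_mask,
                             w_face=1.0, w_expr=2.0):
    """Weight reconstruction error more heavily inside masked facial regions.

    pred, target: (B, C, H, W) images; face_mask, expr_mask: (B, 1, H, W)
    binary masks covering the overall face and key expression regions.
    """
    base = F.l1_loss(pred, target)                          # global term
    face = F.l1_loss(pred * face_mask, target * face_mask)  # appearance term
    expr = F.l1_loss(pred * expr_mask, target * expr_mask)  # expression term
    return base + w_face * face + w_expr * expr
```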

Progressive Generation Strategy

To facilitate long-term animation synthesis, the authors propose a progressive generation strategy. By initially generating keyframes and progressively interpolating intermediate frames, the method ensures temporal stability and coherence over extended animations. This strategy effectively addresses the common problem of temporal artifacts in long animated sequences.
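
The scheduling idea can be illustrated as follows: generate widely spaced keyframes first, then repeatedly fill in midpoints between already-generated frames until every frame is covered. The stride and midpoint rule here are illustrative assumptions rather than the paper's exact procedure.

```python
# A hypothetical coarse-to-fine frame schedule for progressive generation.
def progressive_schedule(num_frames, stride=8):
    """Yield lists of frame indices in the order they would be generated."""
    done = sorted(set(range(0, num_frames, stride)) | {num_frames - 1})
    yield list(done)  # pass 1: sparse keyframes
    while len(done) < num_frames:
        mids = [(a + b) // 2 for a, b in zip(done, done[1:]) if b - a > 1]
        done = sorted(set(done) | set(mids))
        yield mids    # later passes: interpolated in-between frames

# Example: list(progressive_schedule(16, stride=8))
# -> [[0, 8, 15], [4, 11], [2, 6, 9, 13], [1, 3, 5, 7, 10, 12, 14]]
```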

Experimental Validation

The authors validate their approach through extensive experiments on EmojiBench, a new benchmark they introduce. EmojiBench comprises 410 portraits in diverse styles and 45 driving videos with varied facial expressions, enabling comprehensive evaluation of the model's performance across different styles and expressions.

On quantitative metrics such as L1 error, SSIM, LPIPS, and FVD, the method shows significant improvements over existing approaches. Additionally, user study evaluations confirm the superior performance of Follow-Your-Emoji in terms of motion accuracy, identity preservation, and overall animation quality. The model handles exaggerated expressions well, preserves identity even in cross-domain scenarios, and produces high-quality, temporally consistent animations.
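
For reference, per-frame versions of these metrics can be computed with common off-the-shelf implementations, as in the sketch below (scikit-image for SSIM, the `lpips` package for LPIPS); the paper's actual evaluation pipeline is not specified here.

```python
# A minimal evaluation sketch using off-the-shelf metric implementations
# (scikit-image >= 0.19 for SSIM, the `lpips` package for LPIPS).
import lpips
import numpy as np
import torch
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net='alex')  # AlexNet-based perceptual distance

def frame_metrics(pred, target):
    """pred, target: uint8 (H, W, 3) RGB frames. Returns (L1, SSIM, LPIPS)."""
    l1 = np.abs(pred.astype(np.float32) - target.astype(np.float32)).mean()
    s = ssim(pred, target, channel_axis=2)
    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1]
    to_tensor = lambda x: (torch.from_numpy(x).permute(2, 0, 1)[None]
                           .float() / 127.5 - 1.0)
    with torch.no_grad():
        d = lpips_fn(to_tensor(pred), to_tensor(target)).item()
    return l1, s, d
```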

Implications and Future Directions

The implications of this research are substantial, both in practical applications and theoretical advancements. Practically, Follow-Your-Emoji can be vastly beneficial in domains such as virtual conferencing, social media filters, and entertainment, where high-quality, customizable portrait animations are desirable. Theoretically, this work opens new avenues for further research into diffusion-based models for animation and the integration of 3D keypoints for more robust motion representation.

Future research could explore extending this approach to handle full-body animations or leveraging advances in neural rendering to enhance the photorealism of animated portraits further. Additionally, optimizing the computational efficiency of the proposed model would be beneficial for real-time applications.

In summary, the Follow-Your-Emoji framework represents a significant step forward in the field of portrait animation, offering fine-grained control and high expressiveness in generated animations while preserving the identity of reference portraits. Expression-aware landmarks and the facial fine-grained loss are the critical innovations that address longstanding challenges in this domain.