- The paper introduces DynamicFace, a novel method for high-quality and consistent video face swapping leveraging diffusion models and composable 3D facial priors.
- DynamicFace uses four disentangled 3D facial conditions (background, shape normal, expression landmark, identity-removed UV texture) to achieve precise control over face attributes.
- Evaluations demonstrate DynamicFace achieves state-of-the-art results on FaceForensics++, showing superior identity preservation (higher ID Retrieval score) and temporal consistency in videos.
DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors
The paper introduces DynamicFace, a method that addresses the limitations of existing face-swapping techniques, especially in the video domain, by combining diffusion models with composable 3D facial priors. Face swapping transfers the identity of a source face onto a target face while retaining the target's other attributes, such as expression and background. Despite recent advances, current methods often compromise identity preservation and expression detail because identity features from the target inadvertently leak into the swapped result.
DynamicFace pairs a diffusion model with plug-and-play temporal layers for video processing. Using 3D facial priors, the method derives four fine-grained facial conditions that give explicit control over the face-swapping process: the background, a shape-aware normal map, expression-related landmarks, and an identity-removed UV texture map. Because these conditions are disentangled, each attribute can be manipulated precisely, and the decomposition lets the model impose the source identity while preserving the target's non-identity attributes.
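To make the composable-conditioning idea concrete, the sketch below shows one plausible way to encode and fuse the four condition maps for a Stable Diffusion-style latent at 1/8 resolution. The module and parameter names are hypothetical and this is not the authors' released implementation, only an illustration of the pattern described above.

```python
# Illustrative sketch of composable conditioning (hypothetical names;
# not the authors' released code).
import torch
import torch.nn as nn


class ComposableFaceConditioner(nn.Module):
    """Encodes the four disentangled face conditions and fuses them into one
    spatial feature map that can be added to the diffusion UNet's input latents."""

    def __init__(self, cond_channels: int = 3, latent_channels: int = 4, hidden: int = 64):
        super().__init__()

        def make_encoder() -> nn.Sequential:
            # Three stride-2 convs bring a 512x512 condition map down to the
            # 64x64 resolution of a Stable Diffusion latent.
            return nn.Sequential(
                nn.Conv2d(cond_channels, hidden, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, latent_channels, 3, padding=1),
            )

        # One lightweight encoder per condition keeps the signals separate.
        self.background = make_encoder()
        self.shape_normal = make_encoder()
        self.expression_landmark = make_encoder()
        self.identity_free_uv = make_encoder()

    def forward(self, bg: torch.Tensor, normal: torch.Tensor,
                landmark: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
        # Summing per-condition features composes them without entangling
        # identity cues (UV texture) with motion cues (landmarks, normals).
        return (self.background(bg) + self.shape_normal(normal)
                + self.expression_landmark(landmark) + self.identity_free_uv(uv))
```

The fused map would then be added to (or concatenated with) the noisy latent at each denoising step, so the network sees all four signals but can weight them independently.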
Key to DynamicFace's architecture is augmenting Stable Diffusion with temporal attention layers, which extends the image face-swapping pipeline to video. For identity preservation, DynamicFace incorporates a Face Former module and a ReferenceNet, which together embed both high-level and fine-grained identity information, keeping identity consistent across frames and poses.
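A plug-and-play temporal layer of this kind typically attends over the frame axis at each spatial location while leaving the pretrained per-frame weights untouched. The sketch below illustrates that pattern under assumed tensor shapes; it is a generic temporal self-attention block, not the paper's exact layer.

```python
# Minimal sketch of a plug-and-play temporal attention layer used to lift an
# image diffusion UNet to video (shapes and names are assumptions).
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at each
    spatial location, so the pretrained per-frame (image) weights stay intact."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width), as produced by 2D UNet blocks.
        bf, c, h, w = x.shape
        b = bf // num_frames
        # Fold spatial positions into the batch and expose frames as the sequence axis.
        seq = (x.view(b, num_frames, c, h * w)
                .permute(0, 3, 1, 2)
                .reshape(b * h * w, num_frames, c))
        normed = self.norm(seq)
        out, _ = self.attn(normed, normed, normed)
        seq = seq + out  # residual: near-identity attention preserves image behaviour
        return (seq.reshape(b, h * w, num_frames, c)
                   .permute(0, 2, 3, 1)
                   .reshape(bf, c, h, w))
```

Inserting such layers after existing spatial attention blocks, with the residual initialized near zero, is the common recipe for turning a pretrained image model into a temporally consistent video model.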
The method was evaluated on the FaceForensics++ dataset, where it achieves state-of-the-art results in image quality, identity preservation, and expression accuracy. It preserves identity robustly across varied expressions and poses while maintaining high temporal consistency in video face swaps.
Quantitatively, DynamicFace surpasses prior face-swapping models on several metrics: a higher ID Retrieval score indicates stronger identity preservation, while lower pose and expression errors show that it accurately reproduces the target's facial attributes. Video-level evaluations confirm superior temporal consistency, supporting its use in video settings.
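For context, ID Retrieval is conventionally computed by embedding each swapped face with a pretrained face-recognition network (e.g., ArcFace) and checking whether its nearest neighbour among candidate source embeddings is the true source. The sketch below assumes precomputed embeddings and is a generic version of that metric, not the paper's evaluation code.

```python
# Hedged sketch of a conventional ID Retrieval metric: each swapped face's
# embedding is matched against a gallery of source-identity embeddings, and a
# hit is counted when the nearest identity is the true source.
import numpy as np


def id_retrieval(swapped_embeddings: np.ndarray,
                 source_gallery: np.ndarray,
                 source_ids: np.ndarray) -> float:
    """swapped_embeddings: (N, D) embeddings of swapped faces.
    source_gallery: (M, D) embeddings of candidate source identities.
    source_ids: length-N indices of the true source in the gallery."""
    swapped = swapped_embeddings / np.linalg.norm(swapped_embeddings, axis=1, keepdims=True)
    gallery = source_gallery / np.linalg.norm(source_gallery, axis=1, keepdims=True)
    nearest = (swapped @ gallery.T).argmax(axis=1)            # cosine-similarity retrieval
    return float((nearest == np.asarray(source_ids)).mean())  # fraction retrieved correctly
```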
The implications of this research are multi-faceted. Practically, it facilitates face swapping in applications ranging from film production to virtual reality, where maintaining identity integrity over dynamic scenes is critical. Theoretically, it underscores the effectiveness of leveraging 3D facial priors and diffusion models in generative tasks, potentially guiding future developments in AI-driven visual synthesis.
The research points towards future avenues such as refining the disentanglement of facial features for even more precise control, and extending the approach to handle more complex scenes involving multiple faces or varied lighting conditions. The diffusion-based architecture exemplifies a promising direction for addressing the trade-offs between identity preservation and motion fidelity in face-swapping endeavors.