- The paper introduces DynamicFace, a novel method for high-quality and consistent video face swapping leveraging diffusion models and composable 3D facial priors.
- DynamicFace uses four disentangled 3D facial conditions (background, shape normal, expression landmark, identity-removed UV texture) to achieve precise control over face attributes.
- Evaluations demonstrate DynamicFace achieves state-of-the-art results on FaceForensics++, showing superior identity preservation (higher ID Retrieval score) and temporal consistency in videos.
DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors
The paper introduces DynamicFace, a method that addresses the limitations of existing face-swapping techniques, especially in the video domain, by combining diffusion models with composable 3D facial priors. Face swapping transfers the identity of a source face onto a target face while retaining the target's other attributes, such as expression and background. Despite recent advances, current methods often compromise identity preservation and expression detail because identity features from the target inadvertently leak into the swapped result.
DynamicFace pairs a diffusion model with plug-and-play temporal layers for video processing. Using 3D facial priors, the method derives four fine-grained facial conditions that give explicit control over the face-swapping process: the background, a shape-aware normal map, expression-related landmarks, and an identity-removed UV texture map. Because these conditions are disentangled, each attribute can be manipulated precisely, and the decomposition lets the model impose the source identity while preserving the target's non-identity attributes.
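To make the composable-conditioning idea concrete, the sketch below shows one plausible way to encode and fuse the four condition maps for a Stable Diffusion-style latent at 1/8 resolution. The module and parameter names are hypothetical and this is not the authors' released implementation, only an illustration of the pattern described above.

```python
# Illustrative sketch of composable conditioning (hypothetical names;
# not the authors' released code).
import torch
import torch.nn as nn


class ComposableFaceConditioner(nn.Module):
    """Encodes the four disentangled face conditions and fuses them into one
    spatial feature map that can be added to the diffusion UNet's input latents."""

    def __init__(self, cond_channels: int = 3, latent_channels: int = 4, hidden: int = 64):
        super().__init__()

        def make_encoder() -> nn.Sequential:
            # Three stride-2 convs bring a 512x512 condition map down to the
            # 64x64 resolution of a Stable Diffusion latent.
            return nn.Sequential(
                nn.Conv2d(cond_channels, hidden, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, latent_channels, 3, padding=1),
            )

        # One lightweight encoder per condition keeps the signals separate.
        self.background = make_encoder()
        self.shape_normal = make_encoder()
        self.expression_landmark = make_encoder()
        self.identity_free_uv = make_encoder()

    def forward(self, bg: torch.Tensor, normal: torch.Tensor,
                landmark: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
        # Summing per-condition features composes them without entangling
        # identity cues (UV texture) with motion cues (landmarks, normals).
        return (self.background(bg) + self.shape_normal(normal)
                + self.expression_landmark(landmark) + self.identity_free_uv(uv))
```

The fused map would then be added to (or concatenated with) the noisy latent at each denoising step, so the network sees all four signals but can weight them independently.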
Key to DynamicFace's architecture is augmenting Stable Diffusion with temporal attention layers, which extends the image face-swapping pipeline to video. For identity preservation, DynamicFace incorporates a Face Former module and a ReferenceNet, which together embed both high-level and fine-grained identity information, keeping identity consistent across frames and poses.
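A plug-and-play temporal layer of this kind typically attends over the frame axis at each spatial location while leaving the pretrained per-frame weights untouched. The sketch below illustrates that pattern under assumed tensor shapes; it is a generic temporal self-attention block, not the paper's exact layer.

```python
# Minimal sketch of a plug-and-play temporal attention layer used to lift an
# image diffusion UNet to video (shapes and names are assumptions).
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at each
    spatial location, so the pretrained per-frame (image) weights stay intact."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width), as produced by 2D UNet blocks.
        bf, c, h, w = x.shape
        b = bf // num_frames
        # Fold spatial positions into the batch and expose frames as the sequence axis.
        seq = (x.view(b, num_frames, c, h * w)
                .permute(0, 3, 1, 2)
                .reshape(b * h * w, num_frames, c))
        normed = self.norm(seq)
        out, _ = self.attn(normed, normed, normed)
        seq = seq + out  # residual: near-identity attention preserves image behaviour
        return (seq.reshape(b, h * w, num_frames, c)
                   .permute(0, 2, 3, 1)
                   .reshape(bf, c, h, w))
```

Inserting such layers after existing spatial attention blocks, with the residual initialized near zero, is the common recipe for turning a pretrained image model into a temporally consistent video model.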
The method was evaluated on the FaceForensics++ dataset, where it achieves state-of-the-art results in image quality, identity preservation, and expression accuracy. It preserves identity robustly across varied expressions and poses while maintaining high temporal consistency in video face swaps.
Quantitatively, DynamicFace surpasses prior face-swapping models on several metrics: a higher ID Retrieval score indicates stronger identity preservation, while lower pose and expression errors show that it accurately reproduces the target's facial attributes. Video-level evaluations confirm superior temporal consistency, supporting its use in video settings.
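For context, ID Retrieval is conventionally computed by embedding each swapped face with a pretrained face-recognition network (e.g., ArcFace) and checking whether its nearest neighbour among candidate source embeddings is the true source. The sketch below assumes precomputed embeddings and is a generic version of that metric, not the paper's evaluation code.

```python
# Hedged sketch of a conventional ID Retrieval metric: each swapped face's
# embedding is matched against a gallery of source-identity embeddings, and a
# hit is counted when the nearest identity is the true source.
import numpy as np


def id_retrieval(swapped_embeddings: np.ndarray,
                 source_gallery: np.ndarray,
                 source_ids: np.ndarray) -> float:
    """swapped_embeddings: (N, D) embeddings of swapped faces.
    source_gallery: (M, D) embeddings of candidate source identities.
    source_ids: length-N indices of the true source in the gallery."""
    swapped = swapped_embeddings / np.linalg.norm(swapped_embeddings, axis=1, keepdims=True)
    gallery = source_gallery / np.linalg.norm(source_gallery, axis=1, keepdims=True)
    nearest = (swapped @ gallery.T).argmax(axis=1)            # cosine-similarity retrieval
    return float((nearest == np.asarray(source_ids)).mean())  # fraction retrieved correctly
```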
The implications of this research are multi-faceted. Practically, it facilitates face swapping in applications ranging from film production to virtual reality, where maintaining identity integrity over dynamic scenes is critical. Theoretically, it underscores the effectiveness of leveraging 3D facial priors and diffusion models in generative tasks, potentially guiding future developments in AI-driven visual synthesis.
The research points towards future avenues such as refining the disentanglement of facial features for even more precise control, and extending the approach to handle more complex scenes involving multiple faces or varied lighting conditions. The diffusion-based architecture exemplifies a promising direction for addressing the trade-offs between identity preservation and motion fidelity in face-swapping endeavors.