- The paper introduces a hypernetwork method that refines latent representations to achieve one-shot facial reenactment with minimal artifacts and superior identity preservation.
- It integrates a reenactment module that fuses source appearance with target facial poses, dynamically adapting StyleGAN2 weights for realistic animation.
- Benchmark experiments on VoxCeleb1 and VoxCeleb2 show the strongest identity preservation (cosine similarity) together with competitive FID, LPIPS, APD, and AED scores relative to state-of-the-art methods.
Analysis of "HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces"
The paper introduces a novel approach to neural face reenactment, termed HyperReenact. The primary goal of this work is to generate realistic talking head sequences by taking a source identity and driving it with a target facial pose, which includes both 3D head orientation and facial expressions. The task is particularly challenging in the one-shot setting, where only a single source frame is available and no subject-specific (few-shot) fine-tuning is performed.
Methodology
HyperReenact leverages a pretrained StyleGAN2 generator, known for its photorealistic generation capabilities and its ability to disentangle image attributes. The method begins by inverting real images into StyleGAN2’s latent space. A hypernetwork then refines this inverted representation by performing two crucial tasks: refinement of the source identity characteristics and retargeting of the facial pose. This design removes the dependence on external editing modules, which often introduce visual artifacts, particularly when the source and target head poses diverge substantially.
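To make the inversion step concrete, the sketch below shows optimization-based projection of a single frame into a generator's latent space, with a tiny linear layer standing in for StyleGAN2. The toy generator, plain L2 loss, dimensions, and optimizer settings are illustrative assumptions; practical pipelines typically rely on a learned encoder or a perceptual-loss optimization against the real pretrained generator.

```python
# Minimal sketch of GAN inversion (assumptions: toy linear "generator" and
# plain MSE loss; real systems invert into a pretrained StyleGAN2 using a
# learned encoder or perceptual losses).
import torch

torch.manual_seed(0)
LATENT_DIM, IMG_DIM = 512, 3 * 64 * 64

generator = torch.nn.Linear(LATENT_DIM, IMG_DIM)     # frozen stand-in for StyleGAN2
for p in generator.parameters():
    p.requires_grad_(False)

real_frame = torch.randn(1, IMG_DIM)                 # the single source frame, flattened
w = torch.zeros(1, LATENT_DIM, requires_grad=True)   # latent code to recover
optimizer = torch.optim.Adam([w], lr=0.05)

for step in range(200):
    reconstruction = generator(w)
    loss = torch.nn.functional.mse_loss(reconstruction, real_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final reconstruction error: {loss.item():.4f}")
# The recovered w is the inverted representation that is subsequently
# refined and retargeted, as described above.
```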
The architectural design incorporates two key components, illustrated in the sketch after the list:
- Reenactment Module (RM): A module that fuses appearance features from the source with pose features from the target, enhancing the ability to retain source identity while adopting target expressions.
- Hypernetwork-based Adaptation: Adjusts the weights of the StyleGAN2 generator dynamically, aiding in artifact-free reenactment.
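The sketch below illustrates how these two components can fit together: a fusion module combines source-appearance and target-pose features, and a hypernetwork maps the fused features to per-channel offsets that modulate the weights of a frozen toy generator. The feature dimensions, the concatenate-and-project fusion, and the multiplicative per-channel offsets are assumptions chosen for illustration, not the paper's exact design.

```python
# Illustrative sketch of the two components above. Everything here (feature
# extraction, dimensions, fusion by concatenation, per-channel multiplicative
# weight offsets, batch size 1) is an assumption for demonstration purposes.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM, OUT_DIM = 256, 512, 3 * 64 * 64

class ReenactmentModule(nn.Module):
    """Fuses appearance features (source) with pose features (target)."""
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(2 * FEAT_DIM, FEAT_DIM)
    def forward(self, appearance, pose):
        return torch.relu(self.project(torch.cat([appearance, pose], dim=1)))

class HyperNetwork(nn.Module):
    """Maps fused features to per-layer, per-output-channel weight offsets."""
    def __init__(self, out_channels_per_layer):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(FEAT_DIM, c) for c in out_channels_per_layer])
    def forward(self, fused):
        return [head(fused) for head in self.heads]

class FrozenGenerator(nn.Module):
    """Tiny stand-in for StyleGAN2: two frozen linear layers we can modulate."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(1024, LATENT_DIM) * 0.01, requires_grad=False)
        self.w2 = nn.Parameter(torch.randn(OUT_DIM, 1024) * 0.01, requires_grad=False)
    def out_channels(self):
        return [self.w1.shape[0], self.w2.shape[0]]
    def forward(self, w_latent, offsets):
        # Apply the predicted offsets multiplicatively to the frozen weights
        # (assumes batch size 1), then decode the latent into an image.
        h = torch.relu(w_latent @ (self.w1 * (1 + offsets[0].view(-1, 1))).t())
        return (h @ (self.w2 * (1 + offsets[1].view(-1, 1))).t()).view(-1, 3, 64, 64)

gen = FrozenGenerator()
rm, hyper = ReenactmentModule(), HyperNetwork(gen.out_channels())

appearance_feat = torch.randn(1, FEAT_DIM)   # from a source (identity) encoder
pose_feat = torch.randn(1, FEAT_DIM)         # from a target (pose/expression) encoder
w_inverted = torch.randn(1, LATENT_DIM)      # inverted source latent (previous step)

fused = rm(appearance_feat, pose_feat)       # Reenactment Module: fuse the two streams
offsets = hyper(fused)                       # hypernetwork -> per-layer weight offsets
reenacted = gen(w_inverted, offsets)         # source identity rendered in the target pose
print(reenacted.shape)                       # torch.Size([1, 3, 64, 64])
```

Keeping the generator frozen and predicting only small weight offsets is what allows a single forward pass to adapt the synthesis network to a new source identity, without the per-subject fine-tuning that few-shot approaches require.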
Experimental Results
The experiments benchmark HyperReenact against prominent state-of-the-art methods, including Fast BL and PIRenderer, on the standard VoxCeleb1 and VoxCeleb2 datasets. The results, both quantitative and qualitative, underscore HyperReenact's superiority in generating realistic, identity-preserving facial images with minimal artifacts, even under large head pose variations. Specifically:
- Identity Preservation: HyperReenact achieved the highest cosine similarity scores, reflecting its robust preservation of the source's identity.
- Image Quality: The approach demonstrated competitive FID and LPIPS scores, indicative of high-quality, perceptually convincing outputs.
- Pose Transfer: The method effectively captured and transferred target poses without significant distortion, evidenced by low Average Pose Distance (APD) and Average Expression Distance (AED) metrics.
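For reference, the sketch below shows how the identity- and pose-oriented metrics above are commonly computed per frame pair. The random vectors are placeholders for quantities that a real evaluation would obtain from a pretrained face-recognition network (for cosine similarity) and from a 3D head-pose/expression estimator (for APD and AED); FID and LPIPS come from standard off-the-shelf implementations and are not sketched here. The exact estimators and averaging protocol used in the paper are not reproduced.

```python
# Sketch of per-frame metric computation. The random tensors stand in for
# identity embeddings and pose/expression parameters produced by external
# pretrained estimators (assumed, not reproduced here).
import torch
import torch.nn.functional as F

def csim(src_embed: torch.Tensor, gen_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between source and generated identity embeddings (higher is better)."""
    return F.cosine_similarity(src_embed, gen_embed, dim=-1).mean()

def average_distance(target_params: torch.Tensor, gen_params: torch.Tensor) -> torch.Tensor:
    """Mean L2 distance between target and generated parameter vectors (lower is better);
    applied to head-pose parameters for APD and expression parameters for AED."""
    return torch.linalg.norm(target_params - gen_params, dim=-1).mean()

# Dummy per-frame quantities for a batch of 8 frame pairs.
src_id, gen_id = torch.randn(8, 512), torch.randn(8, 512)     # identity embeddings
tgt_pose, gen_pose = torch.randn(8, 6), torch.randn(8, 6)     # head-pose parameters
tgt_expr, gen_expr = torch.randn(8, 50), torch.randn(8, 50)   # expression parameters

print("CSIM:", csim(src_id, gen_id).item())
print("APD :", average_distance(tgt_pose, gen_pose).item())
print("AED :", average_distance(tgt_expr, gen_expr).item())
```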
Implications and Future Directions
This work underscores the potential of pairing hypernetwork-based adaptation with StyleGAN2's generative capabilities for high-fidelity facial reenactment. Practically, the approach can benefit areas such as virtual reality and facial animation in media. Theoretically, it motivates further exploration of end-to-end learning mechanisms capable of synthesizing complex motion dynamics from minimal samples.
Future research might extend the model to dynamic, spontaneous facial reenactment in real time, for example by integrating temporal coherence constraints. Reducing computational cost while scaling to higher resolutions or more diverse datasets could further broaden the utility of such frameworks.
Overall, HyperReenact presents a substantive advancement in neural face reenactment, bridging the gap between static, identity-preserving frame synthesis and the seamless generation of complex, interactive facial sequences.