Controllable Talking Face Generation by Implicit Facial Keypoints Editing (2406.02880v2)

Published 5 Jun 2024 in cs.CV and cs.AI

Abstract: Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate model architectures that are intricately dependent on each other, complicating the process of re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method to control face expression deformation based on driven audio, which can construct the head pose and facial expression including lip motion for both single image or sequential video inputs in a unified manner. By utilizing a pre-trained video synthesis renderer and proposing the lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over mouth opening shape. Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation demonstrates remarkable generalization capabilities, effectively handling expression deformation across same-ID and cross-ID scenarios, and extending its utility to out-of-domain portraits, regardless of languages. Code is available at https://github.com/NetEase-Media/ControlTalk.

Summary

  • The paper demonstrates that ControlTalk simplifies talking face generation by editing implicit facial keypoints, reducing model complexity.
  • Its novel Audio2Exp module effectively translates audio cues and original expressions into enhanced keypoints for highly accurate lip synchronization.
  • Extensive evaluations on HDTF and MEAD benchmarks show that ControlTalk outperforms existing methods in both lip-sync accuracy and overall image quality.

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

The paper "Controllable Talking Face Generation by Implicit Facial Keypoints Editing" presents ControlTalk, a novel approach for generating talking faces driven by audio input. The research addresses common issues in the domain, such as the complexity of existing models and the difficulty of re-editing image or video inputs. ControlTalk streamlines these processes through a unified framework that operates on both single-image and video inputs.

Introduction to ControlTalk

ControlTalk leverages a pre-trained video synthesis renderer and introduces a lightweight adaptation method, enabling precise and naturalistic lip synchronization. This approach allows for quantitative control over mouth opening shapes and demonstrates superior performance on benchmarks like HDTF and MEAD. By deploying parameterized adaptation, ControlTalk showcases impressive generalization capabilities, effectively handling expression deformation across various scenarios, including same-ID, cross-ID, and out-of-domain portraits, irrespective of the language spoken.

Methodological Insights

Key contributions of this research include:

  1. Simplified Generation Process: ControlTalk edits parameterized facial keypoints for efficient talking face generation, minimizing the complexity typically associated with such tasks.
  2. Flexible Input Handling: The method supports both images and videos as input, enhancing its versatility and potential application scope.
  3. Lip Synchronization and Expression Control: Through a novel Audio2Exp module, ControlTalk achieves accurate lip synchronization by mapping audio features and the original facial expression to enhanced expression keypoints, which are then rendered into talking-face images (see the sketch after this list).
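
This summary does not reproduce the authors' code; the following is a minimal PyTorch-style sketch of what an Audio2Exp-like module could look like. The module name is taken from the paper, but the feature dimensions, network layout, residual formulation, and the lip_scale knob are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Audio2Exp(nn.Module):
    """Hypothetical sketch: maps audio features plus the source expression
    coefficients to edited expression coefficients (dimensions assumed)."""

    def __init__(self, audio_dim=512, exp_dim=45, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + exp_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, exp_dim),
        )

    def forward(self, audio_feat, src_exp, lip_scale=1.0):
        # Predict a residual over the source expression; lip_scale is an
        # assumed knob for the paper's quantitative mouth-opening control.
        delta = self.net(torch.cat([audio_feat, src_exp], dim=-1))
        return src_exp + lip_scale * delta
```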

Technical Framework

The architecture of ControlTalk integrates several components, which the sketch after this list composes end to end:

  • A pre-trained audio encoder extracts speech features.
  • A video synthesis renderer (face-vid2vid) captures face motion and parameterized face rendering.
  • The Audio2Exp module predicts expression coefficients by progressively learning audio-driven expression changes.
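
Composed together, a per-frame generation loop might look like the sketch below. The component interfaces (extract_keypoints, render) and the loop structure are assumptions for illustration, not the released API of ControlTalk or face-vid2vid.

```python
# Hypothetical per-frame generation loop; all component interfaces are
# assumptions, since the released code may structure this differently.
def generate_talking_face(source_image, audio_frames,
                          audio_encoder, audio2exp, renderer, lip_scale=1.0):
    # Extract identity, head pose, and source expression from the input once.
    identity, pose, src_exp = renderer.extract_keypoints(source_image)

    frames = []
    for audio_frame in audio_frames:
        audio_feat = audio_encoder(audio_frame)          # pre-trained speech features
        exp = audio2exp(audio_feat, src_exp, lip_scale)  # edited expression coefficients
        frame = renderer.render(identity, pose, exp)     # face-vid2vid-style synthesis
        frames.append(frame)
    return frames
```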

The training process utilizes a combination of perceptual loss and lip-sync loss to ensure high-quality, synchronized video outputs. The perceptual loss focuses on maintaining overall facial consistency, while the lip-sync loss targets accurate mouth movement alignment with the audio.
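
A weighted combination of the two objectives might be assembled as below; the loss weights and the perceptual_loss / sync_loss helpers (e.g., VGG-feature and SyncNet-style terms) are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of the training objective described above; the lambda weights and
# the perceptual_loss / sync_loss helpers are illustrative assumptions.
def total_loss(pred_frames, gt_frames, audio_feats,
               perceptual_loss, sync_loss,
               lambda_perc=1.0, lambda_sync=0.3):
    # Perceptual term keeps the overall face consistent with the ground truth.
    l_perc = perceptual_loss(pred_frames, gt_frames)
    # Lip-sync term aligns mouth motion with the driving audio.
    l_sync = sync_loss(pred_frames, audio_feats)
    return lambda_perc * l_perc + lambda_sync * l_sync
```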

Experimental Evaluations

Extensive experiments validate the efficacy of ControlTalk over existing methods like Wav2Lip, DINet, DreamTalk, and SadTalker. Key findings include:

  • Performance Metrics: Notable improvements in lip-synchronization accuracy (the Sync metric) alongside comparable image quality (SSIM and FID metrics); a brief sketch of how such frame-level metrics are typically computed follows this list.
  • Qualitative Assessments: ControlTalk generates more realistic mouth shapes and facial expressions, maintaining character identities more accurately than other methods.
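
For context, image-quality metrics such as SSIM are usually computed frame by frame against ground truth, while the Sync metric is typically a confidence score from a pre-trained SyncNet; the snippet below sketches only the SSIM part and assumes aligned uint8 RGB frames, not the paper's exact evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(pred_frames, gt_frames):
    """Average SSIM over aligned RGB frames (uint8 arrays of the same shape)."""
    scores = [
        ssim(p, g, channel_axis=-1)  # per-frame SSIM over color channels
        for p, g in zip(pred_frames, gt_frames)
    ]
    return float(np.mean(scores))
```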

Flexibility and Generalization

ControlTalk demonstrates remarkable adaptability across various testing scenarios:

  • Character Versatility: The method applies to real humans, paintings, generated faces, and cartoon images without necessitating explicit re-training for each category.
  • Language Independence: The model generalizes well to multiple languages, highlighting its robustness beyond the English training dataset.
  • Resolution Scalability: ControlTalk supports higher resolution inputs (e.g., 512x512) by leveraging the flexibility of the pre-trained renderer, showcasing its potential for ultra-high-definition video applications.

Conclusion and Future Directions

ControlTalk represents a significant advancement in the field of audio-driven talking face generation, offering a streamlined yet versatile solution capable of handling diverse inputs and scenarios. Future research could explore further optimization of the Audio2Exp module, integration with real-time video processing frameworks, and expansion into more complex facial dynamics and expressions. The adaptability of ControlTalk to higher resolution and varied character inputs positions it well for a broad range of applications in digital entertainment, virtual conferencing, and beyond.