ReliTalk: Relightable Talking Portrait Generation from a Single Video (2309.02434v1)

Published 5 Sep 2023 in cs.CV and cs.GR

Abstract: Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lighted or multi-view data, which are too expensive for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we involve 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then take a crucial part in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using the identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released in https://github.com/arthur-qiu/ReliTalk.

Summary

The paper introduces a novel framework that combines audio-driven portrait synthesis with relighting using single-view videos.
It leverages self-supervised decomposition of facial geometry and reflectance through 3D facial priors driven by audio cues.
Experiments on real and synthetic datasets show improved perceptual and reconstruction metrics, with potential applications in teleconferencing and VR.

Overview of "ReliTalk: Relightable Talking Portrait Generation from a Single Video"

The paper "ReliTalk: Relightable Talking Portrait Generation from a Single Video" presents an approach for generating relightable, audio-driven talking portraits from monocular videos. Audio-driven video avatars have improved considerably in recent years, but existing methods struggle to adapt the created avatars seamlessly to new backgrounds and lighting conditions. Existing relighting techniques, in turn, mostly rely on multi-view or dynamically lit capture data, which is too expensive to acquire for creating video portraits.

The proposed framework, ReliTalk, addresses this gap with a self-supervised method that decomposes reflectance from learned audio-driven facial normals and images. It extends audio-driven portrait generation with relighting while requiring only a single-view video. To do so, it separates face geometry from reflectance, using 3D facial priors derived from audio features.
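To make the idea of separating reflectance from geometry concrete, the following is a minimal per-pixel sketch under a simple Lambertian model. It is not the paper's pipeline: the single directional light, the constant ambient term, and all function names (`lambert_shading`, `render_pixel`, `relight_pixel`) are assumptions chosen for illustration, whereas ReliTalk estimates normals, reflectance, and lighting with learned components.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def lambert_shading(normal, light_dir, ambient=0.1):
    """Scalar diffuse shading: ambient term plus clamped cosine (assumed model)."""
    n = normalize(normal)
    l = normalize(light_dir)
    cosine = sum(a * b for a, b in zip(n, l))
    return ambient + max(0.0, cosine)


def render_pixel(albedo, normal, light_dir):
    """Compose a pixel as RGB albedo (reflectance) times scalar shading."""
    shading = lambert_shading(normal, light_dir)
    return tuple(a * shading for a in albedo)


def relight_pixel(pixel, normal, old_light, new_light):
    """Divide out the old shading to recover albedo, then shade with a new light."""
    old_shading = lambert_shading(normal, old_light)  # > 0 because ambient > 0
    albedo = tuple(p / old_shading for p in pixel)
    return render_pixel(albedo, normal, new_light)


# Usage: a pixel facing the camera, lit from the front, then relit from the side.
observed = render_pixel((0.8, 0.6, 0.5), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
relit = relight_pixel(observed, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

The sketch shows why accurate normals matter: the recovered albedo is only as good as the shading that is divided out, and that shading depends entirely on the normal and the estimated light.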

Key Contributions

The contributions of this paper are multifaceted:

Innovative Framework: This work introduces a novel framework that successfully combines relighting with audio-driven talking portrait generation, requiring only monocular video inputs.
Geometry and Reflectance Decomposition: ReliTalk decouples geometry from reflectance in the portrait images. It predicts fine-grained normal maps with implicit functions, guided by 3D facial priors derived from audio features.
Mesh-aware Audio-to-Expression Translation: Mesh-aware guidance improves the robustness of lip synchronization for talking portraits, particularly for speech content not seen during training.
Simulated Multi-Lighting Conditions: An identity-consistent loss applied under multiple simulated lighting conditions refines the face representation and mitigates the ill-posed problem caused by the limited views available from a single monocular video.
Effective Training Scheme: The authors present a well-structured pipeline and training strategy that enables the architecture to converge effectively, generating high fidelity portraits under unseen lighting conditions.
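One way to picture the identity-consistent loss listed above is as a penalty on how far the identity embedding of each relit render drifts from that of the original frame. The sketch below is an assumption, not the authors' formulation: it presumes some identity encoder has already produced embedding vectors (the encoder itself is not shown), and the name `identity_consistency_loss` and the cosine-distance form are hypothetical choices.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identity_consistency_loss(reference_embedding, relit_embeddings):
    """Mean cosine distance between the reference identity embedding and the
    embedding of each render produced under a simulated lighting condition."""
    distances = [
        1.0 - cosine_similarity(reference_embedding, embedding)
        for embedding in relit_embeddings
    ]
    return sum(distances) / len(distances)


# Usage with toy 2-D embeddings: one relit render keeps the identity direction,
# the other is orthogonal to it.
loss = identity_consistency_loss([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

A loss of zero means every relit render maps to the same identity direction as the reference, which is the behavior such a term would encourage when only one real viewpoint is available.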

Experimental Evaluation

ReliTalk's effectiveness is demonstrated through extensive experiments on both synthetic and real-world datasets. The framework's ability to harmonize the lighting of the foreground portrait with any designated background is validated both qualitatively and quantitatively.

On synthetic datasets, ReliTalk outperforms existing methods in both perceptual and reconstruction quality, with quantitative metrics such as PSNR, SSIM, and LPIPS consistently favoring the proposed framework. On real video datasets, the approach also improves on previous state-of-the-art methods across all evaluated metrics.
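For readers unfamiliar with the reconstruction metrics, PSNR is the simplest of the three and can be computed directly from the mean squared error. The sketch below uses the standard definition on flat lists of pixel values in [0, 1]; the function name `psnr` and the flat-list input format are illustrative assumptions, and SSIM and LPIPS are not shown because they require windowed statistics and a learned network respectively.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Usage: a uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
score = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Higher PSNR indicates a closer pixel-wise match to the ground truth, whereas LPIPS is a distance, so lower values are better.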

Implications and Future Directions

ReliTalk is potentially applicable to teleconferencing, virtual reality, video production, and interactive media, where realistic avatar synthesis is desirable. By avoiding the need for multi-view or dynamically lit capture, the framework could make relightable avatar generation more accessible and efficient.

Looking forward, future research could enhance the reflectance model to handle harder cases such as fuzzy body elements and dynamic accessories like glasses or hats, which introduce additional reflectance challenges. Exploring more sophisticated lighting models could further improve the photorealism of generated talking heads and broaden the range of practical applications.