- The paper introduces an enhanced 3D reconstruction method leveraging a large Image-to-Plane model that generalizes well to unseen identities.
- The paper presents a motion adapter that predicts residual motion diff-planes to enable accurate audio-driven and motion-conditioned animations.
- The paper achieves natural full portrait synthesis by separately modeling head, torso, and background with super-resolution to ensure realistic integration.
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis
Real3D-Portrait presents a framework for one-shot 3D talking portrait generation. The paper addresses two shortcomings of existing methods: inaccurate 3D avatar reconstruction from unseen source images and unstable motion during talking-face animation.
The primary contributions of this work are as follows:
- Enhanced 3D Reconstruction: Real3D-Portrait leverages a large Image-to-Plane (I2P) model, pre-trained to distill 3D prior knowledge from a 3D face generative model. This improves the generalizability and quality of 3D reconstruction for new identities: whereas prior methods often overfit to specific identities, the I2P model maintains high fidelity without per-identity training (a minimal sketch follows this list).
- Accurate Motion-Conditioned Animation: A motion adapter, introduced in this work, efficiently morphs the reconstructed 3D representation according to the input condition, such as driving motion or audio. It does so by predicting a residual motion diff-plane conditioned on Projected Normalized Coordinate Code (PNCC) maps of the target expression and pose (see the second sketch below).
- Natural Synthesis of Full Portraits: Unlike previous efforts that focus primarily on the head, this framework models the head, torso, and background separately but composites them cohesively, producing realistic torso movement and switchable backgrounds. This is managed via a Head-Torso-Background Super-Resolution (HTB-SR) model (a compositing sketch follows the list).
- Audio-Driven Generation: A generic audio-to-motion model maps input speech to facial motion, so the same pipeline supports both video-driven and audio-driven talking face generation while maintaining quality across scenarios (see the final sketch below).
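To make the I2P idea concrete, below is a minimal PyTorch sketch of an image-to-tri-plane encoder. The module names, layer sizes, and plane dimensions are illustrative assumptions, not the paper's architecture; the actual I2P model is far larger and is distilled from a pretrained 3D face generative model.

```python
# Minimal sketch of an Image-to-Plane (I2P) style encoder. All names and
# shapes here are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ImageToPlane(nn.Module):
    """Maps a single portrait image to a canonical tri-plane (3 planes)."""
    def __init__(self, plane_channels: int = 32, plane_res: int = 64):
        super().__init__()
        self.plane_channels = plane_channels
        self.plane_res = plane_res
        # Toy convolutional encoder; the real model is far larger.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3 * plane_channels, 3, padding=1),
            nn.AdaptiveAvgPool2d(plane_res),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        b = image.shape[0]
        planes = self.encoder(image)  # (B, 3*C, R, R)
        # Split into the three axis-aligned planes (xy, xz, yz).
        return planes.view(b, 3, self.plane_channels, self.plane_res, self.plane_res)

i2p = ImageToPlane()
canonical_planes = i2p(torch.randn(1, 3, 256, 256))  # (1, 3, 32, 64, 64)
```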
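The motion adapter can be sketched similarly: a small network consumes a PNCC map (an image-like rendering of normalized mesh coordinates) and outputs a residual diff-plane that is added to the canonical tri-plane. Again, all names and shapes below are assumptions for illustration, not the paper's implementation.

```python
# Illustrative motion adapter: predict a residual "diff-plane" from a
# PNCC map and add it to the canonical tri-plane. Shapes are assumptions.
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    def __init__(self, plane_channels: int = 32, plane_res: int = 64):
        super().__init__()
        self.plane_channels = plane_channels
        self.plane_res = plane_res
        # PNCC is an image-like map (3 channels of normalized coordinates),
        # so a 2D CNN can consume it directly.
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * plane_channels, 3, padding=1),
            nn.AdaptiveAvgPool2d(plane_res),
        )

    def forward(self, pncc: torch.Tensor) -> torch.Tensor:
        b = pncc.shape[0]
        diff = self.net(pncc)
        return diff.view(b, 3, self.plane_channels, self.plane_res, self.plane_res)

adapter = MotionAdapter()
canonical_planes = torch.randn(1, 3, 32, 64, 64)  # stand-in canonical tri-plane
pncc = torch.randn(1, 3, 256, 256)                # PNCC map for the target frame
diff_planes = adapter(pncc)                       # residual motion diff-plane
animated_planes = canonical_planes + diff_planes  # morph the static avatar
```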
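For the head-torso-background composition step, a toy version of layered alpha compositing followed by super-resolution might look like the following; the `sr_model` argument stands in for a learned super-resolution network (as in HTB-SR) and is hypothetical.

```python
# Toy composition in the spirit of a Head-Torso-Background pipeline:
# alpha-composite head over torso over a (possibly switched) background
# at low resolution, then super-resolve the result.
import torch
import torch.nn.functional as F

def composite_htb(head_rgba, torso_rgba, background_rgb, sr_model=None):
    """head_rgba, torso_rgba: (B, 4, H, W) with alpha in [0, 1];
    background_rgb: (B, 3, H, W)."""
    torso_over_bg = (torso_rgba[:, :3] * torso_rgba[:, 3:]
                     + background_rgb * (1 - torso_rgba[:, 3:]))
    full = (head_rgba[:, :3] * head_rgba[:, 3:]
            + torso_over_bg * (1 - head_rgba[:, 3:]))
    if sr_model is not None:  # learned super-resolution network
        return sr_model(full)
    # Fallback: naive 4x bilinear upsampling, just to complete the sketch.
    return F.interpolate(full, scale_factor=4, mode="bilinear",
                         align_corners=False)

head = torch.rand(1, 4, 64, 64)
torso = torch.rand(1, 4, 64, 64)
bg = torch.rand(1, 3, 64, 64)
frame = composite_htb(head, torso, bg)  # (1, 3, 256, 256) after 4x upsampling
```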
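Finally, a hedged stand-in for the generic audio-to-motion stage: a temporal network mapping frame-level audio features to per-frame expression coefficients that drive the PNCC renderer. The paper's actual model may be generative (e.g., variational) rather than this simple deterministic GRU, and the feature dimensions are assumptions.

```python
# Sketch of an audio-to-motion model: frame-level audio features
# (e.g., from a pretrained speech encoder) -> per-frame 3DMM expression
# coefficients. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    def __init__(self, audio_dim: int = 1024, exp_dim: int = 64):
        super().__init__()
        self.temporal = nn.GRU(audio_dim, 256, batch_first=True)
        self.head = nn.Linear(256, exp_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, audio_dim) -> (B, T, exp_dim)
        h, _ = self.temporal(audio_feats)
        return self.head(h)

a2m = AudioToMotion()
exp_coeffs = a2m(torch.randn(1, 100, 1024))  # 100 frames of predicted motion
```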
Experimental Results
The paper reports extensive experiments showing that Real3D-Portrait substantially improves on previous one-shot systems in realism and identity preservation. The framework achieves superior results on several metrics, including CSIM (cosine similarity between identity embeddings of the source and generated faces) and FID (Fréchet Inception Distance), compared to state-of-the-art methods. A sketch of the CSIM computation follows.
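As a concrete example of one reported metric, CSIM is typically computed as the cosine similarity between identity embeddings extracted by a pretrained face-recognition network (e.g., ArcFace). The sketch below assumes the embeddings are already available; the embedder itself is not shown.

```python
# Illustrative CSIM computation: cosine similarity between identity
# embeddings of the source portrait and a generated frame.
import torch
import torch.nn.functional as F

def csim(emb_source: torch.Tensor, emb_generated: torch.Tensor) -> torch.Tensor:
    """Both inputs: (B, D) identity embeddings. Returns (B,) similarities."""
    return F.cosine_similarity(emb_source, emb_generated, dim=-1)

# Example with random stand-in embeddings (512-dim, as in common
# face-recognition networks):
score = csim(torch.randn(4, 512), torch.randn(4, 512))
```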
Implications and Future Work
The framework marks a notable step toward realistic 3D talking portrait synthesis, with practical applicability in areas like VR and potential integration into immersive media. However, the paper notes limitations such as degraded performance under extreme head poses, pointing to further work on data augmentation and architecture refinement.
Future explorations might incorporate large-pose datasets and refine the tri-plane representation. Few-shot learning could also be explored to improve adaptability to new identities, boosting both visual quality and identity preservation.
Conclusion
Real3D-Portrait sets a new standard for 3D talking portrait synthesis by combining generalizable avatar reconstruction, accurate motion-conditioned animation, and natural full-portrait rendering. By addressing the limitations of prior work, it lays the groundwork for future innovations and practical applications, marking a significant contribution to AI-driven media synthesis.