- The paper's main contribution is the StyleTalk framework, which pairs a transformer-based style encoder with a style-controllable decoder to generate realistic talking-head videos with customizable speaking styles.
- The framework models the upper and lower facial components through separate pathways that reflect their distinct temporal dynamics, keeping lip movements and facial expressions synchronized, accurate, and natural.
- Empirical evaluations demonstrate that StyleTalk outperforms existing methods in identity preservation, facial expression fidelity, and audio-visual alignment.
One-Shot Talking Head Generation with the StyleTalk Framework
The paper "StyleTalk: One-Shot Talking Head Generation with Controllable Speaking Styles" introduces a framework designed to enhance the generation of talking head videos by rendering diverse speaking styles. This approach aims to overcome limitations in existing one-shot talking head generation methods that primarily focus on lip synchronization, natural facial expressions, and head motion stability, but lack the ability to create varied speaking styles among different personas.
Methodology and Framework
StyleTalk Framework Components:
- Style Encoder: This component extracts dynamic facial motion patterns from a reference video, encoding them into a style code. Utilizing a transformer-based encoder, this module effectively models temporal dynamics, a critical advancement over prior methods that employ static, frame-by-frame expression transfer.
- Style-Controllable Decoder: A pivotal component is the style-controllable dynamic decoder, which synthesizes facial animation from the audio content and the style code. It integrates a style-aware adaptive transformer that conditions its layers on the style code, so that the generated output reflects the specified speaking style.
- Audio Encoder and Image Renderer: These components respectively keep lip movements synchronized with the input audio and convert the predicted facial expressions into photo-realistic frames. A dual pathway for the upper and lower facial components, reflecting their distinct temporal dynamics, marks a significant step toward more nuanced and authentic expressions. A minimal sketch of these components appears after this list.
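The component descriptions above map naturally onto a small model sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the paper's implementation: the input representation (per-frame expression coefficients), the tensor dimensions, the mean pooling of the style sequence, and the FiLM-style scale/shift modulation standing in for the style-aware adaptive transformer are all illustrative choices.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Transformer encoder that pools a sequence of per-frame expression
    coefficients from a style reference clip into a single style code.
    Dimensions are illustrative, not the paper's exact configuration."""
    def __init__(self, expr_dim=64, d_model=256, style_dim=256, n_layers=3, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(expr_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, style_dim)

    def forward(self, expr_seq):                  # (B, T, expr_dim)
        h = self.encoder(self.in_proj(expr_seq))  # (B, T, d_model)
        return self.out_proj(h.mean(dim=1))       # temporal pooling -> (B, style_dim)


class StyleModulatedFFN(nn.Module):
    """Feed-forward block whose activations are scaled/shifted by the style
    code, a stand-in for the paper's style-aware adaptive layers."""
    def __init__(self, d_model=256, style_dim=256, hidden=512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_model))
        self.to_scale_shift = nn.Linear(style_dim, 2 * d_model)

    def forward(self, x, style):                  # x: (B, T, d_model), style: (B, style_dim)
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        return self.ffn(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class StyleControllableDecoder(nn.Module):
    """Maps audio features plus a style code to per-frame expression coefficients."""
    def __init__(self, audio_dim=128, d_model=256, expr_dim=64, style_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mod_ffn = StyleModulatedFFN(d_model, style_dim)
        self.out = nn.Linear(d_model, expr_dim)

    def forward(self, audio_feats, style):        # audio_feats: (B, T, audio_dim)
        h = self.audio_proj(audio_feats)
        h = h + self.attn(h, h, h)[0]             # self-attention over time
        h = h + self.mod_ffn(h, style)            # style-conditioned feed-forward
        return self.out(h)                        # predicted expression sequence


# Illustrative forward pass
enc, dec = StyleEncoder(), StyleControllableDecoder()
style = enc(torch.randn(2, 90, 64))               # ~3 s style clip at 30 fps
expr = dec(torch.randn(2, 90, 128), style)        # (2, 90, 64)
```

In the full framework, the predicted expression sequence would then be passed, together with the reference image, to the image renderer to produce the output video frames.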
Results and Implications
The results show that StyleTalk produces photo-realistic videos with varied speaking styles from a single reference image and an audio clip. Empirical evaluations indicate superior performance over existing methods in both subjective assessments and objective metrics such as facial landmark distance (F-LMD), mouth landmark distance (M-LMD), and SyncNet confidence scores.
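As a rough illustration of how landmark-distance metrics of this kind are typically computed: F-LMD averages the distance between facial landmarks in generated and ground-truth frames, and M-LMD restricts the measure to the mouth region. The snippet below is a generic sketch, not the paper's evaluation code; the 68-point landmark convention, the mouth indices 48-67, and the absence of per-face normalization are assumptions.

```python
import numpy as np

def landmark_distance(pred_lmks, gt_lmks, indices=None):
    """Mean Euclidean distance between predicted and ground-truth 2D landmarks.

    pred_lmks, gt_lmks: arrays of shape (num_frames, 68, 2), using the common
    68-point layout (an assumption; the paper may use a different detector).
    indices: optional subset of landmark indices, e.g. the mouth region.
    """
    if indices is not None:
        pred_lmks = pred_lmks[:, indices]
        gt_lmks = gt_lmks[:, indices]
    per_point = np.linalg.norm(pred_lmks - gt_lmks, axis=-1)  # (frames, points)
    return float(per_point.mean())

# F-LMD over all landmarks; M-LMD over the mouth region (points 48-67).
pred = np.random.rand(100, 68, 2)
gt = np.random.rand(100, 68, 2)
f_lmd = landmark_distance(pred, gt)
m_lmd = landmark_distance(pred, gt, indices=list(range(48, 68)))
```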
Key Findings:
- Accuracy and Fidelity: StyleTalk demonstrates superior identity preservation, realistic facial expression generation, and enhanced lip synchronization compared to existing methods.
- Diverse Style Embedding: The framework supports style variability, leading to convincing representations across different speaking styles while maintaining audio-visual alignment.
These advances imply significant potential in practical applications such as virtual human creation, where personalized avatars can exhibit tailored expressive styles. Additionally, the framework's ability to interpolate between styles enables customizable emotional intensities and the generation of new speaking patterns.
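Style interpolation of this kind typically amounts to blending two style codes before decoding. The snippet below sketches the idea under assumptions: the 256-dimensional style code and the simple linear blending are illustrative choices, and `interpolate_styles` is a hypothetical helper rather than part of the paper's code.

```python
import torch

def interpolate_styles(style_a: torch.Tensor, style_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly blend two style codes; alpha=0 returns style_a, alpha=1 returns style_b."""
    return (1.0 - alpha) * style_a + alpha * style_b

# Placeholder 256-dim style codes (the dimension is an assumption).
neutral = torch.randn(1, 256)
expressive = torch.randn(1, 256)

for alpha in (0.25, 0.5, 0.75):
    blended = interpolate_styles(neutral, expressive, alpha)
    # `blended` would replace a single reference style code as input to the
    # style-controllable decoder, dialing the expressive intensity up or down.
```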
Conclusion and Future Directions
This research outlines a structured approach to integrating personalized dynamics into one-shot talking head generation, representing a substantial step forward in AI-based video synthesis. Future work may explore broadening the training data, refining the granularity of style interpolation, or extending style control to more nuanced emotional and contextual cues.
Overall, StyleTalk constitutes an effective solution for producing stylized talking-head videos, with promising applications in both entertainment and human-computer interaction.