
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance (2401.15687v2)

Published 28 Jan 2024 in cs.CV and cs.GR

Abstract: The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidances from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.

Citations (14)

Summary

  • The paper introduces a diffusion-based model that synchronizes 3D facial animations with multi-modal inputs including audio, text, and images.
  • It leverages a variational auto-encoder (GNPFA) to decouple facial expressions from identity and compiles the extensive Media2Face Dataset (M2F-D).
  • The model achieves improved lip synchronization and dynamic expression metrics, enabling real-time performance at over 300 frames per second.

Analysis of "Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance"

The paper "Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance" by Qingcheng Zhao et al. presents a novel diffusion-based model designed to generate realistic 3D facial animations synchronized with audio. The authors address challenges related to the scarcity of diverse and high-quality facial animation datasets by introducing an innovative approach that combines generative models with multi-modal data extraction techniques.

To tackle the issue of limited data availability, the paper introduces the Generalized Neural Parametric Facial Asset (GNPFA), a variational auto-encoder. This model serves as a robust representation of facial geometry, mapping it onto a latent space that decouples expression from identity. This disentanglement allows the model to extract high-fidelity facial expressions and accurate head poses from abundant 2D video footage, thereby overcoming the limitations inherent in small-scale 4D scan datasets.
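
The paper does not ship reference code, but the core idea of an expression/identity-decoupled variational auto-encoder can be sketched as follows. The module names, layer sizes, vertex count, and loss weighting below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ExpressionVAE(nn.Module):
    """Minimal sketch of a GNPFA-style VAE: facial geometry (flattened vertex
    offsets) is encoded into an expression latent, while identity enters the
    decoder as a separate code, keeping expression and identity decoupled.
    Vertex count and layer sizes are illustrative assumptions."""

    def __init__(self, n_verts=5023, expr_dim=128, id_dim=64, hidden=1024):
        super().__init__()
        geom_dim = n_verts * 3
        self.encoder = nn.Sequential(
            nn.Linear(geom_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden, expr_dim)
        self.to_logvar = nn.Linear(hidden, expr_dim)
        self.decoder = nn.Sequential(
            nn.Linear(expr_dim + id_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, geom_dim),
        )

    def forward(self, geometry, identity_code):
        h = self.encoder(geometry)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample an expression latent z.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.decoder(torch.cat([z, identity_code], dim=-1))
        return recon, mu, logvar

def vae_loss(recon, target, mu, logvar, kl_weight=1e-4):
    # Reconstruction term plus KL regularizer toward a unit Gaussian prior.
    rec = torch.mean((recon - target) ** 2)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kl
```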

The authors leverage GNPFA to compile the Media2Face Dataset (M2F-D), a diverse and extensive dataset featuring over 60 hours of scan-level 4D facial animations. Notably, this dataset encompasses various emotional states and speaking styles, extending the breadth of expressions beyond existing datasets such as VOCASET and BIWI.
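
To make the dataset's structure concrete, a single clip can be pictured as the record sketched below; the field names, shapes, and frame rate are assumptions for illustration, not the released M2F-D schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class M2FDClip:
    """Hypothetical layout of a single co-speech clip; fields are illustrative."""
    audio: np.ndarray          # raw waveform, shape (num_samples,), e.g. 16 kHz mono
    expr_latents: np.ndarray   # per-frame GNPFA expression latents, shape (T, expr_dim)
    head_pose: np.ndarray      # per-frame head rotation + translation, shape (T, 6)
    emotion: str               # annotated emotion label, e.g. "happy"
    style: str                 # annotated speaking-style label, e.g. "narration"
    fps: float = 30.0          # animation frame rate (assumed)
```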

Media2Face employs a diffusion-based model within the GNPFA latent space to generate 3D facial animations. By integrating multi-modal guidance signals such as audio, text, and images, the system can adaptively modulate facial expressions and head motions in response to diverse inputs. The architecture uses Wav2Vec2 for audio feature extraction and CLIP for encoding text and image prompts, enabling seamless modulation of animation styles. The resulting facial animations exhibit both high fidelity to the speech content and the expressive diversity dictated by the multi-modal cues.
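
A minimal sketch of this conditioning scheme is shown below: a transformer denoiser operates on noisy expression-latent sequences, cross-attends to pre-extracted audio features (e.g. Wav2Vec2 frame features), and is modulated by a pooled CLIP-style text/image embedding. The feature dimensions, fusion strategy, and layer counts are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    """Sketch of a diffusion denoiser in the GNPFA latent space with
    multi-modal guidance; sizes and fusion strategy are assumptions."""

    def __init__(self, expr_dim=128, audio_dim=768, clip_dim=512, d_model=512, n_layers=4):
        super().__init__()
        self.latent_in = nn.Linear(expr_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)   # Wav2Vec2-style frame features
        self.style_in = nn.Linear(clip_dim, d_model)    # pooled CLIP text/image embedding
        self.time_in = nn.Linear(1, d_model)            # diffusion timestep embedding
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, expr_dim)

    def forward(self, noisy_latents, t, audio_feats, style_emb):
        # noisy_latents: (B, T, expr_dim); audio_feats: (B, T_a, audio_dim);
        # style_emb: (B, clip_dim); t: (B, 1) normalized diffusion timestep.
        x = self.latent_in(noisy_latents) + self.time_in(t).unsqueeze(1)
        x = x + self.style_in(style_emb).unsqueeze(1)   # broadcast style over time
        ctx = self.audio_in(audio_feats)
        attn, _ = self.cross_attn(x, ctx, ctx)          # attend to audio context
        x = self.backbone(x + attn)
        return self.out(x)                              # predicted noise (or clean latent)
```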

In terms of quantitative performance, the paper positions Media2Face favorably against existing methods, citing improvements in metrics such as Lip Vertex Error (LVE) and Facial Dynamics Deviation (FDD). These metrics confirm the model's proficiency in generating accurate lip synchronization and dynamic facial expressions.
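
As a rough guide to what these metrics measure, the sketch below follows the definitions commonly used in prior co-speech animation work: LVE as the maximal per-frame L2 error over lip vertices averaged across frames, and FDD as the gap in temporal motion variation over upper-face vertices between ground truth and prediction. The exact vertex masks and normalization used in the paper may differ.

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """LVE: maximal L2 error over lip vertices per frame, averaged over frames.
    pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of lip vertices."""
    per_vertex = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)  # (T, |lip|)
    return per_vertex.max(axis=1).mean()

def facial_dynamics_deviation(pred, gt, upper_idx):
    """FDD: mean difference in temporal standard deviation of per-vertex motion
    over upper-face vertices, comparing ground truth with the prediction."""
    def motion_std(seq):
        # Displacement relative to the first frame is used here as a stand-in
        # for the neutral template (an assumption of this sketch).
        disp = np.linalg.norm(seq[:, upper_idx] - seq[:1, upper_idx], axis=-1)  # (T, |upper|)
        return disp.std(axis=0)                                                 # per-vertex std over time
    return np.mean(motion_std(gt) - motion_std(pred))
```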

Additionally, the paper highlights the capacity of Media2Face for real-time applications, supported by an efficient inference process capable of producing results at over 300 frames per second. The model's competence is further illustrated through its diverse applications, including generating stylized animations from textual and visual style prompts, as well as context-driven animations for dialogue and musical performances.
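
A simple way to check such a throughput claim on one's own hardware is a wall-clock benchmark like the one below; `sample_animation` is a hypothetical stand-in for the model's inference entry point, not an API exposed by the paper.

```python
import time
import torch

@torch.no_grad()
def measure_fps(sample_animation, audio_features, num_frames, runs=10):
    """Rough frames-per-second estimate for an animation sampler.
    `sample_animation` is a hypothetical callable returning `num_frames` frames."""
    sample_animation(audio_features, num_frames)          # warm-up (kernel/JIT init)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        sample_animation(audio_features, num_frames)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return num_frames * runs / elapsed
```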

Future implications of this research point to its potential applications in virtual reality, human-computer interaction, and the gaming industry, where realistic and adaptive facial animation is crucial. The latent representation from GNPFA also suggests new directions for personalized avatars and interactive virtual companions, expanding potential use cases in personalized entertainment and virtual communication platforms.

Overall, "Media2Face" stands as a significant contribution to the domain of 3D facial animation, merging the capabilities of diffusion models and multi-modal data understanding to achieve sophisticated and lifelike animation results. As multi-modal guidance techniques mature, future iterations of this research may further refine the synthesis of natural animations, pointing to broader advancements in AI-driven virtual experiences.
