- The paper introduces a real-time method that leverages audio and gaze inputs to animate photorealistic avatars without extensive calibration.
- It employs a multimodal VAE architecture with separate modality-specific encoders and a shared latent space to fuse audio and gaze inputs for accurate facial expression reconstruction.
- The approach achieves significant improvements in landmark error rates and lip closure scores, advancing avatar expressivity in virtual environments.
Overview of Audio- and Gaze-driven Facial Animation of Codec Avatars
The paper "Audio- and Gaze-driven Facial Animation of Codec Avatars" presents a method for animating photorealistic avatars using audio and gaze information, targeting deployment on commodity AR/VR hardware. Traditional methods for animating avatars are often limited by intricate hardware setups or require extensive user-specific calibration. This research adopts a novel approach that circumvents such limitations by leveraging multimodal signals—specifically audio and gaze—to animate avatars in real-time, which is significant for applications in virtual reality environments.
Technical Contributions
- Data Collection and Fusion Approach: The authors collected over five hours of high frame rate 3D face scans from three participants. The dataset spans neutral, expressive, and conversational speech, providing a basis for training models that produce natural facial animation. The paper focuses on a multimodal fusion approach built around a Variational Autoencoder (VAE) that dynamically determines how much audio and gaze each contribute to different facial movements, offering flexibility and accuracy in real-time animation.
- Model Architecture: The paper introduces a multimodal VAE structure that combines separate modality-specific encoders for audio and gaze with a shared latent space, allowing the model to preserve and exploit information from both inputs (a minimal sketch of this kind of structure follows the list). By reconstructing the input modalities alongside the facial outputs, the model improves fidelity and produces believable, subtle expressions.
- Real-Time Implementation: A significant contribution is real-time processing with non-linear photorealistic models that capture the complexity of facial texture and geometry, including lip articulation and tongue motion. This is a step beyond prior models that handled geometry only, given the added intricacy of texture variation.
- Mitigating Modality Overfitting: The authors address the tendency of multimodal networks to ignore one input signal by reconstructing the input modalities during training (this objective is included in the sketch below). Their dynamic fusion model shows that this technique improves the expressiveness and authenticity of the animations.
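A minimal sketch of such a multimodal VAE is given below, assuming PyTorch; the layer sizes, simple concatenation fusion, feature dimensions, and loss weights are illustrative assumptions rather than the paper's actual configuration. The auxiliary audio and gaze reconstruction terms illustrate the input-reconstruction idea used to keep both modalities informative.

```python
# Minimal multimodal-VAE sketch (PyTorch). Dimensions and the fusion scheme
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalVAE(nn.Module):
    def __init__(self, audio_dim=80, gaze_dim=2, face_dim=256, latent_dim=128):
        super().__init__()
        # Modality-specific encoders.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, 128))
        self.gaze_enc = nn.Sequential(nn.Linear(gaze_dim, 32), nn.ReLU(), nn.Linear(32, 32))
        # Fusion into a shared latent distribution (mean and log-variance).
        self.fuse_mu = nn.Linear(128 + 32, latent_dim)
        self.fuse_logvar = nn.Linear(128 + 32, latent_dim)
        # Decoder heads: a facial expression code plus reconstructions of both
        # input modalities (the latter discourage ignoring either input).
        self.face_dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, face_dim))
        self.audio_dec = nn.Linear(latent_dim, audio_dim)
        self.gaze_dec = nn.Linear(latent_dim, gaze_dim)

    def forward(self, audio, gaze):
        h = torch.cat([self.audio_enc(audio), self.gaze_enc(gaze)], dim=-1)
        mu, logvar = self.fuse_mu(h), self.fuse_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.face_dec(z), self.audio_dec(z), self.gaze_dec(z), mu, logvar

def loss_fn(face_pred, audio_pred, gaze_pred, face_gt, audio_gt, gaze_gt,
            mu, logvar, w_modality=0.1, w_kl=1e-3):
    """Face reconstruction + input-modality reconstruction + KL regularization."""
    rec_face = F.mse_loss(face_pred, face_gt)
    rec_inputs = F.mse_loss(audio_pred, audio_gt) + F.mse_loss(gaze_pred, gaze_gt)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_face + w_modality * rec_inputs + w_kl * kl
```

In a real system the audio branch would likely consume a short window of spectrogram frames for temporal context, and the face decoder would target the codec avatar's expression code rather than raw geometry; both of these details are assumptions here.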
Results and Evaluation
The model's performance is evaluated on landmark error and lip closure scores, both critical indicators of facial animation quality, showing significant improvements over a conventional regression baseline. The diverse dataset mitigates the difficulty of extrapolating expressions from neutral training data, yielding animations that are both nuanced and vibrant. Moreover, the architecture dynamically weighs gaze and audio signals to capture non-verbal cues such as smiles, a notable advance for avatar expressivity.
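The paper's exact metric definitions are not reproduced here; as a rough illustration only, assuming predicted and ground-truth 3D landmark trajectories, hypothetical upper/lower lip landmark indices, and a millimetre closure threshold, landmark error and lip closure could be scored along these lines.

```python
# Illustrative metric computations (NumPy); the paper's landmark set,
# normalization, and lip-closure scoring may differ.
import numpy as np

def landmark_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth landmarks.

    pred, gt: arrays of shape (frames, num_landmarks, 3), e.g. in millimetres.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def lip_closure_agreement(pred, gt, upper_idx, lower_idx, closed_thresh_mm=1.0):
    """Fraction of ground-truth lip-closure frames where the prediction also
    closes the lips (upper/lower lip landmark gap below a threshold)."""
    gap_pred = np.linalg.norm(pred[:, upper_idx] - pred[:, lower_idx], axis=-1)
    gap_gt = np.linalg.norm(gt[:, upper_idx] - gt[:, lower_idx], axis=-1)
    closed_gt = gap_gt < closed_thresh_mm
    if not closed_gt.any():
        return float("nan")  # no closure frames in the ground truth
    return float((gap_pred[closed_gt] < closed_thresh_mm).mean())
```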
The findings suggest that using mel spectrograms over phoneme representations provides richer audio features for facial animation, capturing subtle lip and tongue movements essential for natural expressions. The inclusion of gaze inputs also contributes to predicting expressions not directly associated with speech.
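To illustrate the mel-spectrogram front end, the snippet below extracts log-mel features with torchaudio; the sample rate, hop length, and 80 mel bands are common speech-processing defaults assumed here, not necessarily the paper's settings, and speech.wav is a placeholder file name.

```python
# Mel-spectrogram audio features (torchaudio). Parameter values are common
# speech-processing defaults, assumed for illustration.
import torch
import torchaudio

melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # assumed input sample rate
    n_fft=1024,
    hop_length=160,     # 10 ms hop at 16 kHz
    n_mels=80,
)

waveform, sr = torchaudio.load("speech.wav")          # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
features = torch.log(melspec(waveform) + 1e-6)        # (channels, 80, frames)
# Each frame (or a short window of frames) can then be fed to the audio encoder.
```

Unlike a discrete phoneme sequence, these frames retain continuous acoustic detail, which is consistent with the paper's finding that mel spectrograms capture subtler lip and tongue motion.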
Implications and Future Directions
This research advances the animation of photorealistic avatars, with implications for improving social communication in virtual environments that lack traditional facial cues. Future directions include integrating video-driven models to enhance lip articulation and investigating facial motion capture in virtual settings to further refine avatar expressivity. The authors also stress ethical considerations, emphasizing controlled access to avatar animation and protection of user privacy and identity.
Overall, this work represents a substantial step toward expressive, natural avatar interaction on accessible AR/VR hardware, and it highlights the value of multimodal inputs for realistic animation. The latent-space modeling and input-reconstruction strategies employed here may also prove useful in other multimodal AI domains.