
Capture, Learning, and Synthesis of 3D Speaking Styles (1905.03079v1)

Published 8 May 2019 in cs.CV

Abstract: Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.

Citations (306)

Summary

  • The paper introduces VOCA, a novel deep neural framework that separates facial identity from motion to synthesize 3D speaking styles.
  • It leverages the VOCASET dataset of high-resolution 4D facial scans to ensure robust generalization across diverse languages and speakers.
  • The model builds on FLAME and DeepSpeech and exposes animator controls for adjusting speaking style, identity-dependent facial shape, and pose (head, jaw, and eyeball rotations) during animation.

Capture, Learning, and Synthesis of 3D Speaking Styles

The paper "Capture, Learning, and Synthesis of 3D Speaking Styles" presents a comprehensive approach to synthesizing realistic 3D facial animations driven by an audio input. The authors address key challenges in the field of audio-driven facial animation, which include the creation of a robust method that generalizes across various languages, speakers, and individual facial characteristics. To achieve this, the authors propose VOCA, a model that capitalizes on a newly compiled dataset, VOCASET, consisting of high-resolution 4D facial scans synchronized with audio.

The VOCA model distinguishes itself by factoring identity from facial motions, thus enabling the animation of a wide range of adult faces without retargeting. The system utilizes DeepSpeech for audio feature extraction, enhancing robustness to different audio sources and noise. VOCA leverages the FLAME head model to handle variations in facial shape and expression, offering significant flexibility in creating animations that are speaker-independent and capable of simulating distinct speaking styles.
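
To make this data flow concrete, below is a minimal sketch of such an inference pipeline. It is not the released VOCA code: `extract_speech_features` and `voca_model` are hypothetical stand-ins for the DeepSpeech front end and the trained network, and the window size and feature dimension are plausible assumptions chosen for illustration.

```python
# Hedged sketch of the inference flow: speech features are extracted per
# video frame, windowed, and mapped to displacements of a FLAME-topology
# template mesh for the target identity.
import numpy as np

FPS = 60                 # VOCASET capture rate
FEATURE_DIM = 29         # DeepSpeech character-logit dimension (assumption)
WINDOW = 16              # audio-feature frames per video frame (assumption)
N_VERTICES = 5023        # FLAME mesh resolution

def extract_speech_features(audio, sample_rate, n_frames):
    """Stand-in for DeepSpeech feature extraction aligned to video frames."""
    return np.random.randn(n_frames, FEATURE_DIM)

def voca_model(feature_window, template):
    """Stand-in for the trained network: returns per-vertex displacements."""
    return np.zeros_like(template)

def animate(audio, sample_rate, template, duration_s):
    n_frames = int(duration_s * FPS)
    feats = extract_speech_features(audio, sample_rate, n_frames)
    meshes = []
    for t in range(n_frames):
        # Center a fixed-size feature window on frame t (edge frames padded).
        lo = max(0, t - WINDOW // 2)
        window = feats[lo:lo + WINDOW]
        window = np.pad(window, ((0, WINDOW - len(window)), (0, 0)))
        meshes.append(template + voca_model(window, template))
    return np.stack(meshes)                 # (n_frames, N_VERTICES, 3)

template = np.zeros((N_VERTICES, 3))        # neutral FLAME-topology face
audio = np.zeros(16000)                     # 1 s of silence at 16 kHz
sequence = animate(audio, 16000, template, duration_s=1.0)
```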

Key Contributions

  1. VOCASET Dataset: The authors introduce VOCASET, comprising 4D scans and audio from 12 speakers, intended to remedy the lack of comprehensive 3D facial data for speech-to-animation models. This dataset allows for training models that generalize well to unseen subjects and sentences.
  2. Model Architecture: VOCA uses a deep neural encoder-decoder: the encoder maps windowed audio features, conditioned on a subject label, to a low-dimensional embedding, and the decoder transforms this embedding into per-vertex 3D displacements of the template mesh. This design facilitates the separation of speaking style from facial identity (see the sketch after this list).
  3. Animation Controls: The model provides animator controls, allowing for adjustments in speaking style, facial shape, and motion parameters, such as head and eyeball rotations. This adaptability makes it suitable for applications such as virtual reality or in-game videos where pre-recording specific facial animations is impractical.
  4. Robustness and Generalization: By conditioning on subject labels during training and utilizing advanced speech feature extraction, VOCA exhibits strong generalization across different spoken languages and unseen subjects while maintaining realistic facial motion.
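
As flagged in item 2, the following is a hedged sketch of a VOCA-style encoder-decoder; it is not the authors' released model, and the window size, feature dimension, latent size, and subject count are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' code) of a VOCA-style encoder-decoder.
# Assumed shapes: 16-frame windows of 29-dim DeepSpeech character logits,
# 8 training subjects, FLAME-topology meshes with 5023 vertices.
import torch
import torch.nn as nn

N_VERTICES = 5023          # FLAME mesh resolution
AUDIO_WIN, AUDIO_DIM = 16, 29
N_SUBJECTS = 8             # training subjects (assumption)
LATENT_DIM = 50            # low-dimensional embedding size (assumption)

class VocaLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: temporal convolutions over the audio window, conditioned
        # on a one-hot subject label to capture speaking style.
        self.conv = nn.Sequential(
            nn.Conv1d(AUDIO_DIM + N_SUBJECTS, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.to_latent = nn.Linear(64 * (AUDIO_WIN // 4), LATENT_DIM)
        # Decoder: maps the latent code to per-vertex displacements that are
        # added to the neutral template of the target identity.
        self.decoder = nn.Linear(LATENT_DIM, N_VERTICES * 3)

    def forward(self, audio_window, subject_onehot, template_vertices):
        # audio_window: (B, AUDIO_WIN, AUDIO_DIM); subject_onehot: (B, N_SUBJECTS)
        # template_vertices: (B, N_VERTICES, 3) neutral face of the target identity
        b = audio_window.shape[0]
        style = subject_onehot[:, None, :].expand(b, AUDIO_WIN, N_SUBJECTS)
        x = torch.cat([audio_window, style], dim=-1).transpose(1, 2)  # (B, C, T)
        latent = self.to_latent(self.conv(x).flatten(1))
        displacements = self.decoder(latent).view(b, N_VERTICES, 3)
        return template_vertices + displacements   # animated frame

# Example: one animated frame for a random audio window and subject 0.
model = VocaLikeModel()
audio = torch.randn(1, AUDIO_WIN, AUDIO_DIM)
subject = torch.zeros(1, N_SUBJECTS); subject[0, 0] = 1.0
template = torch.zeros(1, N_VERTICES, 3)
frame = model(audio, subject, template)            # (1, 5023, 3)
```

Conditioning the encoder on the subject label is what lets a single network reproduce multiple speaking styles: switching the label at test time changes the style without changing the identity-dependent template shape.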

Implications and Future Directions

The methodology presented in this paper marks a significant advance in automating realistic facial animation from audio, particularly for applications in interactive media and virtual environments. Releasing VOCA and VOCASET for research purposes encourages broader research and development in the area.

However, several aspects warrant further exploration. The model could benefit from incorporating upper-face movements and non-verbal communication cues, which are only weakly correlated with the speech audio. Moreover, models that learn more nuanced emotional expressions without dedicated emotional training data could further enhance conversational realism.

Future work may also focus on expanding the dataset to encompass a wider diversity of languages and facial characteristics, potentially improving the universality and accuracy of the generated animations. Enhancing the model's capability to handle more nuanced speech-related facial dynamics, such as subtle eye and brow movements involved in emotional expressions, is another promising avenue for future research.

In conclusion, the VOCA framework represents a significant milestone in the field of automatic 3D facial animation, promising wide-ranging applications across various domains driven by speech input. By providing both the dataset and model for further research, the authors pave the way for ongoing improvements and innovations within this domain.
