The paper "FaceFormer: Speech-Driven 3D Facial Animation with Transformers" introduces an autoregressive Transformer-based model for synthesizing speech-driven 3D facial animation. Animating 3D facial geometry from audio is difficult because of the intricate geometry of human faces and the scarcity of high-quality 3D audio-visual data. The work addresses these challenges by combining a Transformer architecture with self-supervised speech representations and specialized attention mechanisms.
Methodology
- Model Architecture:
- The authors present FaceFormer, an autoregressive Transformer-based model. It encodes long-term audio context and sequentially predicts animated 3D face meshes, exploiting the Transformer's strength at modeling sequence data.
- The encoder leverages wav2vec 2.0, a self-supervised speech model, to extract rich audio features. This model is pretrained on extensive audio datasets, offering generalization abilities despite limited 3D training data.
- Biasing Mechanisms:
- Two novel biased attention mechanisms are introduced: biased cross-modal multi-head attention, which aligns the audio and motion modalities, and biased causal multi-head self-attention, which stabilizes predictions over time when animating longer sequences.
- A periodic positional encoding (PPE) strategy allows generalization to sequences longer than those seen during training, sidestepping a well-known limitation of standard positional encodings in Transformer architectures (a sketch of the PPE and temporal bias follows this list).
- Data Handling:
- To address the common issue of 3D data scarcity, FaceFormer integrates self-supervised speech representations: raw waveforms are encoded by a pretrained wav2vec 2.0 model, so the system remains effective without extensive 3D datasets (a minimal feature-extraction sketch follows this list).
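To make the feature-extraction step concrete, below is a minimal sketch of obtaining wav2vec 2.0 representations with the Hugging Face Transformers library. The checkpoint name, the 16 kHz sampling rate, and the dummy waveform are illustrative assumptions; FaceFormer's actual encoder is integrated into the model (and may be fine-tuned end to end), which this sketch does not reproduce.

```python
# Minimal sketch: extract speech features with a pretrained wav2vec 2.0 encoder.
# The checkpoint and sampling rate are placeholder choices for illustration.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
encoder.eval()

waveform = torch.randn(16000 * 4)  # placeholder: 4 seconds of 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    audio_features = encoder(inputs.input_values).last_hidden_state  # (1, T_audio, 768)
print(audio_features.shape)
```

In the paper, the resulting audio feature sequence is further resampled so that its frame rate matches the 3D motion data before it enters the cross-modal attention.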
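The periodic positional encoding and the temporal bias of the causal self-attention can be sketched as follows. This is an illustrative paraphrase rather than the paper's exact formulation: it assumes the PPE is a standard sinusoidal encoding applied to the frame index modulo a period p, and that the causal bias is an ALiBi-style additive penalty that grows in steps of that same period. The period, dimensions, and the use of torch.nn.MultiheadAttention are placeholder choices; FaceFormer implements its own biased attention layers.

```python
# Sketch of a periodic positional encoding and a period-wise biased causal
# attention mask, under the assumptions stated above (not the exact paper formulas).
import math
import torch

def periodic_positional_encoding(seq_len: int, d_model: int, period: int) -> torch.Tensor:
    """Sinusoidal encoding of (t mod period), so it repeats every `period` frames."""
    positions = torch.arange(seq_len) % period                       # (T,)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions.unsqueeze(1) * div_term)
    pe[:, 1::2] = torch.cos(positions.unsqueeze(1) * div_term)
    return pe                                                        # (T, d_model)

def biased_causal_mask(seq_len: int, period: int) -> torch.Tensor:
    """Additive attention bias: -inf above the diagonal (causality) and an
    ALiBi-style penalty of -floor((i - j) / period) for attended past frames."""
    i = torch.arange(seq_len).unsqueeze(1)                           # query index
    j = torch.arange(seq_len).unsqueeze(0)                           # key index
    bias = -torch.div(i - j, period, rounding_mode="floor").float()
    bias = bias.masked_fill(j > i, float("-inf"))
    return bias                                                      # (T, T), added to attention logits

# Usage example: pass the bias as `attn_mask` to a standard attention layer.
T, d = 12, 64
x = torch.randn(1, T, d) + periodic_positional_encoding(T, d, period=4)
attn = torch.nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
out, _ = attn(x, x, x, attn_mask=biased_causal_mask(T, period=4))
print(out.shape)  # torch.Size([1, 12, 64])
```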
Results
Extensive experiments demonstrate that FaceFormer generates more realistic facial animations and better lip synchronization than existing methods such as VOCA and MeshTalk. The gains in lip synchronization show up as measurable error reductions, as detailed in Table 1, which reports lip vertex error.
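For reference, lip vertex error is commonly defined as the maximal L2 deviation over the lip-region vertices in each frame, averaged over all frames; the sketch below assumes that definition and uses a placeholder lip-vertex index set, so it may differ in detail from the paper's evaluation code.

```python
# Sketch of lip vertex error, assuming the common definition: per-frame maximal
# L2 deviation over lip-region vertices, averaged over frames. The lip-vertex
# index set is dataset-specific and is a placeholder here.
import numpy as np

def lip_vertex_error(pred: np.ndarray, gt: np.ndarray, lip_idx: np.ndarray) -> float:
    """pred, gt: (T, V, 3) vertex sequences; lip_idx: (L,) indices of lip vertices."""
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]       # (T, L, 3)
    per_vertex_l2 = np.linalg.norm(diff, axis=-1)        # (T, L)
    return float(per_vertex_l2.max(axis=-1).mean())      # max over lip vertices, mean over frames
```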
Additionally, perceptual evaluation through user studies indicates that FaceFormer's animations are judged more realistic and better synchronized with the speech than those of competing methods, as reported in Tables 2 and 3.
Implications and Future Directions
The implications of this work are significant for applications in virtual reality, film production, gaming, and education, where realistic 3D facial animation is crucial. By improving the integration of the speech and facial motion modalities, FaceFormer advances the state of the art in speech-driven animation, with potential uses ranging from animated avatars to real-time digital assistants.
For future research, given the computational demands of Transformer models, exploring efficiency optimizations to enable real-time use remains important. Techniques such as Linformer or Longformer could be evaluated to mitigate the quadratic complexity of standard self-attention.
Conclusion
FaceFormer is a compelling approach to speech-driven 3D facial animation, offering a robust framework for synthesizing realistic animations despite limited 3D audio-visual data. By leveraging Transformer architectures and pretrained speech models, it provides a clear path forward in this domain, promising enhanced capabilities for a variety of industrial applications. Further work on the model's efficiency and scalability will benefit practical deployment and broaden the scope of its applications.