Audio-Driven 3D Facial Animation from In-the-Wild Videos (2306.11541v1)
Abstract: Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head. Existing methods are typically trained on limited public 3D datasets that contain only a small number of audio-3D scan pairs, so their generalization capability remains restricted. In this paper, we propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model. The abundance of easily accessible 2D talking-head videos equips our model with robust generalization capability. By combining these videos with existing 3D face reconstruction methods, our model produces consistent and high-fidelity lip synchronization. Our model also captures the speaking styles of different individuals, allowing it to generate 3D talking heads with distinct personal styles. Extensive qualitative and quantitative experiments demonstrate the superiority of our method.
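The core recipe implied by the abstract — run an off-the-shelf monocular 3D face reconstructor over 2D talking-head videos to obtain pseudo ground-truth expression parameters, then train an audio-to-expression model conditioned on a per-speaker style embedding — can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration, not the paper's actual architecture: the dimensions (a wav2vec-2.0-sized audio feature, a FLAME-sized expression vector), the small Transformer encoder, the learned style embedding table, and the velocity smoothness term.

```python
# Minimal sketch: audio features -> per-frame 3D expression parameters,
# supervised by pseudo ground truth reconstructed from 2D video frames.
# Module names, sizes, and losses are illustrative assumptions.
import torch
import torch.nn as nn

AUDIO_DIM = 768   # e.g., wav2vec 2.0 hidden size (assumption)
EXPR_DIM = 53     # e.g., FLAME expression (50) + jaw pose (3) (assumption)
STYLE_DIM = 64    # per-speaker style embedding size (assumption)

class AudioToExpression(nn.Module):
    """Maps a sequence of audio features to per-frame expression
    parameters, conditioned on a learned per-speaker style embedding."""
    def __init__(self, num_speakers: int):
        super().__init__()
        self.style = nn.Embedding(num_speakers, STYLE_DIM)
        self.proj = nn.Linear(AUDIO_DIM + STYLE_DIM, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(256, EXPR_DIM)

    def forward(self, audio_feats, speaker_id):
        # audio_feats: (B, T, AUDIO_DIM); speaker_id: (B,)
        B, T, _ = audio_feats.shape
        style = self.style(speaker_id).unsqueeze(1).expand(B, T, STYLE_DIM)
        x = self.proj(torch.cat([audio_feats, style], dim=-1))
        x = self.temporal(x)          # temporal context across frames
        return self.head(x)           # (B, T, EXPR_DIM)

def training_step(model, optimizer, audio_feats, speaker_id, pseudo_expr):
    """One step against pseudo GT from monocular 3D face reconstruction."""
    pred = model(audio_feats, speaker_id)
    # L2 on parameters plus a velocity term for temporal smoothness
    # (a common choice; whether the paper uses it is an assumption).
    loss = nn.functional.mse_loss(pred, pseudo_expr)
    loss = loss + nn.functional.mse_loss(
        pred[:, 1:] - pred[:, :-1],
        pseudo_expr[:, 1:] - pseudo_expr[:, :-1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-in data (real inputs would be wav2vec features
# and per-frame parameters from a face reconstructor such as EMOCA):
model = AudioToExpression(num_speakers=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(2, 50, AUDIO_DIM)   # 2 clips, 50 frames each
expr = torch.randn(2, 50, EXPR_DIM)     # pseudo GT expression parameters
loss = training_step(model, opt, feats, torch.tensor([0, 1]), expr)
```

The style embedding is what lets a single model reproduce distinct personal speaking styles: at inference, selecting a different speaker ID changes the generated motion while the audio stays fixed.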
Authors: Liying Lu, Tianke Zhang, Yunfei Liu, Xuangeng Chu, Yu Li