A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation (2307.03270v2)
Abstract: Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality, while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short- and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Both the pyramid of audio-visual syncers and the generative models are trained in a low-dimensional space that fully preserves dynamics cues. The experiments show significant improvements over the state-of-the-art in head motion dynamics quality and especially in multi-scale audio-visual synchrony on a collection of benchmark datasets.
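The multi-scale synchrony idea described above can be sketched as follows. This is an illustrative assumption, not the paper's actual loss: it pools audio and motion embedding sequences over several temporal scales and penalizes low cosine similarity at each scale, standing in for the pyramid of trained syncer models.

```python
import numpy as np

def pool(x, scale):
    # Average-pool a (T, D) embedding sequence over non-overlapping
    # windows of length `scale`, yielding a coarser temporal resolution.
    T = (x.shape[0] // scale) * scale
    return x[:T].reshape(-1, scale, x.shape[1]).mean(axis=1)

def multiscale_sync_loss(audio_emb, motion_emb, scales=(1, 2, 4)):
    # Hypothetical multi-scale synchrony loss: 1 - mean cosine similarity
    # between pooled audio and motion embeddings, averaged over scales.
    # Fine scales capture short-term (lip) sync; coarse scales capture
    # longer-term head-motion/speech correlation.
    total = 0.0
    for s in scales:
        a, m = pool(audio_emb, s), pool(motion_emb, s)
        cos = (a * m).sum(axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(m, axis=1) + 1e-8
        )
        total += 1.0 - cos.mean()
    return total / len(scales)

rng = np.random.default_rng(0)
aud = rng.normal(size=(8, 16))  # 8 frames, 16-dim embeddings
print(multiscale_sync_loss(aud, aud))   # perfectly aligned -> near 0
print(multiscale_sync_loss(aud, -aud))  # anti-aligned -> near 2
```

In the paper the syncers are themselves learned networks; the fixed cosine similarity here only illustrates how a single scalar loss can aggregate alignment evidence across temporal scales.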