Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation (2305.19556v3)
Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face that is accurately synchronized with a given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies with its phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement for the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit, which then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also examine the extent to which the phonetic context assists lip synchronization and find the effective context window for lip generation to be approximately 1.2 seconds.
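To make the masked-learning pretraining of the Audio-to-Lip module concrete, the sketch below shows one way such pretraining could look in PyTorch: random audio frames in a context window are masked and their lip motion units are predicted from the surrounding phonetic context. All module names, feature dimensions, and the 30-frame window (roughly 1.2 s at 25 fps) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of masked-learning pretraining for an Audio-to-Lip module.
# Names, dimensions, and the 30-frame context window are assumptions for illustration.
import torch
import torch.nn as nn

class AudioToLip(nn.Module):
    """Maps a window of audio features to contextualized lip motion units."""

    def __init__(self, audio_dim=80, lip_dim=128, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # learned [MASK] embedding
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, lip_dim)  # one lip motion unit per frame

    def forward(self, audio_feats, mask=None):
        # audio_feats: (B, T, audio_dim); mask: (B, T) boolean, True = masked frame
        x = self.in_proj(audio_feats)
        if mask is not None:
            x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x)      # each frame attends to its phonetic context
        return self.out_proj(x)  # (B, T, lip_dim)

def masked_pretrain_step(model, audio_feats, lip_targets, mask_ratio=0.15):
    """One masked-learning step: hide random frames, predict their lip units from context."""
    mask = torch.rand(audio_feats.shape[:2]) < mask_ratio
    pred = model(audio_feats, mask=mask)
    # Reconstruction loss is computed only on the masked positions.
    loss = ((pred - lip_targets) ** 2)[mask].mean()
    return loss

# Illustrative usage with a ~1.2 s window (30 frames at 25 fps):
model = AudioToLip()
audio = torch.randn(8, 30, 80)   # batch of audio feature windows
lips = torch.randn(8, 30, 128)   # target lip motion units
loss = masked_pretrain_step(model, audio, lips)
loss.backward()
```

In this reading, the encoder output at each frame serves as the contextualized lip motion unit that would then condition a separate Lip-to-Face renderer; the masking objective is what forces each unit to absorb information from neighboring phones rather than the current phone alone.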