
Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation (2305.19556v3)

Published 31 May 2023 in cs.CV, cs.AI, cs.SD, eess.AS, and eess.IV

Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face that is accurately synchronized with given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies with its phonetic context. Therefore, modeling lip motion with phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the role of phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement for the target face. CALS consists of an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also examine the extent to which the phonetic context assists lip synchronization and find the effective context window for lip generation to be approximately 1.2 seconds.
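
The abstract describes a two-module pipeline: an Audio-to-Lip encoder pretrained with masked learning to produce contextualized lip motion units, and a Lip-to-Face generator conditioned on those units plus a target identity. Below is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together. The module names, dimensions, masking ratio, and the framing of the ~1.2-second context window are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioToLip(nn.Module):
    """Hypothetical Audio-to-Lip module: a transformer encoder over a window of
    audio features, pretrained by predicting masked frames so that each output
    acts as a contextualized lip motion unit."""
    def __init__(self, audio_dim=80, model_dim=256, num_layers=4, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, model_dim))

    def forward(self, audio_feats, mask=None):
        # audio_feats: (B, T, audio_dim); mask: (B, T) bool, True = hidden frame
        x = self.proj(audio_feats)
        if mask is not None:
            x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.encoder(x)  # (B, T, model_dim) contextualized lip motion units


class LipToFace(nn.Module):
    """Hypothetical Lip-to-Face module: fuses lip motion units with an identity
    embedding from a reference frame and decodes per-frame face crops."""
    def __init__(self, model_dim=256, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.identity_enc = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, model_dim))
        self.decoder = nn.Sequential(
            nn.Linear(2 * model_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size))

    def forward(self, lip_units, ref_frame):
        # lip_units: (B, T, model_dim); ref_frame: (B, 3, H, W)
        B, T, _ = lip_units.shape
        ident = self.identity_enc(ref_frame).unsqueeze(1).expand(-1, T, -1)
        frames = self.decoder(torch.cat([lip_units, ident], dim=-1))
        return frames.view(B, T, 3, self.img_size, self.img_size)


# Example: a ~1.2 s phonetic context window at 25 fps is roughly 30 frames.
audio_to_lip, lip_to_face = AudioToLip(), LipToFace()
audio = torch.randn(2, 30, 80)          # (batch, frames, mel features)
mask = torch.rand(2, 30) < 0.15         # hide ~15% of frames for masked pretraining
ref = torch.randn(2, 3, 64, 64)         # identity reference frame
faces = lip_to_face(audio_to_lip(audio, mask), ref)  # (2, 30, 3, 64, 64)
```

The sketch only illustrates the flow from phonetic context to contextualized lip motion units to face frames; the paper's actual feature extractors, pretraining objective, and generator architecture may differ substantially.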
