G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment
Abstract: Despite extensive prior work, generating high fidelity talking faces whose lip movements remain tightly synchronized with arbitrary audio is still a significant challenge, and the shortcomings of published methods continue to leave open problems for researchers. This paper introduces G4G, a generic framework for high fidelity talking face generation with fine-grained intra-modal alignment. G4G reenacts the fidelity of the original video while producing lip movements that stay highly synchronized with the driving audio, regardless of its tone or volume. The key to G4G is the use of a diagonal matrix to strengthen the ordinary alignment of audio-image intra-modal features, which significantly sharpens contrastive learning between positive and negative samples. In addition, a multi-scaled supervision module reenacts the perceptual fidelity of the original video across the facial region while emphasizing synchronization between lip movements and the input audio. A fusion network then blends the generated facial region with the rest of the frame. Experimental results demonstrate strong reenactment of the original video quality together with highly synchronized talking lips: G4G is a generic framework that produces talking videos closer to ground truth than current state-of-the-art methods.
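The sketch below illustrates the kind of diagonal-matrix alignment the abstract describes, as one plausible reading: within a batch, the diagonal of the audio-image similarity matrix marks the synchronized (positive) pairs, and all off-diagonal entries serve as negatives for contrastive learning. This is a minimal PyTorch sketch, not the authors' released code; the function name, embedding dimension, and temperature are assumptions for illustration.

```python
# Minimal sketch (assumed, not the authors' implementation) of contrastive
# audio-image alignment in which the diagonal of the similarity matrix
# identifies the matched (positive) pairs within a batch.
import torch
import torch.nn.functional as F

def diagonal_alignment_loss(audio_emb: torch.Tensor,
                            image_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, image_emb: (batch, dim) features from the two encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares audio i with frame j.
    logits = audio_emb @ image_emb.t() / temperature

    # The diagonal holds the synchronized audio-frame pairs.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy pulls diagonal (positive) pairs together
    # and pushes off-diagonal (negative) pairs apart.
    loss_a2i = F.cross_entropy(logits, targets)
    loss_i2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2i + loss_i2a)

if __name__ == "__main__":
    a = torch.randn(8, 256)   # e.g. audio features for 8 clips
    v = torch.randn(8, 256)   # e.g. lip-region image features for the same clips
    print(diagonal_alignment_loss(a, v).item())
```

In such a formulation the identity (diagonal) matrix acts as the alignment target, so every mismatched audio-frame pair in the batch contributes as a negative sample; how G4G's fine-grained variant departs from this standard setup is detailed in the paper itself.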