GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance (2312.07385v1)
Abstract: Although existing speech-driven talking face generation methods have achieved significant progress, they remain far from real-world application due to their need for avatar-specific training and their unstable lip movements. To address these issues, we propose GSmoothFace, a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model, which synthesizes smooth lip dynamics while preserving the speaker's identity. GSmoothFace consists mainly of an Audio to Expression Prediction (A2EP) module and a Target Adaptive Face Translation (TAFT) module. Specifically, we first develop the A2EP module to predict expression parameters synchronized with the driving speech. It uses a transformer to capture long-term audio context and learns the parameters from fine-grained 3D facial vertices, yielding accurate and smooth lip synchronization. Afterward, the TAFT module, empowered by Morphology Augmented Face Blending (MAFB), takes the predicted expression parameters and a target video as inputs and modifies the facial region of the target video without distorting the background content. Because TAFT exploits the identity appearance and background context of the target video, it generalizes to different speakers without retraining. Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality. Code and data are available on the project page, where pre-trained models can be requested: https://zhanghm1995.github.io/GSmoothFace.
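The abstract describes a two-stage pipeline: a transformer-based A2EP stage that maps audio features to expression parameters, followed by a TAFT stage that blends the re-rendered facial region back into the target frame. The minimal PyTorch sketch below only illustrates that flow; the class names, feature dimensions (e.g., 768-dimensional wav2vec 2.0-style audio features, 64 expression coefficients), and the simplified blending function are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline summarized in the abstract.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class A2EPSketch(nn.Module):
    """Audio-to-Expression Prediction (simplified): a transformer encoder that
    maps a sequence of per-frame audio features to expression parameters."""

    def __init__(self, audio_dim=768, expr_dim=64, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)      # e.g. wav2vec 2.0-style features
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.expr_head = nn.Linear(d_model, expr_dim)        # per-frame expression coefficients

    def forward(self, audio_feats):                          # (B, T, audio_dim)
        h = self.encoder(self.audio_proj(audio_feats))       # long-term audio context
        return self.expr_head(h)                             # (B, T, expr_dim)


def blend_face(target_frame, rendered_face, face_mask):
    """Simplified stand-in for face blending: paste the re-rendered facial
    region onto the target frame while keeping the background untouched."""
    return face_mask * rendered_face + (1.0 - face_mask) * target_frame


if __name__ == "__main__":
    a2ep = A2EPSketch()
    audio = torch.randn(1, 100, 768)                         # 100 audio frames
    expr = a2ep(audio)                                       # predicted expression parameters
    print(expr.shape)                                        # torch.Size([1, 100, 64])
```

In the paper's full method, the predicted expression parameters drive a 3D face model supervised at the vertex level, and the TAFT generator performs the blending with MAFB rather than a simple masked composite as above.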
Authors: Haiming Zhang, Zhihao Yuan, Chaoda Zheng, Xu Yan, Baoyuan Wang, Guanbin Li, Song Wu, Shuguang Cui, Zhen Li