Learn2Talk: 3D Talking Face Learns from 2D Talking Face (2404.12888v1)
Abstract: Speech-driven facial animation methods generally fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as that on 2D talking face with respect to lip-synchronization (lip-sync) and speech perception. To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking face network by exploiting two areas of expertise from the field of 2D talking face. First, inspired by the audio-video sync network, a 3D lip-sync expert model is devised to pursue lip-sync between audio and 3D facial motion. Second, a teacher model selected from the 2D talking face methods is used to guide the training of the audio-to-3D-motion regression network, yielding higher 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework over state-of-the-art methods in terms of lip-sync, vertex accuracy and speech perception. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.
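To make the training recipe described in the abstract more concrete, below is a minimal PyTorch sketch of how the three signals could be combined: a vertex reconstruction loss against 3D ground truth, a SyncNet-style lip-sync expert loss, and a distillation loss against vertices derived from a 2D talking-face teacher. All module names (AudioToMotionNet, LipSyncExpert3D), layer sizes, vertex counts, loss formulations and weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Learn2Talk-style training objective.
# Architectures, dimensions and loss weights below are assumed for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioToMotionNet(nn.Module):
    """Placeholder audio-to-3D-motion regressor: audio features -> per-frame vertex positions."""
    def __init__(self, audio_dim=768, num_vertices=5023):  # 5023 = FLAME vertex count (assumed template)
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, num_vertices * 3),
        )

    def forward(self, audio_feats):                      # (B, T, audio_dim)
        B, T, _ = audio_feats.shape
        return self.decoder(audio_feats).view(B, T, -1, 3)  # (B, T, V, 3)


class LipSyncExpert3D(nn.Module):
    """Placeholder SyncNet-style expert embedding audio and lip-region motion into a shared space."""
    def __init__(self, audio_dim=768, num_lip_vertices=254, embed_dim=256):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, embed_dim)
        self.lip_enc = nn.Linear(num_lip_vertices * 3, embed_dim)

    def sync_loss(self, audio_feats, lip_motion):
        # audio_feats: (B, T, audio_dim); lip_motion: (B, T, num_lip_vertices, 3)
        a = F.normalize(self.audio_enc(audio_feats.mean(dim=1)), dim=-1)
        v = F.normalize(self.lip_enc(lip_motion.flatten(2).mean(dim=1)), dim=-1)
        # Cosine-similarity BCE, loosely following Wav2Lip-style sync experts (assumed formulation).
        sim = (a * v).sum(dim=-1).clamp(1e-7, 1.0)
        return F.binary_cross_entropy(sim, torch.ones_like(sim))


def training_step(model, expert, audio_feats, gt_vertices, teacher_vertices,
                  lip_idx, w_sync=0.1, w_teacher=0.5):
    """One training step combining the three loss terms; lip_idx selects the lip region."""
    pred = model(audio_feats)                                        # (B, T, V, 3)
    loss_vertex = F.mse_loss(pred, gt_vertices)                      # reconstruction vs. 3D ground truth
    loss_sync = expert.sync_loss(audio_feats, pred[:, :, lip_idx, :])  # lip-sync expert guidance
    loss_teacher = F.mse_loss(pred, teacher_vertices)                # distillation from 2D-teacher-derived vertices
    return loss_vertex + w_sync * loss_sync + w_teacher * loss_teacher
```

In this sketch, `teacher_vertices` would be obtained by reconstructing 3D meshes from the frames produced by a 2D talking-face teacher, and `lip_idx` must index the same number of vertices that `LipSyncExpert3D` was configured for; both are assumptions made here to keep the example self-contained.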