Style-Preserving Lip Sync via Audio-Aware Style Reference (2408.05412v2)
Abstract: Audio-driven lip sync has recently drawn significant attention due to its widespread applications in the multimedia domain. Different individuals exhibit distinct lip shapes when speaking the same utterance, owing to their unique speaking styles, which poses a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, producing sub-optimal lip sync that conforms to a generic style. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve speaking styles well because their style aggregation is inaccurate. This work proposes an audio-aware style reference scheme that leverages the relationships between the input audio and the reference audio from the style reference video to achieve style-preserving audio-driven lip sync. Specifically, we first develop a Transformer-based model that predicts lip motion corresponding to the input audio, augmented by style information aggregated from the style reference video through cross-attention layers. Then, to render the lip motion into a realistic talking face video, we devise a conditional latent diffusion model that integrates lip motion through modulated convolutional layers and fuses reference facial images via spatial cross-attention layers. Extensive experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
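To make the first stage concrete, below is a minimal PyTorch sketch of an audio-driven lip-motion predictor with audio-aware style aggregation via cross-attention, as described in the abstract. It is an illustration under assumptions, not the authors' implementation: all module names, feature dimensions, and the choice of fusing reference audio with reference lip motion into style tokens are hypothetical, and the second-stage conditional latent diffusion renderer is omitted.

```python
# Minimal sketch (assumed design): a Transformer encodes the driving audio,
# and cross-attention aggregates style from reference tokens built by
# concatenating reference audio features with reference lip-motion coefficients.
import torch
import torch.nn as nn


class AudioAwareStyleLipSyncSketch(nn.Module):
    """Hypothetical first-stage model: audio features -> lip-motion coefficients,
    conditioned on a style reference clip through cross-attention."""

    def __init__(self, audio_dim=512, motion_dim=64, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Style tokens mix reference audio with its lip motion, so attention can
        # relate the input audio to the reference audio ("audio-aware" aggregation).
        self.ref_proj = nn.Linear(audio_dim + motion_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.audio_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Cross-attention: queries from the input audio, keys/values from style tokens.
        self.style_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, audio_feat, ref_audio_feat, ref_motion):
        # audio_feat:     (B, T, audio_dim)  features of the driving audio
        # ref_audio_feat: (B, S, audio_dim)  features of the reference audio
        # ref_motion:     (B, S, motion_dim) lip-motion coeffs of the reference clip
        x = self.audio_encoder(self.audio_proj(audio_feat))
        ref = self.ref_proj(torch.cat([ref_audio_feat, ref_motion], dim=-1))
        style, _ = self.style_attn(query=x, key=ref, value=ref)
        return self.head(x + style)  # (B, T, motion_dim) predicted lip motion


if __name__ == "__main__":
    model = AudioAwareStyleLipSyncSketch()
    out = model(torch.randn(2, 100, 512), torch.randn(2, 80, 512), torch.randn(2, 80, 64))
    print(out.shape)  # torch.Size([2, 100, 64])
```

In this sketch the predicted motion would then be passed, per the abstract, to a conditional latent diffusion model that injects it through modulated convolutions and fuses reference face images via spatial cross-attention; that renderer is not reproduced here.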