
AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation (2402.16124v1)

Published 25 Feb 2024 in cs.CV

Abstract: While considerable progress has been made in achieving accurate lip synchronization for 3D speech-driven talking face generation, synthesizing expressive facial detail aligned with the speaker's speaking status remains challenging. Our goal is to directly leverage the style information inherent in human speech to generate an expressive talking face consistent with the speaking status. In this paper, we propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking face generation. The system harnesses the robust contextual reasoning and hallucination capabilities of LLMs to instruct the realistic synthesis of 3D talking faces. Rather than learning facial movements directly from speech, our two-stage strategy first has the LLM comprehend the audio and generate instructions describing expressive facial details that correspond to the speech; a diffusion-based generative network then executes these instructions. This two-stage design, coupled with the LLM, improves model interpretability and gives users the flexibility to inspect the instructions and specify desired operations or modifications. Extensive experiments demonstrate the effectiveness of our approach in producing vivid talking faces with expressive facial movements and consistent emotional status.
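The abstract describes a two-stage pipeline: an LLM first maps speech to a natural-language instruction about expressive facial detail, and a diffusion model then synthesizes 3D face motion conditioned on that instruction. Below is a minimal Python sketch of that control flow only, not the authors' implementation: `comprehend_audio` is a hard-coded stand-in for the LLM stage, the instruction encoder is a toy bag-of-words hash, the sampler is a standard DDPM ancestral update, and the 3D face is reduced to a generic 50-dimensional expression vector (FLAME-style coefficients are one plausible choice). The untrained denoiser produces noise, so the example illustrates wiring, not quality.

```python
import torch
import torch.nn as nn

# --- Stage 1 (stubbed): an LLM turns speech into a textual instruction. ---
# The paper delegates this to an LLM's contextual reasoning; here it is a
# hard-coded placeholder so the sketch runs end to end.
def comprehend_audio(audio: torch.Tensor) -> str:
    return "raise eyebrows slightly, smile with narrowed eyes, upbeat tone"

class InstructionEncoder(nn.Module):
    """Toy text encoder: hashes words into a fixed-size bag-of-words vector."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.dim = dim

    def forward(self, text: str) -> torch.Tensor:
        emb = torch.zeros(self.dim)
        for word in text.split():
            emb[hash(word) % self.dim] = 1.0
        return emb

class Denoiser(nn.Module):
    """Toy conditional denoiser eps_theta(x_t, t, c) over expression params."""
    def __init__(self, x_dim: int = 50, c_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + c_dim + 1, 128),
            nn.SiLU(),
            nn.Linear(128, x_dim),
        )

    def forward(self, x_t, t, cond):
        t_feat = torch.tensor([float(t)])
        return self.net(torch.cat([x_t, cond, t_feat]))

@torch.no_grad()
def sample_expression(denoiser, cond, x_dim=50, steps=50):
    """DDPM-style ancestral sampling with a simple linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(x_dim)  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t / steps, cond)  # predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # e.g. FLAME-style expression coefficients

audio = torch.randn(16000)                 # 1 s of placeholder audio
instruction = comprehend_audio(audio)      # stage 1: speech -> instruction text
cond = InstructionEncoder()(instruction)   # embed the instruction
expression = sample_expression(Denoiser(), cond)  # stage 2: instruction -> face
print(instruction)
print(expression.shape)  # torch.Size([50])
```

One property this structure makes concrete: because the intermediate product of stage 1 is plain text, a user can read, edit, or replace the instruction before stage 2 runs, which is the interpretability and controllability benefit the abstract claims for the two-stage design.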

Authors (5)
  1. Yasheng Sun (12 papers)
  2. Wenqing Chu (16 papers)
  3. Hang Zhou (166 papers)
  4. Kaisiyuan Wang (14 papers)
  5. Hideki Koike (15 papers)
Citations (2)
