DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation (2401.04747v2)

Published 9 Jan 2024 in cs.SD, cs.AI, cs.CV, cs.GR, and eess.AS

Abstract: We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
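The "uni-directional information flow from expression to gesture" can be read as: the expression branch is conditioned only on speech, while the gesture branch is conditioned on speech plus the expression branch's output, and nothing flows back. The sketch below is a minimal illustration of that wiring; the class names, the simple MLPs standing in for the paper's transformer blocks, and all dimensions are assumptions for illustration, not the authors' architecture.

```python
# A minimal sketch of uni-directional expression-to-gesture conditioning
# inside a joint diffusion denoiser. Class names, MLP stand-ins for the
# paper's transformer blocks, and all dimensions are assumptions;
# timestep embeddings are omitted for brevity.
import torch
import torch.nn as nn

class ExpressionBranch(nn.Module):
    """Denoises expression from (noisy expression, audio) only."""
    def __init__(self, audio_dim=128, expr_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(expr_dim + audio_dim, hidden), nn.GELU(),
            nn.Linear(hidden, expr_dim),
        )

    def forward(self, noisy_expr, audio):
        return self.net(torch.cat([noisy_expr, audio], dim=-1))

class GestureBranch(nn.Module):
    """Denoises gesture from (noisy gesture, audio, expression features).
    The expression branch never sees gesture, so information flows
    strictly one way: expression -> gesture."""
    def __init__(self, audio_dim=128, expr_dim=64, gest_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(gest_dim + audio_dim + expr_dim, hidden), nn.GELU(),
            nn.Linear(hidden, gest_dim),
        )

    def forward(self, noisy_gest, audio, expr_feat):
        return self.net(torch.cat([noisy_gest, audio, expr_feat], dim=-1))

class JointDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.expr = ExpressionBranch()
        self.gest = GestureBranch()

    def forward(self, noisy_expr, noisy_gest, audio):
        expr_out = self.expr(noisy_expr, audio)
        # detach() blocks gradients from the gesture loss into the
        # expression branch, keeping the coupling uni-directional.
        gest_out = self.gest(noisy_gest, audio, expr_out.detach())
        return expr_out, gest_out
```

Conditioning gesture on expression (rather than the reverse, or a fully symmetric coupling) lets gestures adapt to the face without disturbing lip-sync-critical expression prediction, which is one plausible reading of the abstract's "improved matching of joint expression-gesture distributions."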
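The outpainting-based sampling strategy can be sketched as windowed denoising in which the overlap with the previously generated window is clamped, at every reverse step, to a correspondingly noised copy of the known frames, in the spirit of RePaint-style inpainting. The following is a minimal sketch under that assumption; `eps_model`, the noise schedule, and all constants are placeholders, not the authors' implementation.

```python
# A minimal sketch of outpainting-based sampling for arbitrary-length
# diffusion generation via RePaint-style known-region clamping.
# eps_model, the schedule, and all constants are placeholder assumptions.
import torch

T = 1000                            # diffusion steps
WINDOW, OVERLAP, DIM = 120, 30, 64  # frames per window / overlap / motion dim

betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def eps_model(x_t, t, audio_feat):
    """Stand-in for the trained speech-conditioned noise predictor."""
    return torch.zeros_like(x_t)  # placeholder

def q_sample(x0, t, noise):
    """Forward-noise clean frames x0 to diffusion step t."""
    return alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * noise

@torch.no_grad()
def sample_window(audio_feat, known_prefix=None):
    """DDPM ancestral sampling for one window. If known_prefix is given,
    the first OVERLAP frames are clamped at every reverse step to a
    correspondingly noised copy of the previously generated frames, so
    the rest of the window is 'outpainted' from them."""
    x = torch.randn(1, WINDOW, DIM)
    for t in reversed(range(T)):
        eps = eps_model(x, t, audio_feat)
        mean = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
        if known_prefix is not None:
            if t > 0:
                noisy = q_sample(known_prefix, t - 1, torch.randn_like(known_prefix))
            else:
                noisy = known_prefix
            x[:, :OVERLAP] = noisy
    return x

@torch.no_grad()
def sample_long(audio_windows):
    """Chain windows into one sequence; each window conditions on the
    tail of the previous one, so total length is unbounded."""
    out = sample_window(audio_windows[0])
    for feat in audio_windows[1:]:
        nxt = sample_window(feat, known_prefix=out[:, -OVERLAP:])
        out = torch.cat([out, nxt[:, OVERLAP:]], dim=1)
    return out
```

Since each new window costs one fixed-size denoising pass and windows can be produced as audio arrives, generation cost grows linearly with sequence length, consistent with the flexibility and computational efficiency the abstract claims.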
