ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis (2403.17936v1)

Published 26 Mar 2024 in cs.CV

Abstract: Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures, which align naturally with the audio signal, semantically coherent gestures require modeling the complex interactions between language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures from multi-modal speech inputs but also offers controllability over the synthesis. Our method introduces two guidance objectives that allow users to modulate the impact of the different conditioning modalities (e.g., audio vs. text) and to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained to generate either monologue or conversational gestures. To further advance research on multi-party interactive gestures, we release the DnD Group Gesture dataset, which contains 6 hours of gesture data of 5 people interacting with one another. We compare our method with several recent works and demonstrate the effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at our website.
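
The modality-level control described in the abstract is, in spirit, a multi-condition form of classifier-free guidance. The sketch below is a minimal, hypothetical illustration of that idea; the denoiser interface, argument names, and the two guidance scales are assumptions for illustration, not ConvoFusion's actual formulation. Each modality gets its own scale that weights how strongly it steers the denoised gesture sample.

def guided_noise_estimate(denoiser, x_t, t, audio, text, s_audio=2.0, s_text=4.0):
    # Hypothetical sketch of multi-condition classifier-free guidance.
    # `denoiser` is assumed to be a noise-prediction network that accepts
    # optional audio/text conditions (None = condition dropped, as in
    # standard classifier-free guidance training).
    eps_uncond = denoiser(x_t, t, audio=None, text=None)    # fully unconditional
    eps_audio = denoiser(x_t, t, audio=audio, text=None)    # audio-only condition
    eps_text = denoiser(x_t, t, audio=None, text=text)      # text-only condition
    # Combine: start from the unconditional estimate and add a scaled
    # correction per modality, so each scale modulates that modality's impact.
    return (eps_uncond
            + s_audio * (eps_audio - eps_uncond)
            + s_text * (eps_text - eps_uncond))

In such a scheme, raising s_text relative to s_audio biases generation toward semantically aligned gestures rather than purely beat-aligned ones; the word-level emphasis objective mentioned in the abstract would operate at a finer granularity than this modality-level weighting.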
