LLAniMAtion: LLAMA Driven Gesture Animation (2405.08042v1)

Published 13 May 2024 in cs.HC, cs.AI, cs.CV, cs.GR, and cs.LG

Abstract: Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism, and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content that is encoded in the audio signal. In this paper we instead experiment with using LLM features for gesture generation that are extracted from text using LLAMA2. We compare against audio features, and explore combining the two modalities in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features and that including both modalities yields no significant difference to using LLAMA2 features in isolation. We demonstrate that the LLAMA2 based model can generate both beat and semantic gestures without any audio input, suggesting LLMs can provide rich encodings that are well suited for gesture generation.
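The abstract does not spell out how the LLAMA2 features are obtained. Below is a minimal sketch of one plausible reading: pulling per-token hidden-state activations from a pretrained LLAMA2 checkpoint via Hugging Face transformers. The checkpoint name, the layer index, and the omission of any alignment between token features and the animation frame rate are all assumptions for illustration, not the authors' confirmed pipeline.

```python
# Sketch: per-token LLAMA2 hidden-state features from a transcript.
# Checkpoint, layer choice, and pooling are assumptions, not the paper's
# confirmed configuration. The Llama-2 weights are gated on the HF Hub.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def extract_text_features(transcript: str, layer: int = -1) -> torch.Tensor:
    """Return a (num_tokens, hidden_dim) matrix of LLAMA2 activations.

    A gesture model would typically resample these token-level features to
    the motion frame rate (e.g. using word timestamps); that alignment step
    is omitted here.
    """
    inputs = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple with one (1, seq_len, hidden_dim) tensor
    # per layer; index it to pick the layer used as the feature source.
    return outputs.hidden_states[layer].squeeze(0)

features = extract_text_features("okay so the next point is really important")
print(features.shape)  # (seq_len, 4096) for Llama-2-7B
```

Note that no audio input appears anywhere in this sketch, which mirrors the paper's central finding: text-derived LLAMA2 encodings alone were sufficient to drive both beat and semantic gestures.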

Authors (3)
  1. Jonathan Windle
  2. Iain Matthews
  3. Sarah Taylor
