Towards Variable and Coordinated Holistic Co-Speech Motion Generation (2404.00368v2)

Published 30 Mar 2024 in cs.CV

Abstract: This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars, focusing on two key aspects: variability and coordination. Variability allows the avatar to exhibit a wide range of motions even with similar speech content, while coordination ensures a harmonious alignment among facial expressions, hand gestures, and body poses. We aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements in speech. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs. First, we introduce product quantization (PQ) to the VAE, which enriches the representation of complex holistic motion. Second, we devise a novel non-autoregressive model that embeds 2D positional encoding into the product-quantized representation, thereby preserving the essential structural information of the PQ codes. Finally, we employ a secondary stage to refine the preliminary prediction, further sharpening the high-frequency details. Coupling these three designs enables ProbTalk to generate natural and diverse holistic co-speech motions, outperforming several state-of-the-art methods in qualitative and quantitative evaluations, particularly in terms of realism. Our code and model will be released for research purposes at https://feifeifeiliu.github.io/probtalk/.
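
The central mechanism named in the abstract, product quantization of a VAE latent, can be sketched as follows. This is a minimal illustration under assumed shapes, not the authors' released implementation: the class name `ProductQuantizer`, the latent dimension, group count, and codebook size are all illustrative. Each latent frame is split into G sub-vectors, each quantized against its own small codebook, so the effective vocabulary grows combinatorially (K^G combinations) while each codebook stays compact; a straight-through estimator keeps the encoder trainable (codebook and commitment losses are omitted for brevity).

```python
import torch
import torch.nn as nn

class ProductQuantizer(nn.Module):
    """Product quantization of a latent sequence: split each D-dim frame
    into `groups` sub-vectors and quantize each against its own codebook.
    The effective codebook size grows as codebook_size ** groups, which is
    the property that makes PQ attractive for complex holistic motion."""

    def __init__(self, dim=256, groups=4, codebook_size=512):
        super().__init__()
        assert dim % groups == 0, "latent dim must split evenly into groups"
        self.groups = groups
        self.sub_dim = dim // groups
        # One learnable codebook per group: (groups, K, sub_dim).
        self.codebooks = nn.Parameter(
            torch.randn(groups, codebook_size, self.sub_dim))

    def forward(self, z):
        # z: (batch, time, dim) continuous encoder output.
        B, T, D = z.shape
        z = z.view(B, T, self.groups, self.sub_dim)
        # Squared L2 distance of each sub-vector to its group's codes:
        # (B, T, G, 1, sub) - (G, K, sub) broadcasts to (B, T, G, K, sub).
        dists = ((z.unsqueeze(3) - self.codebooks) ** 2).sum(-1)  # (B,T,G,K)
        codes = dists.argmin(-1)                                  # (B,T,G)
        # Gather the nearest code vector for each group.
        z_q = torch.stack(
            [self.codebooks[g][codes[..., g]] for g in range(self.groups)],
            dim=2)                                                # (B,T,G,sub)
        # Straight-through estimator so gradients flow to the encoder.
        z_q = z + (z_q - z).detach()
        return z_q.view(B, T, D), codes

# Usage on dummy data: 2 clips, 30 motion frames, 256-dim latent.
pq = ProductQuantizer()
z = torch.randn(2, 30, 256)
z_q, codes = pq(z)   # z_q: (2, 30, 256), codes: (2, 30, 4) PQ indices
```

Note that the resulting `codes` for a clip form a (time x group) grid rather than a flat sequence, which is presumably why the paper attaches a 2D positional encoding to the product-quantized representation: a 1D encoding over flattened codes would discard which group each code came from.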
