EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling (2401.00374v5)

Published 31 Dec 2023 in cs.CV

Abstract: We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the fidelity and diversity of the results. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/
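The abstract describes two core mechanisms: masking body gesture frames so the model learns to reconstruct them, and compositional VQ-VAEs that discretize motion into codebook tokens. The sketch below is not the authors' implementation; it is a minimal illustration of frame masking and a single vector-quantization step, with all names, shapes, and hyperparameters chosen as assumptions for the example.

```python
# Minimal sketch (not the EMAGE codebase) of two ideas from the abstract:
# (1) randomly masking gesture frames for masked gesture reconstruction, and
# (2) a vector-quantization step as used inside a VQ-VAE.
# Shapes and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mask_gesture_frames(gestures: torch.Tensor, mask_ratio: float = 0.5):
    """Zero out a random subset of frames and return the boolean mask.

    gestures: (batch, frames, pose_dim) pose features.
    """
    batch, frames, _ = gestures.shape
    mask = torch.rand(batch, frames, device=gestures.device) < mask_ratio
    masked = gestures.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int = 256, dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous latents from an encoder.
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)   # (N, num_codes)
        indices = dists.argmin(dim=-1)                     # nearest code per latent
        z_q = self.codebook(indices).view_as(z)
        # Codebook + commitment losses; straight-through gradient to the encoder.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1]), loss


if __name__ == "__main__":
    poses = torch.randn(2, 64, 165)        # toy batch: 2 clips, 64 frames each
    masked, mask = mask_gesture_frames(poses)
    vq = VectorQuantizer(dim=165)
    z_q, codes, vq_loss = vq(masked)
    print(masked.shape, codes.shape, float(vq_loss))
```

EMAGE uses four such quantizers compositionally (face, upper body, hands, lower body/global motion); the example shows only one generic quantizer to keep the sketch short.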
