Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control (2312.15900v1)

Published 26 Dec 2023 in cs.CV

Abstract: This study aims to improve the generation of 3D gestures by leveraging multimodal information derived from human speech. Previous studies have incorporated additional modalities to enhance the quality of generated gestures, but these methods perform poorly when certain modalities are missing at inference time. To address this problem, we propose using speech-derived multimodal priors to improve gesture generation. We introduce a novel method that extracts priors from speech and employs them as constraints on gesture generation. Our approach uses a chain-like modeling scheme to generate facial blendshapes, body movements, and hand gestures sequentially. Specifically, we incorporate rhythm cues derived from facial deformation and a stylization prior based on speech emotion into the gesture generation process. By incorporating these multimodal priors, our method improves the quality of generated gestures and eliminates the need for expensive setup preparation during inference. Extensive experiments and user studies confirm that our approach achieves state-of-the-art performance.
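
The chain-like, cascaded design described in the abstract (speech-conditioned generation of facial blendshapes, then body movements, then hand gestures, with each later stage conditioned on the outputs of earlier stages plus a speech-derived style prior) can be illustrated with a minimal sketch. The module layout, feature dimensions, GRU decoders, and names such as CascadedGestureGenerator below are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a cascaded ("chain-like") co-speech gesture generator.
# All dimensions, module choices, and names are assumptions for illustration.
import torch
import torch.nn as nn

class CascadedGestureGenerator(nn.Module):
    def __init__(self, speech_dim=768, style_dim=64, hidden=256,
                 face_dim=52, body_dim=63, hand_dim=90):
        super().__init__()
        # Stage 1: speech + style prior -> facial blendshapes (carries rhythm cues).
        self.face_dec = nn.GRU(speech_dim + style_dim, hidden, batch_first=True)
        self.face_out = nn.Linear(hidden, face_dim)
        # Stage 2: speech + style + generated face -> body motion.
        self.body_dec = nn.GRU(speech_dim + style_dim + face_dim, hidden, batch_first=True)
        self.body_out = nn.Linear(hidden, body_dim)
        # Stage 3: speech + style + face + body -> hand gestures.
        self.hand_dec = nn.GRU(speech_dim + style_dim + face_dim + body_dim, hidden, batch_first=True)
        self.hand_out = nn.Linear(hidden, hand_dim)

    def forward(self, speech_feat, style_prior):
        # speech_feat: (B, T, speech_dim) frame-level speech features
        # style_prior: (B, style_dim) emotion/style embedding derived from speech
        T = speech_feat.size(1)
        style = style_prior.unsqueeze(1).expand(-1, T, -1)
        cond = torch.cat([speech_feat, style], dim=-1)

        face_h, _ = self.face_dec(cond)
        face = self.face_out(face_h)            # rhythm cue fed to later stages

        body_h, _ = self.body_dec(torch.cat([cond, face], dim=-1))
        body = self.body_out(body_h)

        hand_h, _ = self.hand_dec(torch.cat([cond, face, body], dim=-1))
        hand = self.hand_out(hand_h)
        return face, body, hand

if __name__ == "__main__":
    model = CascadedGestureGenerator()
    speech = torch.randn(2, 120, 768)   # e.g. wav2vec 2.0-style features, 120 frames
    style = torch.randn(2, 64)          # assumed speech-emotion embedding
    face, body, hand = model(speech, style)
    print(face.shape, body.shape, hand.shape)
```

Because every prior in this sketch is derived from speech alone, nothing beyond the audio input is needed at inference, which matches the abstract's claim of avoiding extra setup when other modalities are unavailable.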
