DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation (2401.04747v2)
Abstract: We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works have focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables unidirectional information flow from expression to gesture, facilitating improved matching of the joint expression-gesture distribution. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution for producing high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
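To make the outpainting-based sampling strategy more concrete, the sketch below shows one common way of chaining overlapping diffusion windows so that a fixed-length model can synthesize arbitrarily long motion: during the reverse process of each new window, the frames that overlap with the previously generated window are repeatedly clamped to re-noised copies of that earlier output, so the new window outpaints smoothly past it. This is a minimal NumPy illustration under assumed names and hyperparameters (`denoise_fn`, `WINDOW`, `OVERLAP`, a linear noise schedule), not DiffSHEG's actual implementation.

```python
# Minimal sketch of outpainting-based long-sequence sampling with a diffusion
# model. All names and hyperparameters below are illustrative assumptions.
import numpy as np

WINDOW, OVERLAP, DIM, T = 32, 8, 16, 50          # frames/window, overlap, feature dim, steps
betas = np.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoise_fn(x_t, t, audio_feat):
    """Placeholder for the learned noise predictor eps_theta(x_t, t, audio)."""
    return np.zeros_like(x_t)                    # stand-in: predicts zero noise

def q_sample(x0, t, rng):
    """Forward diffusion: noise clean frames x0 to step t."""
    return (np.sqrt(alpha_bars[t]) * x0
            + np.sqrt(1.0 - alpha_bars[t]) * rng.standard_normal(x0.shape))

def sample_window(audio_feat, known_prefix, rng):
    """Reverse diffusion for one window; the first OVERLAP frames are clamped
    to the tail of the previous window (the outpainting constraint)."""
    x = rng.standard_normal((WINDOW, DIM))
    for t in reversed(range(T)):
        eps = denoise_fn(x, t, audio_feat)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(x.shape) if t > 0 else 0.0)
        if known_prefix is not None:
            # Replace the overlap with a re-noised copy of the known frames,
            # matched to the current noise level.
            x[:OVERLAP] = q_sample(known_prefix, t - 1, rng) if t > 0 else known_prefix
    return x

def sample_long(audio_windows, rng=np.random.default_rng(0)):
    """Chain windows: each new window outpaints past the previous one's tail."""
    motion, prefix = [], None
    for audio_feat in audio_windows:
        x = sample_window(audio_feat, prefix, rng)
        motion.append(x if prefix is None else x[OVERLAP:])   # drop duplicated overlap
        prefix = x[-OVERLAP:]                                  # constrain the next window
    return np.concatenate(motion, axis=0)

if __name__ == "__main__":
    dummy_audio = [None] * 4                     # four windows of (unused) audio features
    print(sample_long(dummy_audio).shape)        # (32 + 3 * (32 - 8), 16) = (104, 16)
```

In a real system, the placeholder `denoise_fn` would be the trained speech-conditioned expression-and-gesture network, and the overlap length trades off continuity between windows against redundant computation.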