Generating Holistic 3D Human Motion from Speech (2212.04420v2)
Abstract: This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code will be released for research purposes at https://talkshow.is.tue.mpg.de.
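The abstract's key design choice is a compositional VQ-VAE over body and hand motions, which discretizes continuous pose latents against a learned codebook so that sampling codebook indices yields diverse motions. As a rough illustration (not the paper's implementation; the `quantize` function, codebook size, and latent dimensions below are all assumed for the sketch), the core vector-quantization step of any VQ-VAE maps each encoder latent to its nearest codebook entry:

```python
import numpy as np

def quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry (vector quantization).

    z: (T, D) sequence of encoder latents; codebook: (K, D) learned entries.
    Returns the quantized latents and the chosen discrete indices.
    """
    # Pairwise squared distances between every latent and every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = d.argmin(axis=1)  # index of the nearest entry per time step
    return codebook[idx], idx

# Toy example: a 4-step latent sequence, an 8-entry codebook, 16-dim latents
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
codebook = rng.normal(size=(8, 16))
zq, idx = quantize(z, codebook)
```

The discrete indices `idx` are what an autoregressive model (such as the cross-conditional model the abstract mentions) would predict step by step, with one codebook per part in a compositional setup.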
Authors: Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, Michael J. Black