From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (2401.01885v1)
Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key to our method is combining the sample diversity of vector quantization with the high-frequency detail afforded by diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g., sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show that our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.
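To make the VQ-plus-diffusion combination in the abstract concrete, here is a minimal sketch of the two-stage idea: a codebook-based stage samples diverse coarse "guide" poses from audio, and a conditional denoiser adds high-frequency motion detail. All module names, feature dimensions, and the single-MLP denoiser are hypothetical simplifications for illustration, not the authors' architecture.

```python
# Hypothetical sketch: VQ sampling for diversity + diffusion denoising for detail.
import torch
import torch.nn as nn

class VQGuidePoseSampler(nn.Module):
    """Stage 1 stand-in: maps audio features to logits over a learned codebook,
    then samples codes and decodes them into coarse guide poses."""
    def __init__(self, audio_dim=128, n_codes=512, pose_dim=104):
        super().__init__()
        self.to_logits = nn.Linear(audio_dim, n_codes)
        self.codebook = nn.Embedding(n_codes, pose_dim)  # code -> coarse pose

    def forward(self, audio_feats, temperature=1.0):
        # audio_feats: (B, T, audio_dim); sampling (not argmax) gives diversity
        logits = self.to_logits(audio_feats) / temperature
        probs = torch.softmax(logits, dim=-1)
        codes = torch.multinomial(probs.flatten(0, 1), 1).view(audio_feats.shape[:2])
        return self.codebook(codes)  # (B, T, pose_dim)

class MotionDenoiser(nn.Module):
    """Stage 2 stand-in: predicts the noise in a motion sequence, conditioned
    on audio features and guide poses (one DDPM epsilon estimate)."""
    def __init__(self, pose_dim=104, audio_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + audio_dim + pose_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_motion, t, audio_feats, guide_poses):
        # noisy_motion/audio_feats/guide_poses: (B, T, *); t: (B,) timesteps
        t_embed = t.view(-1, 1, 1).expand(-1, noisy_motion.shape[1], 1)
        x = torch.cat([noisy_motion, audio_feats, guide_poses, t_embed], dim=-1)
        return self.net(x)  # predicted noise, (B, T, pose_dim)

# Toy usage: one reverse-diffusion denoising call on random data.
B, T, pose_dim, audio_dim = 2, 60, 104, 128
audio = torch.randn(B, T, audio_dim)
guides = VQGuidePoseSampler()(audio)                 # stage 1: diverse coarse poses
noisy = torch.randn(B, T, pose_dim)                  # x_t at some diffusion step
t = torch.full((B,), 0.5)                            # normalized timestep
eps_hat = MotionDenoiser()(noisy, t, audio, guides)  # stage 2: predict noise
print(eps_hat.shape)  # torch.Size([2, 60, 104])
```

The intended division of labor: stochastic sampling from the codebook distribution supplies the gesture diversity that deterministic regression lacks, while the diffusion stage, conditioned on those guide poses and the audio, restores the high-frequency detail that quantization discards.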
Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard