From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (2401.01885v1)

Published 3 Jan 2024 in cs.CV

Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.

Authors (7)
  1. Evonne Ng (8 papers)
  2. Javier Romero (35 papers)
  3. Timur Bagautdinov (22 papers)
  4. Shaojie Bai (21 papers)
  5. Trevor Darrell (324 papers)
  6. Angjoo Kanazawa (84 papers)
  7. Alexander Richard (33 papers)
Citations (20)

Summary

  • The paper introduces a framework that uses audio-conditioned diffusion models and vector quantization to generate lifelike, gesture-rich avatars.
  • The methodology integrates multi-view conversational data to achieve high frame-rate rendering of full-body, facial, and hand expressions.
  • The research holds implications for improving virtual interactions in telepresence and online education, while noting open challenges around privacy and long-range conversational synthesis.

Overview of Synthesizing Full-Bodied Photorealistic Avatars

The paper presents a framework for creating full-bodied, photorealistic avatars that gesture in response to the dynamics of a dyadic (two-person) conversation, given only speech audio. This technology has the potential to improve the realism and expressiveness of digital human avatars, particularly in virtual communication settings.

The Science Behind Generating Dynamic Gestures

The method combines the sample diversity obtained from vector quantization with the high-frequency detail afforded by diffusion models. This allows the avatars to exhibit a wide range of gestures and nuanced facial expressions (such as subtle sneers and smirks) that are synchronized with the spoken dialogue. The generated motion covers not only the body but also the face and hands, produced at a high frame rate to convey intricate movements.
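The coarse-to-fine idea can be illustrated with the vector-quantization building block on its own. The sketch below is a minimal, generic nearest-neighbour quantizer in PyTorch; the codebook size, feature dimensions, and names are illustrative assumptions, not the paper's released code. Discrete codes like these are what provide sample diversity, while a diffusion model conditioned on them adds the high-frequency detail.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, code_dim) continuous pose features
        # squared distance to every codebook entry -> (batch, time, num_codes)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)        # discrete guide-pose codes
        z_q = self.codebook(idx)          # quantized features
        # straight-through estimator: copy gradients from z_q back to z
        z_q = z + (z_q - z).detach()
        return z_q, idx

# toy usage: quantize two sequences of 30 pose-feature frames
vq = VectorQuantizer()
z = torch.randn(2, 30, 64)
z_q, codes = vq(z)
print(z_q.shape, codes.shape)  # torch.Size([2, 30, 64]) torch.Size([2, 30])
```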

To support this line of research, the authors introduce a unique dataset, the first to offer multi-view conversational footage that enables photorealistic reconstruction. Experimental evaluations show that the model generates varied and fitting gestures, outperforming both diffusion-only and VQ-only baselines.

The Technology and Data

At the heart of this technology are two separate models: one for the face, leveraging an audio-conditioned diffusion model, and another for the body and hands, which uses an innovative combination of an autoregressive VQ-based method and a diffusion model. The personalized avatars are visualized through a neural renderer trained with multi-view capture data.
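As a rough illustration of that two-branch split, the skeleton below pairs an audio-conditioned face denoiser with an autoregressive prior over discrete body/hand codes. All module names, dimensions, and layer choices are placeholder assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FaceDenoiser(nn.Module):
    """Audio-conditioned diffusion denoiser over per-frame face features."""
    def __init__(self, face_dim=256, audio_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(face_dim + audio_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, face_dim),
        )

    def forward(self, noisy_face, audio_feat, t):
        # noisy_face: (B, T, face_dim), audio_feat: (B, T, audio_dim), t: (B,)
        t_feat = t.float().view(-1, 1, 1).expand(-1, noisy_face.size(1), 1)
        return self.net(torch.cat([noisy_face, audio_feat, t_feat], dim=-1))

class BodyCodePrior(nn.Module):
    """Autoregressive prior over discrete VQ codes for body/hand guide poses."""
    def __init__(self, num_codes=512, audio_dim=128, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(num_codes, hidden)
        self.rnn = nn.GRU(hidden + audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, prev_codes, audio_feat):
        # prev_codes: (B, T) int64 indices, audio_feat: (B, T, audio_dim)
        x = torch.cat([self.embed(prev_codes), audio_feat], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)   # logits over the next code at each step

# shape check with random stand-in tensors
face = FaceDenoiser()(torch.randn(2, 30, 256), torch.randn(2, 30, 128),
                      torch.randint(0, 1000, (2,)))
body = BodyCodePrior()(torch.randint(0, 512, (2, 30)), torch.randn(2, 30, 128))
print(face.shape, body.shape)  # (2, 30, 256) and (2, 30, 512)
```

In the full system described by the paper, sampled guide codes would condition a body diffusion model that fills in high-frame-rate motion, and the resulting face and body parameters would then drive the neural renderer.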

The researchers also compiled a new dataset to enable these advancements. It consists of long-form dyadic interactions spanning a broad range of emotions and conversational topics. Unlike previous datasets limited to skeletal or cartoon-like visualizations, it supports photorealistic reconstruction of the participants, capturing the subtleties of real human interaction.
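For intuition about how such time-aligned capture data might be consumed during training, here is a hypothetical windowed loader; the field names, shapes, and window length are assumptions and do not reflect the released dataset format.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DyadicWindows(Dataset):
    """Slices time-aligned audio/pose/face tracks into fixed-length windows."""
    def __init__(self, audio, body_pose, face_codes, window=90):
        # audio: (T, A), body_pose: (T, J), face_codes: (T, F), all time-aligned
        assert audio.shape[0] == body_pose.shape[0] == face_codes.shape[0]
        self.audio, self.body, self.face = audio, body_pose, face_codes
        self.window = window

    def __len__(self):
        return self.audio.shape[0] - self.window + 1

    def __getitem__(self, i):
        sl = slice(i, i + self.window)
        return self.audio[sl], self.body[sl], self.face[sl]

# toy usage with random stand-in tensors (3000 frames)
ds = DyadicWindows(torch.randn(3000, 128), torch.randn(3000, 104),
                   torch.randn(3000, 256))
audio, body, face = next(iter(DataLoader(ds, batch_size=4, shuffle=True)))
print(audio.shape, body.shape, face.shape)
```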

Implications and Applications

This technology has major implications for future virtual interaction systems. The ability to generate realistic avatars that respond naturally to audio cues can greatly enhance telepresence in applications such as virtual meetings, online education, and social VR. The released dataset and code should also spur further research into gesture generation with high-fidelity avatars, paving the way for more natural and immersive virtual experiences.

Reflecting on the Current Limitations

While the method shows promising results in generating lifelike gestures for short audio segments, it is less adept at synthesizing motion that requires an understanding of long-range conversational content. In addition, the approach currently covers only a small set of consenting subjects, which addresses privacy concerns but limits the variety of avatars that can be created. Despite these limitations, the project sets a new precedent for photorealistic interactive avatars and raises important questions about how such technology should be evaluated.