ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents (2402.08079v1)

Published 12 Feb 2024 in cs.HC

Abstract: Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs), which are predominantly animated using hand-crafted rules. While recently proposed machine-learning-based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason is the lack of a software toolkit that integrates such approaches with SIA frameworks while conforming to the demanding real-time requirements of human-agent interaction scenarios. In our work, we present, for the first time, such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavior generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs' listening behavior in real time, is publicly available. Resources, including code and behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib
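To illustrate the three-stage architecture the abstract describes (feature extraction, neural behavior generation, visualization), the following is a minimal Python sketch of such a real-time loop. It is not the ReNeLiB API: the names extract_social_cues, ListeningBehaviorModel, and send_to_renderer are hypothetical stand-ins, and the actual toolkit's extractors, model, and renderer interfaces differ.

# Minimal sketch of a three-stage real-time listening-behavior pipeline,
# mirroring the components described in the abstract. All names are
# hypothetical stand-ins, not the actual ReNeLiB API.
import time
from collections import deque

import numpy as np


def extract_social_cues(video_frame: np.ndarray, audio_chunk: np.ndarray) -> np.ndarray:
    """Stage 1 (stub): turn raw user video/audio into one multi-modal feature vector."""
    # A real system would run facial-expression and prosody extractors here.
    return np.concatenate([video_frame.mean(axis=(0, 1)),
                           [float(np.abs(audio_chunk).mean())]])


class ListeningBehaviorModel:
    """Stage 2 (stub): map a window of user features to agent expression parameters."""

    def __init__(self, window: int = 30, n_params: int = 52):
        self.window = window       # number of past feature frames the model conditions on
        self.n_params = n_params   # e.g. ARKit defines 52 facial blendshape weights
        self.buffer: deque = deque(maxlen=window)

    def step(self, features: np.ndarray) -> np.ndarray:
        self.buffer.append(features)
        # Placeholder for a neural forward pass over the buffered window.
        return np.zeros(self.n_params)


def send_to_renderer(params: np.ndarray) -> None:
    """Stage 3 (stub): forward blendshape / FLAME parameters to the agent renderer."""
    pass


if __name__ == "__main__":
    model = ListeningBehaviorModel()
    for _ in range(90):                       # ~3 seconds at 30 fps
        frame = np.random.rand(64, 64, 3)     # dummy camera frame
        audio = np.random.rand(1600)          # dummy 100 ms audio chunk at 16 kHz
        t0 = time.perf_counter()
        cues = extract_social_cues(frame, audio)
        params = model.step(cues)
        send_to_renderer(params)
        latency_ms = (time.perf_counter() - t0) * 1000
        # Real-time budget: each iteration must stay well under ~33 ms for 30 fps output.

The key design constraint the paper emphasizes is that all three stages must complete within the frame budget of the interaction; in practice this usually means running extraction, generation, and rendering as decoupled processes or threads rather than the single synchronous loop shown above.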
