ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents (2402.08079v1)
Abstract: Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs), which are predominantly animated using hand-crafted rules. While recently proposed machine learning-based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason for this is the lack of a software toolkit that integrates such approaches with SIA frameworks while conforming to the challenging real-time requirements of human-agent interaction scenarios. In our work, we present, for the first time, such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavior generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs' listening behavior in real time, is publicly available. Resources, including code and behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib
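To make the three-component architecture concrete, the following is a minimal sketch of a real-time listening-behavior loop. All class and function names (`FeatureExtractor`, `ListenerModel`, `AvatarRenderer`, `run_loop`) are illustrative placeholders, not the actual ReNeLiB API; a real deployment would back them with the toolkit's feature extractors (e.g., OpenFace/MediaPipe), a pretrained neural listener model, and a FLAME- or ARKit-based renderer.

```python
# Minimal sketch of a real-time listening-behavior pipeline, assuming the three
# components described in the abstract. Names are hypothetical stand-ins.
import time
import numpy as np


class FeatureExtractor:
    """Stand-in for multi-modal feature extraction (facial, head-pose, audio cues)."""
    def extract(self) -> np.ndarray:
        # A real implementation would read a camera/microphone frame here.
        return np.random.rand(64).astype(np.float32)


class ListenerModel:
    """Stand-in for the neural behavior generator, mapping a window of speaker
    features to listener expression parameters (e.g., FLAME or ARKit blendshapes)."""
    def generate(self, feature_window: np.ndarray) -> np.ndarray:
        # A real implementation would run a pretrained network on the window.
        return np.tanh(feature_window.mean(axis=0))[:53]


class AvatarRenderer:
    """Stand-in for the visualization component driving the interactive agent."""
    def apply(self, params: np.ndarray) -> None:
        print(f"applied {params.shape[0]} expression parameters")


def run_loop(fps: float = 25.0, window_size: int = 8, frames: int = 10) -> None:
    extractor, model, renderer = FeatureExtractor(), ListenerModel(), AvatarRenderer()
    window: list[np.ndarray] = []
    frame_budget = 1.0 / fps
    for _ in range(frames):  # a few frames for illustration
        start = time.perf_counter()
        window.append(extractor.extract())
        window = window[-window_size:]
        params = model.generate(np.stack(window))
        renderer.apply(params)
        # Real-time requirement: extraction, generation, and rendering must all
        # fit within one frame budget; sleep off any remaining time.
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, frame_budget - elapsed))


if __name__ == "__main__":
    run_loop()
```

The per-frame budget check mirrors the real-time constraint the toolkit is evaluated against: each iteration of extract-generate-render must complete within one frame interval for the agent's reactions to appear natural.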