Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues (2402.08837v1)
Abstract: Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly in enhancing the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities emerge as a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, like backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations on topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Using cues from speech prosody and language, along with the demographics of the speaker and listener, we found significant predictors of backchannel smile intensity. Based on these findings, we frame backchannel smile production in embodied agents as a generation problem. Our attention-based generative model shows that incorporating listener information improves performance over the baseline speaker-centric generation approach, and conditioning generation on the significant predictors of smile intensity yields statistically significant improvements in empirical measures of generation quality. In a user study in which generated smiles were transferred to an embodied agent, the agent with backchannel smiles was perceived as more human-like and as an attractive alternative to an agent without backchannel smiles for non-personal conversations.
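To make the generation setup concrete, below is a minimal, assumption-based PyTorch sketch of an attention-based model that maps per-frame speaker speech/language features, optionally augmented with listener information, to a sequence of backchannel-smile intensity values. The architecture, layer sizes, and names (SPEAKER_DIM, LISTENER_DIM, SmileGenerator) are hypothetical illustrations, not the authors' released implementation.

```python
# Hypothetical sketch of listener-conditioned, attention-based smile generation.
# All dimensions and module choices are assumptions for illustration only.
import torch
import torch.nn as nn

SPEAKER_DIM = 128   # per-frame speaker prosody + language embedding (assumed size)
LISTENER_DIM = 16   # listener demographic embedding (assumed size)
HIDDEN_DIM = 256


class SmileGenerator(nn.Module):
    def __init__(self, use_listener: bool = True):
        super().__init__()
        self.use_listener = use_listener
        in_dim = SPEAKER_DIM + (LISTENER_DIM if use_listener else 0)
        self.encoder = nn.GRU(in_dim, HIDDEN_DIM, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * HIDDEN_DIM, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(2 * HIDDEN_DIM, HIDDEN_DIM, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, 1)  # frame-wise smile intensity

    def forward(self, speaker_feats, listener_feats=None):
        # speaker_feats: (B, T, SPEAKER_DIM); listener_feats: (B, LISTENER_DIM)
        if self.use_listener and listener_feats is not None:
            # Broadcast the static listener embedding across all time steps.
            listener_seq = listener_feats.unsqueeze(1).expand(-1, speaker_feats.size(1), -1)
            x = torch.cat([speaker_feats, listener_seq], dim=-1)
        else:
            x = speaker_feats
        enc, _ = self.encoder(x)           # (B, T, 2*HIDDEN_DIM)
        ctx, _ = self.attn(enc, enc, enc)  # attention over the speaker context
        dec, _ = self.decoder(ctx)         # (B, T, HIDDEN_DIM)
        return self.head(dec).squeeze(-1)  # (B, T) predicted smile intensity


if __name__ == "__main__":
    # Compare a speaker-only baseline with the listener-conditioned variant.
    model = SmileGenerator(use_listener=True)
    speaker = torch.randn(2, 50, SPEAKER_DIM)
    listener = torch.randn(2, LISTENER_DIM)
    intensity = model(speaker, listener)
    print(intensity.shape)  # torch.Size([2, 50])
```

In this framing, the speaker-centric baseline corresponds to `use_listener=False`, and the paper's reported gains from listener information amount to conditioning the generator on such listener features.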