M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews (2404.03312v1)
Abstract: Accurate utterance classification in motivational interviews is crucial to automatically understand the quality and dynamics of client-therapist interaction, and it can serve as a key input for systems mediating such interactions. Motivational interviews exhibit three important characteristics. First, there are two distinct roles, namely client and therapist. Second, they are often highly emotionally charged, which can be expressed both in text and in prosody. Finally, context is of central importance for classifying any given utterance. Previous work has not adequately incorporated all of these characteristics into utterance classification approaches for mental health dialogues. In contrast, we present M3TCM, a Multi-modal, Multi-task Context Model for utterance classification. Our approach is the first to employ multi-task learning to effectively model both joint and individual components of therapist and client behaviour. Furthermore, M3TCM integrates information from the text and speech modalities as well as the conversation context. With our novel approach, we outperform the state of the art for utterance classification on the recently introduced AnnoMI dataset, with relative improvements of 20% for client and 15% for therapist utterance classification. In extensive ablation studies, we quantify the improvement resulting from each contribution.
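The combination described in the abstract (a component shared by both roles, role-specific components, text and speech inputs, and conversation context) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the class name `MultiTaskSketch`, the feature dimensions, the class counts, the random untrained weights, and the choice of simple concatenation to fuse modalities and context are all hypothetical. The abstract does not specify the encoders, the fusion method, or the training objective.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def make_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weight matrix, a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class MultiTaskSketch:
    """Hypothetical shared layer with one classification head per speaker role."""

    def __init__(self, text_dim: int, audio_dim: int, context_size: int,
                 hidden_dim: int, n_client: int, n_therapist: int,
                 seed: int = 0) -> None:
        rng = random.Random(seed)
        # Assumption: text and audio features of the current utterance and of
        # `context_size` preceding utterances are fused by concatenation.
        self.input_dim = (text_dim + audio_dim) * (context_size + 1)
        # Joint component: parameters shared by both tasks.
        self.shared = make_matrix(hidden_dim, self.input_dim, rng)
        # Individual components: one output head per role.
        self.heads = {
            "client": make_matrix(n_client, hidden_dim, rng),
            "therapist": make_matrix(n_therapist, hidden_dim, rng),
        }

    def forward(self, text_feats: Sequence[Vector],
                audio_feats: Sequence[Vector], role: str) -> Vector:
        """Return class probabilities for the last utterance in the window.

        text_feats and audio_feats hold one vector per utterance, ordered
        from the oldest context utterance to the current utterance.
        """
        fused: Vector = []
        for t, a in zip(text_feats, audio_feats):
            fused.extend(t)
            fused.extend(a)
        if len(fused) != self.input_dim:
            raise ValueError("unexpected feature size for this model")
        hidden = [math.tanh(h) for h in matvec(self.shared, fused)]
        return softmax(matvec(self.heads[role], hidden))


# Usage with made-up feature vectors: two context utterances plus the current one.
model = MultiTaskSketch(text_dim=4, audio_dim=3, context_size=2,
                        hidden_dim=8, n_client=3, n_therapist=4)
text = [[0.1, 0.2, 0.3, 0.4]] * 3
audio = [[0.5, 0.1, 0.0]] * 3
client_probs = model.forward(text, audio, role="client")
therapist_probs = model.forward(text, audio, role="therapist")
```

In this sketch both roles pass through the same shared layer, so training on one task would also update the representation used by the other, while each head remains free to specialise on its own label set. This is the general idea of hard parameter sharing in multi-task learning, not a claim about how M3TCM divides its parameters.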
- Sayed Muddashir Hossain
- Jan Alexandersson
- Philipp Müller