Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders (2403.15170v1)
Abstract: Self-supervised learning (SSL) has been investigated as a way to generate task-agnostic representations across various domains. However, no such investigation has been conducted for detecting multiple mental disorders. The rationale for the existence of a task-agnostic representation lies in the overlapping symptoms among multiple mental disorders; consequently, the behavioural data collected for mental health assessment may carry a mixture of attributes related to multiple disorders. Motivated by this, we explore a task-agnostic representation derived through SSL in the context of detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) using audio and video data collected during interactive sessions. This study employs SSL models trained by predicting multiple fixed targets or masked frames. We propose a list of fixed targets to make the generated representation more effective for detecting MDD and PTSD. Furthermore, we modify the hyper-parameters of the SSL encoder predicting fixed targets to generate global representations that capture varying temporal contexts. Both innovations are found to improve detection performance for the considered mental disorders and to exhibit task-agnostic traits. The global representations generated by the SSL model predicting masked frames also exhibit task-agnostic traits.
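To make the pretraining scheme named in the abstract concrete, the PyTorch sketch below illustrates a fixed-target SSL encoder of the kind described (a PASE-style setup with per-frame regression "workers"). All dimensions, the target names (mfcc, prosody, fbank), and the architecture itself are illustrative assumptions, not the paper's proposed target list or implementation.

```python
import torch
import torch.nn as nn

class FixedTargetSSL(nn.Module):
    """Sketch of an SSL encoder pretrained to predict multiple fixed
    targets per frame. Target names and dimensions are placeholders,
    not the target list proposed in the paper."""

    def __init__(self, n_feats=80, d_model=256, kernel=5, targets=None):
        super().__init__()
        targets = targets or {"mfcc": 13, "prosody": 4, "fbank": 40}
        # 1-D conv encoder; the kernel size sets the temporal context each
        # output frame sees -- the kind of hyper-parameter the abstract
        # describes modifying to capture varying temporal contexts.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, d_model, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        # One small regression head ("worker") per fixed target.
        self.heads = nn.ModuleDict(
            {name: nn.Conv1d(d_model, dim, 1) for name, dim in targets.items()}
        )

    def forward(self, x):
        # x: (batch, time, n_feats); Conv1d expects (batch, channels, time).
        h = self.encoder(x.transpose(1, 2))
        preds = {k: head(h).transpose(1, 2) for k, head in self.heads.items()}
        global_rep = h.mean(dim=2)  # mean-pool over time -> global representation
        return preds, global_rep

def pretraining_loss(preds, target_frames):
    # Sum of per-target L1 reconstruction losses over time-aligned targets.
    return sum((preds[k] - target_frames[k]).abs().mean() for k in preds)
```

After pretraining, global_rep would feed a downstream MDD or PTSD classifier; widening the kernel (or adding stride or dilation) changes the temporal context the global representation captures, mirroring the hyper-parameter modification described in the abstract. The masked-frame variant would instead zero out random input frames and train the encoder to reconstruct them.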